Thesis
6G Networks
I, Hoai Nam Chu, declare that this thesis is submitted in fulfilment of the require-
ments for the award of Doctor of Philosophy, in the Faculty of Engineering and
Information Technology at the University of Technology Sydney.
Signature:
Date: February 22, 2024
ABSTRACT
6G Networks
by
Hoai Nam Chu
The above results demonstrate the great potential of advanced machine learning
in addressing the emerging issues of 6G and enabling new applications/services. As
future work, one may look into the applications of Generative AI to 6G and how
to design 6G systems to enable Generative AI as a service.
Acknowledgements
I would like to thank all my colleagues and friends at the University of Technology
Sydney for their support, discussion, and friendship. Additionally, I extend my
thanks to the SEDE admin team for efficiently handling all the paperwork and
forms during my PhD study.
I would like to express my heartfelt gratitude to my family and friends for their
endless love and unwavering support, which gives me the strength to overcome
life’s difficulties. I am especially grateful to my beloved wife, Thao Pham, and
our wonderful son, Viet Chu (aka. Beau). My parents and my sister also deserve
heartfelt thanks for their belief in my abilities and constant motivation. Their
support, understanding, and encouragement throughout this journey have been the
cornerstone of my success. Thank you all from the bottom of my heart.
Contents
Abstract ii
Acknowledgments iv
Table of Contents v
List of Publications xi
Abbreviation xvii
1.2.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . 13
1.2.2.2 Contributions . . . . . . . . . . . . . . . . . . . . . . 19
1.2.3.2 Contributions . . . . . . . . . . . . . . . . . . . . . . 24
1.2.4.2 Contributions . . . . . . . . . . . . . . . . . . . . . . 30
2 Background 33
2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.2 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.2 Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Bibliography 193
List of Publications
Journal Papers
J-7. Hai M. Nguyen, Nam H. Chu, Diep N. Nguyen, Dinh Thai Hoang, Minh
Hoàng Hà, Eryk Dutkiewicz, “Optimal Privacy Preserving in Wireless Fed-
erated Learning over Mobile Edge Computing,” submitted to IEEE/ACM
Transactions on Networking, under review.
J-8. Thai T. Vu, Nam H. Chu, Khoa T. Phan, Dinh Thai Hoang, Diep N. Nguyen,
and Eryk Dutkiewicz, “Energy-based Proportional Fairness in Cooperative
Edge Computing,” submitted to IEEE Transactions on Mobile Computing,
under review.
Conference Papers
Book Chapters
B-2 Diep N. Nguyen, Nam H. Chu, Dinh Thai Hoang, Octavia A. Dobre, Dusit
Niyato, and Petar Popovski, “Generative AI for Communications Systems:
Fundamentals, Applications, and Prospects,” John Wiley & Sons, in produc-
tion.
List of Figures
3.4 (a) Source MDP, (b) the first target MDP, and (c) the second target
MDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.1 The ICAS system model in which the ICAS-AV maintains a data
communication with AVX based on IEEE 802.11ad. At the same
time, the ICAS-AV senses its surrounding environment by utilizing
echoes of its transmitted waveforms. . . . . . . . . . . . . . . . . . . 87
4.4 Varying the data arrival rate λ under normal channel condition, i.e.,
pnc = [0.2, 0.6, 0.2], with the weight vector W1 = [0.05, 0.4, 0.5]. . . . . 107
4.5 Varying the data arrival rate λ under normal channel condition, i.e.,
pnc = [0.2, 0.6, 0.2], with the weight vector W2 = [0.025, 0.8, 0.5]. . . . 108
4.6 Varying the data arrival rate λ under poor channel condition, i.e.,
ppc = [0.6, 0.2, 0.2], with the weight vector W1 = [0.05, 0.4, 0.5]. . . . . 109
4.7 Varying the data arrival rate λ under poor channel condition, i.e.,
ppc = [0.6, 0.2, 0.2], with the weight vector W2 = [0.025, 0.8, 0.5]. . . . 110
4.8 Varying the data arrival rate λ under strong channel condition, i.e.,
pgc = [0.2, 0.2, 0.6], with the weight vector W1 = [0.05, 0.4, 0.5]. . . . . 111
4.9 Varying the data arrival rate λ under strong channel condition, i.e.,
pgc = [0.2, 0.2, 0.6], with the weight vector W2 = [0.025, 0.8, 0.5]. . . . 112
5.7 The acceptance probability per class when varying the immediate
reward of class-3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.4 The maximum achievable AmB rate when varying (a) θ0 and (b) the
transmitter-to-receiver link SNR, i.e., αdt . . . . . . . . . . . . . . . . . 172
6.5 The upper bound of the expected number of guesses, i.e., E[G(X)],
vs. the message splitting ratio β. . . . . . . . . . . . . . . . . . . . . 174
6.7 Varying (a) the transmitter-receiver SNR αdt and (b) the
tag-receiver SNR αbt . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.9 Reliability of learning process with different training datasets’ sizes. . 180
Chapter 1
This chapter first overviews the specifications of the sixth-generation (6G) networks
with its emerging services and challenges. Then, the existing solutions for handling
these problems are comprehensively reviewed. After that, this chapter highlights
the main contributions of this thesis. Finally, the thesis structure is provided.
1.1 Motivations
Over the past decade, wireless communication networks have experienced re-
markable growth. While the fourth generation (4G) can only offer typical download
speeds of around 100 Mbps, the latest generation of cellular network, i.e., 5G, is
expected to deliver multi-Gbps peak data rates (such as 20 Gbps downlink and
10 Gbps uplink) and ultra-low latency communications (with delays as low as one
millisecond) [1–3]. Notably, the fifth-generation (5G) shifts the focus from data rate-
enabler for emerging applications, e.g., Internet of Things (IoT) and smart cities.
However, existing 5G systems have thus far not delivered the promised revolution.
difficult to meet the 5G expectation rate [1]. Moreover, new services, ranging from
fit well the 5G focus of facilitating small packet and sensing based services [1]. For
Table 1.1 : Comparison of technology standards for 5G, Beyond 5G, and 6G [1].

Application types. 5G: eMBB, URLLC, and mMTC. 6G: new applications, including MBRLLC, mURLLC, HCS, and MPS.
Device types. 5G: smartphones, sensors, and drones. 6G: sensors and DII devices, CRAS, XR and BCI equipment, and smart implants.
Rate requirements. 5G: 1 Gb/s. 6G: 1 Tb/s.
End-to-end delay. 5G: 5 ms. 6G: <1 ms.
Radio-only delay. 5G: 100 ns. 6G: 10 ns.
Processing delay. 5G: 100 ns. 6G: 10 ns.
End-to-end reliability. 5G: 99.999%. 6G: 99.99999%.
Frequency bands. 5G: sub-6 GHz and mmWave for fixed access. 6G: sub-6 GHz, mmWave for mobile access, THz bands (above 300 GHz), and non-RF (e.g., optical and VLC).
example, extended reality (XR)-based services (especially with the rise of Meta-
intentionally use large packets [1, 2]. Additionally, they demand wireless com-
munications that can sustain high data rates (Terabit per second) for both up and
latency) [2, 3]. To address these issues and facilitate emerging services, a disruptive
6G is required.
cial Intelligence (AI), mmWave and even higher frequency bands, unmanned aerial
tering for energy saving) are expected to be deployed [2], as illustrated in Figure 1.1.
By doing so, 6G can offer various new use cases, such as Metaverse/holographic
Figure 1.1 : 6G’s enabling technologies (e.g., ambient backscatter communications, network automation, and UAVs) and emerging use cases (e.g., autonomous cyber-physical systems and Metaverse/holographic teleportation) [1–3].
tems [1–3]. The integration of these technologies, on the one hand, provides an op-
However, on the other hand, it also introduces new challenges due to the heightened
munications systems since they can provide seamless coverage as well as extend the
even unavailable. Thus, NTNs offer promising solutions for collecting data from IoT
devices in such areas, especially UAV-based solutions, due to their flexibility and low
cost. In particular, when UAVs act as on-demand flying access points (APs), thanks
to their aerial superiority, they can establish good line-of-sight (LoS) links for the
IoT nodes. In remote areas without access to terrestrial infrastructures, UAVs can
provide a much more economical solution to collect IoT data than other approaches,
in NTNs. Due to their flexibility, mobility, and low operational cost, UAVs have been deployed as flying APs in some real-world projects, e.g., Google’s Loon and
Facebook’s Aquila [4,5]. However, there are still some challenges that hinder the ap-
solutions for collecting IoT data (e.g., deploying fixed APs), UAVs have limited
energy resources supplied by batteries. When the UAVs’ batteries are depleted,
they must replenish their energy by flying back to the charging stations to charge
or replace their batteries. It is worth noting that given a fixed working duration,
the more time the energy replenishment process takes, the less time the UAVs can
spend collecting IoT data. Moreover, the energy replenishment process is
highly dynamic since it depends on the distance between the UAV and the charging
stations. Therefore, optimizing energy usage and the energy replenishment process
Moreover, the UAVs often fly around to collect IoT data, while IoT nodes are stati-
cally allocated over different zones, and their sensing data are random depending on
zones to maximize data collection efficiency is another major challenge that needs
to be considered.
ICAS also emerges as a promising solution for Autonomous Vehicles (AVs), a use
case of 6G [2], where sensing and data communications are two important functions
that often operate simultaneously. The sensing function enables AVs to detect ob-
jects around them and estimate their distance and velocity for safety management
(e.g., collision avoidance) or for other use cases of 6G, such as Metaverse/holographic
[Figure: Overview of the thesis, in which advanced machine learning for 6G networks is applied to UAV-assisted data collection in non-terrestrial networks (Chapter 3, [J-1]) and to joint communication and sensing with ubiquitous sensors (Chapter 4, [J-2]).]
tion with other AVs or infrastructure via Internet of Vehicles (IoV). For example,
they can send/receive safety messages and even their own raw sensing data (e.g.,
traffic data around the AV) for applications such as transportation safety, trans-
portation monitoring, and user services distributed to the AVs [6]. Although auto-
motive sensing and vehicular communication can share many commonalities (e.g.,
signal processing algorithms and the system architecture [7]), they are typically de-
ity, and radio spectrum resources. These challenges can be effectively addressed
by combining both communication and sensing functions into a unified system, i.e.,
ICAS, which can offer efficient spectrum sharing mechanisms to avoid interference
and coexist within a transmitter or other users. However, optimizing the wave-
form structure is one of the most challenging tasks due to strong influences between
Thirdly, 6G will support a variety of emerging services and use cases with differ-
ent performance metrics, such as throughput, latency, and reliability. It also needs
to provide a consistent and satisfactory quality of service and user experience for
all the users and applications. The Metaverse has been attracting increasing attention from academia and industry in recent years, thanks to the recent
advances in technologies (e.g., extended reality and edge intelligence) along with
great efforts of many big corporations such as Facebook [8] and Microsoft [9]. The
Metaverse is expected to bring a new revolution to the digital world [10]. Unlike
existing virtual worlds (e.g., Second Life and Roblox), where the users’ representations
(e.g., avatars/characters) and assets are limited in specific worlds, the Metaverse can
world in the Metaverse can be created for a certain application, such as entertain-
ment, education, and healthcare. Similar to our real lives, Metaverse users can bring
their assets from one virtual world to another while preserving their values, and vice
versa. Moreover, the Metaverse is expected to further integrate digital and phys-
ical worlds, e.g., digitizing the physical environment by the digital twin [12]. For
example, in the Metaverse, we can create our virtual objects, such as outfits and
paintings, and then bring them to any virtual world to share or trade with others.
We can also share virtual copies of a real object in different virtual worlds. Thus,
the Metaverse will bring totally new experiences that can change many aspects of our
resource management for the underlying infrastructure one of the biggest challenges
with diverse requirements and sensitivities, exposing new surfaces for cyber attacks.
of the most common types of wireless attacks. To perform the attack, the eaves-
dropper usually stays close to the victim system to “wiretap” the legitimate wire-
less channel and acquire exchanged information. Since the eavesdropper operates
passively without introducing noise or altering transmit signals, detecting and pre-
tion and transportation layers [16]. Nevertheless, these approaches face several issues
crypt the encrypted data, especially with the recent advances in quantum comput-
defeat many cryptographic schemes, even very robust ones [16, 18].
tographic keys [19]. However, this approach often requires prior information of the
domains, such as search engines, speech recognition, medical diagnosis, and com-
bilities (e.g., trainable radios), optimize network resources for enhanced perfor-
machine learning approaches (e.g., deep learning and reinforcement learning) usu-
ally require a large amount of high-quality data for the training process [20]. This
requirement makes them less efficient in practice when data is expensive and/or con-
tains noise due to the wireless environment’s dynamics and uncertainty. Thus, this
• How can UAVs be dynamically and optimally controlled for speed and energy
works, considering the limitations of UAV energy capacity and the uncertainty
and how does this address the intricate relationship between these functions?
This section begins by overviewing current studies in tackling the above prob-
lems. Then, it highlights gaps in the literature. Finally, this section emphasizes the
In the literature, several works study the UAV’s energy replenishment process for
prolonging the UAV serving time [21–25]. To minimize the age of information (AoI) under
the constraint of the UAV’s charging rate and battery capacity, the authors in [21]
based trajectory. They also point out that the UAV’s charging rate has much more
influence on the lower bound of AoI than the UAV’s battery capacity. The study in [22]
aims to minimize the data collection time by optimizing the UAV trajectory and the
order of IoT devices that the UAV is going to visit. Specifically, they employ a deep
and a Q-learning based scheduler to determine the order of visiting positions where
the IoT devices or charging stations are located. Furthermore, a transfer learning
of the proposed transfer learning technique is not well investigated. Similarly, the
study in [23] aims to minimize the total time that the UAVs need to collect data from
the task, they can return to a charging station for charging. The authors first use
the Gaussian mixture model to group IoT nodes into different clusters and formulate
reinforcement learning (DRL) approaches are proposed to find the optimal policy
In [24], the authors consider a UAV that is wirelessly charged during the
data collection task. They first formulate the problem as a Markov decision process
(MDP), then propose a Q-learning algorithm to maximize the energy efficiency and
framework to provide security for IoT data collection networks. A charging coin
is introduced to reward UAVs when they successfully collect IoT data. Then, the
UAVs can use collected coins to recharge their batteries at a charging station. In
addition, they develop an adaptive linear prediction model to reduce the number of
All of the above works [21–25] assume that the UAVs always fly at a constant
speed during the data collection process. However, in practice, a UAV can choose
different speeds during its data collection process depending on its surrounding
environment. Moreover, the UAV’s speed can strongly influence the system’s
efficiency because it has a substantial impact on energy consumption during the data
collection process [26]. Thus, optimally controlling UAVs’ speed can significantly
improve the energy usage and data collection efficiency of the system, especially
in UAV-assisted IoT data collection networks where UAVs have limited battery
capacities. Unfortunately, this important factor is not investigated in all the above
studies.
Notably, only a few works investigate the speed control problem for UAV-assisted
IoT data collection networks [27–30]. In [27], the authors aim to minimize the flight
time for a data collection task by jointly optimizing the UAV’s speed, data collection
duration, and the IoT devices’ transmit power. Their numerical results show that
the UAV’s optimal speed depends on the distance between sensors, sensors’ energy,
and the data upload requirements. In [28] and [29], the authors aim to maximize
the data collection efficiency by controlling UAV’s speed according to the IoT device
density. In particular, the authors in [28] first introduce an analytical model for the
transmission between the UAV and IoT nodes, then the UAV’s speed is optimized
based on this model. In [29], the authors reveal a tradeoff between system through-
put and IoT devices’ energy efficiency. By optimizing the UAV’s speed, altitude, as
well as the MAC layer frame length, the balance between these two conflicting factors can be achieved.
All the above works (i.e., [27–29]) apply the conventional optimization theories,
which statically optimize the UAV’s speed during the IoT data collection process.
Therefore, their algorithms need to be rerun whenever the environment changes,
tions are inefficient in addressing the high dimensional state space as that in the
the complete information about the surrounding environment is unknown, like what
we consider in this work (e.g., packet arrival probabilities for the whole network
and flight time for replacing the battery). In this context, reinforcement learning
emerges as the best approach to address the high dynamics and uncertainty of the
environment since it can help the UAV adapt its behavior according to the environ-
ment’s changes. In [30], the authors employ deep Q-learning to control the UAV’s
speed during its data collection task, where the UAV can also wirelessly charge the
IoT devices while collecting their data. This work aims to minimize the data packet
loss by selecting the best devices to be charged and interrogated, together with the
optimal UAV’s speed. Their simulation results show that the UAV’s speed is pro-
portional to the number of IoT devices and inversely proportional to the data queue
lengths of IoT devices. Similar to [27–29], the study in [30] does not consider the im-
pacts of UAV’s energy consumption and energy replenishment processes during the
data collection task. It is worth highlighting that the energy replenishment process
is a critical factor that cannot be ignored since the UAVs’ energy is limited.
It can be observed that all of the aforementioned works do not jointly optimize
are among the most important factors to achieve high efficiency in terms of energy
and data collection for UAV-assisted IoT data collection networks. In addition,
for UAV-based collectors (i.e., [22–24, 30]) rely on conventional Q-learning or deep
This problem can make the learning process unstable [31]. In addition, the work
in [22] applies transfer learning to speed up the learning process of the proposed
DRL algorithm. However, the impacts of transfer learning are not well investigated.
Note that transfer learning does not always improve the learning process and can even have negative impacts on it [32]. Furthermore, the study in [22] does not
consider the UAV’s speed control, one of the most important factors influencing the
decisions of energy replenishment. To fill these gaps, this study develops a highly
efficient solution based on deep reinforcement transfer learning for UAV-assisted IoT
data collection networks. Specifically, our proposed approach can effectively address
the overestimation problem and stabilize the learning process by adopting recent ad-
vanced techniques in RL, including deep Q-learning [33], deep double Q-learning [31],
and dueling neural network architecture [34]. In addition, the proposed solution can
simultaneously optimize the UAV’s speed and energy replenishment processes and
UAVs. Thus, leveraging the transfer learning technique can improve the learning
quality and reduce the learning time, thereby leading to a decrease of computa-
on UAVs.
1.2.1.2 Contributions
Given the above, to jointly optimize the speed control and battery replacement
activities for a considered UAV under the dynamics and uncertainty of the IoT data col-
lection process, we propose a dynamic decision solution leveraging the Markov de-
cision process (MDP) framework. This framework allows the UAV to make optimal
decisions (regarding the flying speed and battery replacement activities) based on
can be used to find the optimal policy for the UAV, its convergence rate is slow,
especially in a highly complex problem like the one considered here, where
we need to jointly optimize the speed and energy replenishment activities for the
problems when estimating action values, especially for complicated problems with
hybrid actions like those considered here (i.e., speed selection and energy
ble Q-learning (D3QL) to address these challenges. The key ideas of D3QL are to
(1) separately and simultaneously estimate the state values and action advantages,
making the learning process more stable [34], and (2) address the overestimation
by using two estimators (e.g., deep neural networks), resulting in the stability of
To further reduce the learning time and enhance the learning quality, we develop
transfer learning techniques to allow the UAV to learn more knowledge from other
UAVs learning in similar environments. In addition, these techniques also help the
UAV leverage knowledge obtained from different environments to improve its policy,
making our solution more applicable and scalable. Therefore, our proposed solution
results demonstrate that our proposed solution, i.e., D3QL with transfer learning
(D3QL-TL), can simultaneously optimize the energy usage and data collection, and
thereby achieve the best performance compared with other methods. To the best of
our knowledge, this is the first study investigating a UAV operation control approach
taking the dynamics of the IoT data collection, energy limitation, and impact of the
• We propose a novel framework that allows the UAV to jointly optimize its
flying speed and battery replacement activities under the dynamic and un-
this framework can not only allow the UAV to dynamically and automatically
deep Q-learning, deep double Q-learning, and dueling neural network archi-
• To reduce the learning time and improve the learning quality for the UAV, we
develop advanced transfer learning techniques that allow UAVs to “share” and
making our approach more scalable and applicable in practice, e.g., scenarios
proposed approaches and reveal critical elements that can significantly impact
Currently, two standards operating at 5.9 GHz for vehicular communication net-
works are C-ITS based on IEEE 802.11bd in Europe [35] and DSRC based on IEEE
802.11p in the U.S. [36]. Unfortunately, their data rates (i.e., up to 27 Mbps) do
not meet the requirements of AVs’ applications. For example, precise navigation
that needs to download a high definition three-dimension map and raw sensor data
exchange between AVs to support fully automated driving may require connections
in ICAS systems operating at sub-6 GHz is limited due to the bandwidth availabil-
ity [38]. In this context, millimeter wave (mmWave), whose frequency is from 30
GHz to 300 GHz, has been emerging as a promising solution to address the above
challenges in ICAS systems [39]. First, owing to the high-resolution sensing and
Radar (LRR) [40]. Second, an mmWave system, e.g., a wireless local area network
(WLAN) operating at the 60 GHz band, can provide a very high data rate to meet
However, several challenges are hindering the applications of mmWave ICAS sys-
tems in AVs. In particular, unlike the conventional approaches where sensing and
communication are separated, the ICAS-AV leverages a single waveform for both
sensing and communication functions. Thus, it needs to jointly optimize these two
sensing for ICAS systems. In addition, since the ICAS operates while ICAS-AVs are
moving, the surrounding environments of AVs are highly dynamic and uncertain.
by wireless environments than those of the sub-6 GHz bands [41]. Therefore, the
ical challenge that needs to be addressed. To that end, mmWave ICAS systems
are demanding an effective and flexible solution that can not only jointly optimize
communication and sensing functions but also adaptively handle the highly dynamic
high data rate communication links (given the highly directional mmWave commu-
nications) and sensing accuracy, e.g., low target miss detection probability and low
waveforms for ICAS systems [38, 42–46]. In [42] and [43], the authors exploit a
single IEEE 802.11ad data communication frame to provide the sensing function.
Specifically, the authors in [42] propose to use the preamble of the Single Carrier
The simulation results show that this approach can achieve a data rate of up to
1 Gbps with high accuracy in target detection and range estimation. However,
the velocity estimation is poor because the preamble is short. Particularly, the
proposed approach achieves the desired velocity accuracy (i.e., 0.1 m/s) only when
the Signal-to-Noise Ratio (SNR) is high, i.e., greater than 28 dB. In [43], the authors
aim to overcome this issue by using the IEEE 802.11ad Control Physical Layer (C-
PHY) frame that has a longer preamble than that of IEEE 802.11ad SC-PHY.
However, it is still not large enough to improve the velocity estimation, whereas the
data rate is only 27.5 Mbps, significantly lower than the desired data rate for AV’s
communication [37]. These results from the above studies (i.e., [42] and [43]) suggest
that single-frame processing is unable to satisfy the desired velocity estimation
for ICAS systems to improve sensing information extracted from targets’ echoes,
e.g., [38, 44–46]. The authors in [44] propose velocity estimation algorithms that
Their results demonstrate that the proposed solution can achieve the desired velocity
accuracy of AVs (i.e., 0.1 m/s [40]) when the number of frames is greater than 20. In
[45], the authors develop a similar multi-frame processing method to embed sensing
(V2I) scenario. By doing so, they can reduce the beam training time of 802.11ad up
to 83%. Instead of using the 802.11ad standard, the authors in [46] propose an ICAS
a bi-static automotive ICAS system in which the sensing area is extended to non-
in this work, the maximum communication data rate is only up to 0.1 Mbps, which
Recently, reinforcement learning has been leveraged to address the dynamic and
ICAS [47], spectrum sensing [48], and anti-jamming in wireless sensor networks [49].
In particular, the authors in [47] consider the ICAS system that is the combination
algorithm to optimally allocate resource blocks (i.e., a tuple of beam, channel, and
[47] is that it does not consider the dynamics and uncertainty in the data arrival
process and wireless channel, which are addressed in our framework. In practice,
these dynamics and uncertainties are very important and cannot be ignored since
autonomous vehicles have high mobility, and the transmission demand of users and
the AV system varies over time. Moreover, the system in [47], which requires separate beams and channels (i.e., frequencies) for communications and sensing, is not as
spectrum-efficient as the ICAS system considered in our work that leverages the
A common drawback in the above studies (i.e., [44–46]) is that the waveform
structures (e.g., number of frames in CPI) are not optimized. Instead, these param-
eters are manually set. In practice, the dynamics and uncertainty of the ICAS’s envi-
ronment (e.g., SNR and data arrival rate) can significantly influence the ICAS’s data
and timely adapting the selected structure to the dynamics of the surrounding en-
vironment play vital roles. To address this problem, in [38], the authors propose an
adaptive virtual waveform design for mmWave ICAS based on the 802.11ad standard
to achieve the optimal waveform structure (i.e., number of frames in CPI) that can
balance between communication and sensing performance. The results show that
given a fixed length of CPI, increasing the number of frames in CPI can increase the
sensing performance, but it will degrade the communication performance (i.e., data
rate). However, this approach requires complete information about the surrounding
their proposed solution needs to be rerun from scratch if there is any change in the
environment.
In addition, none of the above studies (i.e., [38, 42–46]) considers the dynamics and uncertainty of the information and environment, e.g., the changes of the
wireless channel quality and the arrival rate of data that need to be transmitted via
ICAS. This problem is critical to the performance of the ICAS system because the
ular, the rapid change of the wireless channel quality (e.g., SNR) highly impacts the
ICAS’s communication efficiency (i.e., packet loss due to transmission failure) and
sensing performance (i.e., target detection and targets’ range and speed estimation
accuracy). The problem is even more critical for mmWave systems that are highly
(e.g., navigation and automated driving). When the data arrival rate at the AV’s
ICAS system is higher than its maximum transmission rate, data starts to pile up
in a data queue/buffer. Since a data queue/buffer size is always limited, packet loss
will occur when the queue is full. This problem can cause serious issues for AVs
as they cannot communicate with other AVs and infrastructure. Given the above,
jointly optimize both sensing and communication performance but also effectively
deal with the dynamics and uncertainty of the surrounding environment. However, to
the best of our knowledge, this approach has not been investigated in the literature.
1.2.2.2 Contributions
To fill this gap, this thesis aims to propose a novel framework to maximize the
under the dynamics and uncertainty of the surrounding environment when the AV is
moving. It is worth noting that since the sensing processing for the 802.11ad-based
ICAS is well-investigated in [38, 42–45], this study only focuses on addressing the
waveform structure optimization problem for mmWave ICAS AVs under the dy-
the problem as a Markov Decision Process (MDP) because it can allow the AV
to determine the optimal waveform (e.g., the number of frames in CPI) based on
its current observation (e.g., channel state and number of data packets in the data
queue). Then, we adopt the Q-learning algorithm, which is widely used in Reinforce-
ment Learning (RL) due to its simplicity and convergence guarantee, to help the
ICAS-AV gradually learn the optimal policy via interactions with the surrounding
environment. However, Q-learning may face the curse of dimensionality and overes-
timation problems that lead to a low convergence rate and an unstable learning process
when the state space is large [31]. In our case, the state space that consists of all
requires fast learning to promptly respond to the high dynamics and uncertainty
algorithm based on the most recent advances in RL, namely i-ICS, to deal with
these problems. First, i-ICS addresses the high dimensional state space problem by
utilizing a deep neural network (DNN) to estimate the values of states [33]. Second,
the overestimation is handled by using the deep double Q-learning [31]. Finally, the
learning process is further stabilized and accelerated by leveraging the dueling neu-
ral network architecture that separately and simultaneously estimates the advantage
values and state values [34]. Our major contributions are as follows.
• Design a novel framework by which the ICAS-AV can dynamically and au-
tomatically optimize the waveform structure under the highly dynamic and
cation efficiency and sensing accuracy, thereby maximizing the ICAS’s perfor-
mance.
and dueling neural network architecture, that can help the ICAS-AV quickly
solution under different scenarios and reveal key factors that can significantly
that is impeding the deployment of the Metaverse [10]. To fulfil the Quality-of-
Service (QoS) and meet the user experience requirements in the Metaverse, it de-
mands enormous resources that may have never been seen before. First, Metaverse
cation can host a hundred thousand users simultaneously. For example, the peak
number of concurrent players of Counter-Strike: Global Offensive was more than one million in 2021 [50]. It is forecast that network data usage may expand more than 20 times due to the operation of the Metaverse [51]. Second, unlike the cur-
rent online platforms (e.g., massive multiplayer online role-playing games where the
uplink throughput can be much lower than the downlink throughput [52]), the
Metaverse requires extremely-high throughput for both uplink and downlink trans-
mission links. The reason is that Metaverse users can create their digital objects
and then share/trade them via this innovative platform. Therefore, to maintain the
Quality-of-Experience (QoE) for users, the Metaverse’s demand for resources (e.g.,
computing, networking, and storage) likely exceeds that of any existing massive
More importantly, the required resource types are highly correlated. In partic-
ular, the Extended Reality (XR) technology is believed to be fully integrated into
Metaverse’s applications such that users can interact with virtual and physical ob-
jects via their digital avatars, e.g., digital twin [10]. Therefore, it requires not only
data collected from perceived networks, e.g., the Internet of Things (IoT), but also
intensive resources are required not only to contain and operate Metaverse applica-
tions but also to support massive data forwarding over networks. Given the above,
size of resources of different types and the correlations between these types.
In this context, although deploying the Metaverse on the cloud is a possible so-
lution, it leads to several challenges. First, the cloud is often located in a physical
of users connect at once. Second, since users come from around the world, a huge
results in high delay, which severely impacts the Metaverse since the delay is one
of the crucial drivers of user experience [53]. In this context, multi-tier resource
cation capabilities are distributed along the path from end-users to the cloud, is a
In the literature, there are only a few attempts to investigate the Metaverse
ing resource allocation for a single-edge computing architecture that has limited
computing resources to allocate for some nearby Metaverse users. Similarly, in [55]
and [56], resource allocation at the edge is considered, but more resource types, i.e.,
Virtual Reality (VR) services between end-users and VR service providers. In [56],
the authors address the stochastic demand problem for an application of education
method to minimize the cost for the virtual service provider. Unlike the above works,
perception networks (e.g., IoT) that are used to collect data for the Metaverse.
It can be observed that none of the above studies considers the multi-tier com-
puting architecture for resource allocation problem. Instead, their approaches are
as analyzed above, due to the extremely high resource demands of Metaverse ap-
plications, the single-tier computing resource model may not be appropriate and
tee optimal performance for Metaverse applications. The available resources at the
edge (i.e., near end-users) are often much lower than those of the cloud and may
not satisfy the intensive resource requirements of Metaverse applications [58]. This
can lead to high computational latency or even service disruption due to the lack of resources. Second, all resources in the single-tier model are concentrated in one place, possibly resulting in a single point of failure, and thus this architecture has limited scalability and flexibility. Third, the single-tier edge
such, if the user moves far away from that location, the system cannot guarantee a
good Quality of Experience (QoE) [58]. Given the above, the single-tier architecture
cations may share some common functions. For example, a digital map is indeed a
functions among applications has already been made. For instance, Google Map’s
map, check-in, display live data synching with location [59]) that can be shared
among many applications, e.g., Pokemon Go [60], Wooorld [61], and CNN iReport
Map [62]. This special feature of the Metaverse can indeed be leveraged to max-
imize resource utilization for Metaverse applications. However, none of the above
works can exploit the similarity among applications to improve resource utilization
for the Metaverse. Moreover, in practice, users join and leave the Metaverse at any
time, leading to high uncertainty and dynamics of resource demands. Among the
aforementioned works, only the study in [56] addresses the stochastic demands of
allocation and could not leverage the similarities among Metaverse applications’
for an effective and comprehensive solution for the Metaverse to handle not only the
massive resource usage but also the dynamic and uncertain resource demand.
1.2.3.2 Contributions
resources, and thereby maximizing the whole system performance. Firstly, we intro-
duce the idea of decomposing an application into multiple functions to facilitate the
tier in the system according to functions’ requirements and tiers’ available resources.
For example, functions with low latency requirements can be placed at a low tier
(e.g., tier-1), while those with low update frequency can be placed at a higher tier.
By doing so, the application decomposition can not only provide a flexible solution
for deploying Metaverse applications but also utilize all networks’ resources from
into a MetaInstance, and the common functions will be shared among these appli-
cations instead of creating one for each application. Therefore, this technique can
save more resources. Finally, to address the uncertainty and dynamics of resource
with a reinforcement learning algorithm that can automatically find out an optimal
Markov decision process that can capture the high dynamics and uncertainty
admission control policy for the Metaverse without requiring the complete
proposed framework but also to gain insights into the key factors that can
Communications
theless, these approaches possess several issues that significantly limit their practical
making them less practical or even infeasible for resource-limited devices such as IoT
devices [63]. Secondly, distributing and managing cryptographic keys also require
tional capacity can decrypt the encrypted data, especially with the recent advances
eavesdropper can defeat many cryptographic schemes, even those with very robust
disrupt eavesdroppers’ signal reception [65–67]. However, this approach may not
always ensure a positive secrecy rate, which is the difference in capacity between the
legitimate channel (from the transmitter to the intended receiver) and the “tapped”
channel (from the transmitter to the eavesdropper). Moreover, the generated inter-
ferences may also severely degrade signal receptions at neighboring legitimate de-
mode, relays can also inject noise (i.e., jamming signals) to confuse eavesdroppers
in the relays’ coverages. Whereas in the cooperative beamforming mode, these re-
the signal strength at eavesdroppers. The main drawback of friendly jamming- and
(e.g., energy and computing) for generating jamming signals and performing beam-
forming.
Given the above, this thesis leverages the cooperative Ambient Backscatter
(AmB) communications in which the transmitter is equipped with an AmB tag that
can backscatter ambient radio signals to convey information to the receiver [69]. In
the literature, the AmB communications have been well investigated in various as-
pects, such as hardware design [70], performance improvement [71,72], power reduc-
tion [73, 74], ambient backscatter-based applications [75, 76], and security [77–79].
However, utilizing AmB technology to counter eavesdropping attacks has not yet
been well investigated. For example, the works in [77, 78] investigate the security
and reliability of AmB-based networks with imperfect hardware elements, while the
authors in [79] analyze the application of physical layer security for AmB commu-
nications. Unlike these works in the literature, our proposed solution exploits the
message is split into two parts: (i) the active message transmitted by the transmitter
using conventional active transmissions and (ii) the AmB message transmitted by
the backscatter tag using AmB transmissions. Then, the receiver reconstructs the
original message based on both active and AmB messages. Note that the AmB tag
operates in a passive manner, meaning that it does not actively transmit signals.
Instead, it utilizes the active signals from the transmitter to backscatter the AmB
message without requiring additional power. In this way, the AmB message can
be transmitted on the same frequency and at the same time as the active message.
However, the AmB signal strength at the receiver is significantly lower than that of
the active signal. Hence, the AmB signal can be considered as pseudo background
noise for the active signal [63, 80]. As such, without knowledge about the system
in advance (i.e., the settings of AmB transmission), the eavesdropper is not even
aware of the existence of the AmB message. Without accessing the AmB message,
the values of resistors and capacitors in the AmB circuit are different for different
backscatter rates. As such, even if the eavesdropper deploys the AmB circuit but
does not know the exact backscatter rate, it still cannot decode the backscattered signals [70]. In addition, in the worst case in which the eavesdropper knows the exact
backscatter rate in advance, it still does not know how to reconstruct the original mes-
sage based on the active and AmB messages since the message encoding technique
solution in the worst case, this thesis considers the guessing entropy metric [81]. It
is worth noting that due to the lower rate of AmB transmission compared to that
of the active transmission, the AmB message’s size is smaller than that of the ac-
tive message. Thus, in a system with sufficient computing and energy capacities for
encryption/decryption, the AmB message can be used to carry the encryption key,
ploy MLK, which may require complex mathematical models and perfect CSI to
achieve high detection performance [82, 83]. Thus, this approach introduces signif-
detect the backscattered signals. The rationale behind this is that DL has the ability to learn directly from data (e.g., received signals), eliminating the need for complex
mathematical models and perfect CSI. Note that in the literature, several works
consider using DL for signal detection and channel estimation [84–88]. However,
most of them consider conventional signals (e.g., OFDM, BPSK, and QAM64 sig-
nals), and thus they may not perform well for very weak signals like AmB signals. In
particular, the authors in [84, 85] consider DL-based detectors for OFDM signals.
In contrast, the studies in [86] and [87] consider the signal classification task, i.e., pre-
dicting the modulation type of signals. Unlike the above works, in [88], the authors
propose an AmB signal detector based on the Long Short-Term Memory (LSTM)
architecture. Since LSTM requires more computing capability than that of the con-
ventional architecture, e.g., fully connected architecture [20], it does not fit well in
poses a new detector with low complexity elements, such as tanh activation, and a
few small-size fully connected hidden layers. By doing so, our DL-based detector
offers a more practical and efficient solution for backscatter signal detection at the
receiver.
Although DL-based detectors can achieve good detection performance, they usu-
ally require a large amount of high-quality data (i.e., a collection of received signals)
for the training process [89]. This makes DL less efficient in practice when data
is expensive and/or contains noise due to the wireless environment’s dynamics and
uncertainty. For example, new objects (e.g., the passing of a bus) can significantly
impact wireless channel conditions, even changing links from line-of-sight (LOS) to
retrained from scratch with newly collected data, and thus it is a time-consuming
task [90]. In this context, meta-learning (i.e., learning how to learn) emerges as a
promising approach to quickly learn a new task with limited training data [91].
1.2.4.2 Contributions
gating eavesdropping attacks. They also show that the proposed DL-based signal
detector, without requiring perfect CSI, can attain a comparable Bit Error Ratio
mathematical model and perfect CSI. The main contributions of this work are:
tering right on the transmit signals. This solution is expected to open a new
• Develop the DL-based detector to detect the AmB signals at the receiver to
mance in new environments with little knowledge. The main idea of meta-
to reduce the size of the necessary training dataset, while still preserving the
quality of learning.
and robustness. We also analyze the security level of our proposed framework
based on the guessing entropy for the worst case when the eavesdropper has
particular, Sections 2.1 and 2.2 provide a background of deep learning (DL)
learning (DRL) and advanced machine learning techniques, i.e., transfer learn-
ing and meta learning, are discussed in Sections 2.2.3 and 2.3, respectively.
• Chapter 3: This chapter discusses our proposed framework that allows a UAV
to jointly optimize its flying speed and battery replacement activities under
the dynamic and uncertainty of data collection and energy replenishment pro-
cesses. Specifically, the system model is described in Section 3.1. Section 3.2
results are then discussed in Section 3.4. Finally, conclusions are given in
Section 3.5.
• Chapter 4: This chapter presents our proposed solution that can automati-
In particular, Sections 4.1 and 4.2 introduce the ICAS system model and the
posed i-ICS algorithm is presented in Section 4.3. In Section 4.4, simulation
that, the Metaverse application analysis and the proposed deep reinforcement
tion results are analyzed. Finally, Section 5.5 provides the conclusion of this
work.
Our proposed anti-eavesdropping system and the channel model are discussed
in Sections 6.1 and 6.2. Then, Sections 6.3 and 6.4 present the MLK-based
detector and our proposed DL-based detector for the AmB signal, respectively.
Section 6.6 discusses our simulation results. Finally, Section 6.7 wraps up our
• Chapter 7: This chapter draws the conclusions and highlights future research
directions.
Chapter 2
Background
Machine learning (ML) is poised to play a pivotal role in the evolution and op-
heterogeneity, and dynamism that these networks are expected to exhibit. The en-
but not limited to, ultra-massive MIMO (Multiple Input Multiple Output) systems,
terahertz (THz) communications, and dense networks of small cells. All of these
may struggle to address effectively. Conventional solutions, which often rely on pre-
defined parameters, typically fall short in dynamically adapting to the rapidly changing
to its ability to learn from data, predict outcomes, and adapt in real-time, ML
allocation, and improve service delivery through predictive analytics and intelligent
decision-making processes. This adaptability is crucial for managing the intricate in-
terplay between network elements in 6G, ensuring optimal performance, and meeting
the high expectations for speed, reliability, and latency that define next-generation
wireless networks.
Given the above, this thesis aims to develop advanced machine learning-based
solutions for addressing various challenges in 6G networks. In this chapter, the fun-
damentals of deep learning are first presented. Then, the advanced machine learning
Deep learning (DL) is an area in machine learning in which neural networks are
trained from a vast amount of data to perform some tasks automatically, e.g., clas-
sification and prediction. Although the concept of neural networks has been around
for decades, recent advances in computing power, availability of large datasets, and
gence of interest and remarkable progress in DL. Nowadays, DL has been success-
fully applied in many applications supporting our daily lives, ranging from face and
is not only a time-consuming task but also may not be able to capture complex
learn hierarchical representations from raw data, thus eliminating the need for
directly.
ing, DL is able to effectively handle unstructured data (e.g., images and audio)
struggle with big data due to computational limitations and memory con-
of data, making them well-suited for big data tasks, such as optimizing large-
• Model Reusable: In DL, trained models (i.e., DNNs) can be partially or fully
classify signal modulations (e.g., BPSK, QPSK, and 64QAM) [93]. Such a
often require a dataset trained from the previous learning as well as the new
dataset [94].
In DL, a deep neural network (DNN) and a learning algorithm are two requisite
components. Inspired by the structure and function of the biological neuron net-
work in the human brain, a DNN organizes neurons in multiple layers (hence the
word “deep”), and each neuron can connect to one or more neurons, as shown in
performs a mathematical operation on its input and then passes the result through
and bias) to minimize a loss function L(f (d; θ), ψ). Generally, a loss function cap-
tures the error between the DNN’s output f (d; θ) and the ground truth ψ, as follows:
$$\min_{\theta} \; \mathbb{E}\big[ L(f(d;\theta), \psi) \big], \qquad (2.1)$$
where d is the input vector. There are many types of activation functions (e.g.,
Figure 2.1 : An example of a Deep Neural Network, a feed-forward neural network (FNN), consisting of an input layer, multiple hidden layers, and an output layer of interconnected neurons.
sigmoid, ReLU, and Tanh) and loss functions (e.g., the cross-entropy loss, mean
squared error loss, and Hinge loss) that can be used depending on specific prob-
lems [95].
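To make the notation above concrete, the following minimal Python (NumPy) sketch builds a tiny FNN f(d; θ) with one ReLU hidden layer and a softmax output, and evaluates the cross-entropy loss L(f(d; θ), ψ) against a one-hot ground truth ψ. The layer sizes, random weights, and input are illustrative assumptions rather than settings used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not thesis parameters).
n_in, n_hidden, n_out = 8, 16, 4

# DNN parameters theta: weights and biases of two fully connected layers.
theta = {
    "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
    "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out),
}

def forward(d, theta):
    """f(d; theta): one ReLU hidden layer followed by a softmax output layer."""
    h = np.maximum(0.0, d @ theta["W1"] + theta["b1"])          # ReLU activation
    logits = h @ theta["W2"] + theta["b2"]
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))   # numerically stable softmax
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(prediction, psi):
    """L(f(d; theta), psi): cross-entropy between the DNN output and the one-hot label."""
    return -np.sum(psi * np.log(prediction + 1e-12), axis=-1)

d = rng.normal(size=n_in)        # input vector d
psi = np.eye(n_out)[2]           # ground truth psi (one-hot encoding of class 2)
print("loss:", cross_entropy(forward(d, theta), psi))
```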
The loss function L(f(d; θ), ψ) can be minimized by Gradient Descent (GD), which is the fundamental algorithm for minimizing deep learning loss functions. However, GD computes the gradient and cost function for all data points at each time step, yielding a very high
processing time for large data. To that end, this thesis proposes to use the Stochastic
Gradient Descent (SGD) that only needs to compute gradients and cost function
for a mini-batch of data, i.e., b, sampled uniformly from the dataset. By doing so,
SGD can accelerate the convergence rate while still guaranteeing the convergence of
learning [97]. Thus, SGD (presented in Algorithm 2.1) is widely used to optimize the
DNN parameters due to the simplicity in implementation while still achieving good performance. Specifically, at each time step t, a mini-batch bt is sampled uniformly from the training dataset. Then, the cost function Jt is computed as follows:
$$J_t = \frac{1}{|b_t|} \sum_{(d,\psi) \in b_t} L\big(f(d; \theta_t), \psi\big). \qquad (2.2)$$

The DNN parameters are then updated in the direction opposite to the gradient of the cost function:

$$\theta_{t+1} = \theta_t - \gamma_t \nabla_{\theta_t} J_t, \qquad (2.3)$$

where γt is the step size controlling how much the parameters are updated, and ∇θt(·) is the gradient operator with respect to the DNN parameters θt.
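As a concrete illustration of the mini-batch SGD procedure above, the short Python (NumPy) sketch below trains a single linear layer with a squared-error loss: at each step it samples a mini-batch bt, computes the cost Jt as in (2.2), and updates θ in the opposite direction of the gradient as in (2.3). The synthetic dataset, batch size, and step size are illustrative assumptions; for a full DNN, the gradient would be obtained via backpropagation, as described next.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training set (illustrative assumption): inputs d and targets psi
# generated from a noisy linear map, so the gradient has a simple closed form.
N, n_in = 1000, 5
D = rng.normal(size=(N, n_in))
true_theta = rng.normal(size=n_in)
Psi = D @ true_theta + 0.01 * rng.normal(size=N)

theta = np.zeros(n_in)                  # model parameters (a single linear layer here)
gamma, batch_size, T = 0.05, 32, 500    # step size gamma_t, |b_t|, number of SGD steps

for t in range(T):
    # Sample a mini-batch b_t uniformly from the training dataset.
    idx = rng.choice(N, size=batch_size, replace=False)
    d_b, psi_b = D[idx], Psi[idx]

    # Cost J_t: average squared-error loss over the mini-batch, as in (2.2).
    err = d_b @ theta - psi_b
    J_t = np.mean(err ** 2)

    # Gradient of J_t w.r.t. theta and the SGD update of (2.3).
    grad = 2.0 * d_b.T @ err / batch_size
    theta = theta - gamma * grad

print("final cost:", J_t, "| parameter error:", np.linalg.norm(theta - true_theta))
```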
often uses the backpropagation algorithm. To do that, it calculates the output error
(i.e., a difference between predicted and actual values) when passing data from the
input to the output layers, i.e., forward pass. Then, this output error is propagated
backward through the network layer (i.e., back pass) to compute gradients. Through
the training process, DL can learn complex patterns in the dataset, making them
Note that the above discussion pertains to Feed Forward Networks (FNNs),
one of the most fundamental and widely used DNN architectures. Besides FNNs,
there are various types of DNNs (such as Recurrent Neural Networks (RNNs),
tailored to different types of data and tasks. Among existing DNN architectures,
the FNNs are the simplest and have the lowest inference latency [95]. Thus, this
thesis leverages the typical FNNs that are simple and suitable to be implemented
in wireless systems, which can well operate in rapidly changing environments with
learning, where the model is provided with labelled data, and unsupervised learning,
which deals with unlabelled data, RL operates in situations where no training data is available in advance. As a result, in RL, an agent needs to interact with its surrounding
take actions and interact with the environment. Once an action is taken, the agent
receives the immediate reward and observes the subsequent state of the environ-
ment, as shown in Figure 2.2. Then, the agent uses these observations to learn and derive the optimal policy. In this way, RL is a powerful approach for training agents to make intelligent decisions in dynamic and complex environments, demonstrating its poten-
tial for solving real-world problems. In RL, underlying environments are typically modeled by Markov decision processes (MDPs). Generally, an MDP is described by four elements: (i) S ≜ {s} denoting the state space,
Figure 2.2 : The interaction between the agent and the environment in RL: following its policy, the agent takes an action, then observes the next state of the environment and receives an immediate reward.
(ii) A ≜ {a} referring to the action space, (iii) Pa denoting the transition proba-
bility from the current state s to the next state s′ by taking action a, and (iv) ra denoting the immediate reward received after taking action a at state s. A policy π is a mapping from the state space to the action space, i.e., π : S → A, indicating
the decisions made by the agent. The MDP aims to get the optimal policy π ∗ that
Note that the MDP divides time into equal time slots, and decisions are made at every time slot. As such, the MDP may work inefficiently in systems with stringent
in addition to the state space S and action space A, an SMDP is also characterized by a decision epoch ti (i.e., points in time at which decisions are made) and T,
which refers to the state transition probabilities and also captures the duration of
time spent in each state [98]. While a conventional MDP makes decisions at every
time slot, an SMDP makes decisions whenever an event occurs, making it more
2.2.2 Q-learning
In RL, the value of state s at any time step t under policy π : S → A is calculated as follows:

$$V_\pi(s) = \mathbb{E}_\pi\left[ \sum_{m=0}^{\infty} \eta^m r_{t+m} \,\Big|\, s_t = s \right], \quad \forall s \in \mathcal{S}, \qquad (2.4)$$
where Eπ [·] is the expectation under policy π, and η is the discount factor that
indicates the importance of future rewards. Similarly, the state-action value function
under policy π evaluates how good it is to perform action a at state s and then follow policy π, i.e.,

$$Q_\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{m=0}^{\infty} \eta^m r_{t+m} \,\Big|\, s_t = s, a_t = a \right]. \qquad (2.5)$$
Under the optimal policy π ∗ , the optimal state-action value function, i.e., Q∗ (s, a),
is given by [99]:
$$Q^*(s, a) = \mathbb{E}\left[ r_t + \eta \max_{a' \in \mathcal{A}} Q^*(s_{t+1}, a') \,\Big|\, s_t = s, a_t = a \right]. \qquad (2.6)$$
Thus, once Q∗(s, a) is obtained, the optimal policy is achieved by taking actions that maximize Q∗(s, a) at each state. The Q-learning algorithm [100] was introduced to iteratively estimate this optimal state-action value function Q∗(s, a), also known as the optimal Q-function. Since then, Q-learning has been one of the most widely used algorithms because it is guaranteed to converge to the optimal policy after the learning process [100]. The details of the Q-learning algorithm are presented in Algorithm 2.2. In Q-learning, the learned values are stored in a table, namely the Q-table, making its implementation simple. Each cell of the Q-table
keeps the estimated value of Q-function (named Q-value) for taking action a at state
s, denoted by Q(s, a). The Q-table is iteratively updated based on interactions with
4: The agent observes next state st+1 and reward rt , then updates Q(st , at )
by (2.7) and reduces ϵ.
5: end for
the surrounding environment. Given action at is selected under the ϵ-policy (2.9) at
state st at time t, the agent obtains the immediate reward rt and observes a next
Q(st , at ) ← Q(st , at ) + αt [ rt (st , at ) + η max_{at+1} Q(st+1 , at+1 ) − Q(st , at ) ],    (2.7)

in which rt (st , at ) + η max_{at+1} Q(st+1 , at+1 ) is the target Q-value Yt , and the whole term inside the brackets is the temporal difference (TD),
where αt is the learning rate that controls how much the new knowledge (i.e., the TD)
contributes to the update of the Q-function. By iteratively updating the Q-function
using (2.7) with a learning rate αt that satisfies (2.8), it is proven that Q(s, a) will converge to Q∗ (s, a) [100].
αt ∈ [0, 1),   Σ_{t=1}^{∞} αt = ∞,   and   Σ_{t=1}^{∞} (αt )² < ∞.    (2.8)
However, Q-learning faces the curse-of-dimensionality problem that results in a long time for learning, especially for challenging problems
in 6G networks that often pose high-dimensional state spaces, e.g., hundreds of thousands
of states. In addition, the dynamics and uncertainties of the environment
(e.g., the probability of frame loss, the packet arrival rate, and unprecedented re-
source demand) make it more challenging for the agent to achieve an optimal policy.
To that end, we will introduce the deep reinforcement learning (DRL) approach that can
effectively address this problem to quickly achieve the optimal operation policy for
the considered systems. A representative DRL method is the
DQN algorithm that uses DNNs as function approximators to handle large and continuous
state spaces. In particular, tabular Q-learning is only applicable
when values of states are discrete, but the state in many problems can consist of
real numbers. DRL addresses this limitation by leveraging DNNs to approximate the Q-function.
However, both DQN and Q-learning have the problem of overestimation when
estimating Q-values [31]. This issue makes the learning process unstable or may even
lead to divergence, especially in problems with high-dimensional states [102]. To that end, this thesis introduces a DRL algorithm, namely Deep Du-
eling Double Q-learning, that can effectively address the above issues by adopting
three innovative techniques in RL, including (i) Deep Q-Network (DQN) [33], (ii) Dou-
ble Deep Q-learning (DDQ) [31], and (iii) the dueling architecture [34]. First, a DNN
is used to approximate the Q-function, so that large and continuous state spaces can be handled.
Figure 2.3 : The learning framework of the proposed DRL approach, consisting of an interaction loop (the agent selects actions via the ϵ-greedy policy and stores experiences in the buffer B) and a learning loop (mini-batches sampled from B are fed to the Q-network, and the loss between the estimated Q-values and the Q-target is used to update the network parameters).
Second, the overestimation problem in Q-learning can be overcome by using DDQ, which separates the action selection
and action evaluation processes instead of combining them as in Q-learning [31].
Finally, the learning process is stabilized by adopting the dueling neural network
architecture where the state value function and advantage value function are esti-
mated separately and simultaneously [34]. In this way, our proposed approach can
inherit all advantages of these techniques, thereby stabilizing the learning process and improving the convergence rate.
The details of Deep Dueling Double Q-learning (D3QL) are provided in Algo-
rithm 2.3. Suppose that the learning phase consists of T time steps. At time step t,
the agent observes the current state st and takes action at according to the ϵ-policy.
After that, it observes a next state st+1 and gets a reward rt . This experience data,
represented by a tuple (st , at , st+1 , rt ), should not be used directly to train the DNN
since the consecutive experiences are highly correlated, which may lead to a slow
convergence rate, instability, or even divergence [33, 103]. As such, the memory replay
mechanism is adopted, in which experiences are stored in a replay buffer B, as shown in
Figure 2.3. Then, at each time step, experiences are sampled uniformly at random
to train the DNN. By doing so, correlations among experiences can be removed,
thereby accelerating the learning process. Moreover, as one data point can be used
multiple times to train the DNN, this mechanism can indeed improve the data usage
efficiency.
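The memory replay mechanism described above can be implemented very compactly. The following is a minimal sketch with an illustrative capacity and batch size (not values taken from the thesis), using uniform sampling as discussed.

```python
import random
from collections import deque

# A minimal uniform experience replay buffer. Capacity and batch size are
# illustrative values, not the ones used in the thesis.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the correlation between consecutive experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```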
We define the input and output layers of the DNN according to the state- and action-
space dimensions, respectively. Specifically, feeding a state s to the DNN will return
Q-values for all actions at this state, each given by a neuron at the DNN’s output
layer. To improve the stability and increase the convergence rate, we propose to
Figure 2.4 : The dueling neural network architecture, in which the input layer feeds a value stream and an advantage stream that are aggregated to produce the Q-values used to select the optimal action.
use the state-of-the-art dueling neural network architecture [34] for D3QL’s DNN,
as shown in Figure 2.4. In particular, the dueling architecture divides the DNN into
two streams. The first one estimates the state-value function V(s), which indicates
the value of being at a state s. The second stream estimates the advantage function
D(s, a), which measures the relative benefit of taking each action a at state s.
Recall that the state-action value function Q(s, a) (namely Q-function) expresses
the value of taking an action a at state s, i.e., Q-value. Thus, the advantage function
under policy π can be given as Dπ (s, a) = Qπ (s, a)−V π (s) [34]. Then, we can obtain
the estimated Q-function for feeding state s to the Deep Dueling Neural Network
(DDNN) by:

Q(s, a; ζ, β) = V(s; ζ) + D(s, a; β),    (2.11)
where ζ and β are the parameters of the state-value and advantage streams, respec-
tively. It can be observed that, given Q, the values of V and D cannot be determined uniquely.
As such, using (2.11) directly may result in a poor performance of the algorithm.
Therefore, similar to [34], we propose to use the following output of the advantage
stream:
Q(s, a; ζ, β) = V(s; ζ) + ( D(s, a; β) − max_{a′∈A} D(s, a′ ; β) ).    (2.12)
In this way, when the agent selects the action a∗s = argmax_{a′∈A} D(s, a′ ; β),
the value Q(s, a∗s ; ζ, β) is forced to equal V(s; ζ). However, (2.12) still faces an issue, i.e.,
the estimated Q-values change at the same speed as the advantage of the selected
action, which may reduce the stability of the learning process. To
address this issue, the max operator is replaced by the mean as follows [34]:
Q(s, a; ζ, β) = V(s; ζ) + ( D(s, a; β) − (1/|A|) Σ_{a′∈A} D(s, a′ ; β) ).    (2.13)
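The aggregation in (2.13) is only a few lines of code. The sketch below illustrates it with NumPy arrays; the value and advantage inputs are illustrative placeholders standing in for the outputs of the two network streams.

```python
import numpy as np

# Sketch of the dueling aggregation in (2.13): the value stream V(s) and the
# advantage stream D(s, a) are combined, with the mean advantage subtracted so
# that V and D become identifiable. Inputs are illustrative.
def dueling_q_values(value, advantages):
    # value: scalar V(s; zeta); advantages: array of D(s, a; beta) over all actions
    return value + (advantages - advantages.mean())

# Example with three actions:
q = dueling_q_values(value=1.2, advantages=np.array([0.5, -0.1, 0.3]))
```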
The root of the overestimation problem in Q-learning and DQN comes from the
max operation when estimating the target Q-value at time t as follows [31]:

Yt = rt + η max_{a′∈A} Q(st+1 , a′ ; ϕt ).    (2.14)

To handle this issue, we adopt the deep double Q-learning algorithm [31] that leverages
two identical deep dueling neural networks. One deep dueling neural network Q is
for action selection, namely the Q-network, and the other Q̂ is for action evaluation,
namely the target Q-network. The target Q-value is then computed by:

Yt = rt + η Q̂( st+1 , argmax_{a′∈A} Q(st+1 , a′ ; ϕt ); ϕ−t ),    (2.15)

where ϕ and ϕ− are the parameters of the Q-network and target Q-network, respec-
tively.
As the aim of training the Q-network is to minimize the gap between the target
Q-value and the current estimated Q-value, the loss function at time t is given by [33]
Lt (ϕt ) = E_{(s,a,r,s′)} [ ( Yt − Q(s, a; ϕt ) )² ],    (2.16)
where E[·] is the expectation over data points (s, a, r, s′ ) sampled from the buffer B.
To minimize the loss function Lt (ϕt ), this thesis proposes to use Stochastic Gra-
dient Descent (SGD), discussed in Section 2.1. Recall that SGD is one of the most
popular algorithms for minimizing deep learning loss functions due to its simplicity and effectiveness.
It is worth mentioning that even though the target Q-value Yt in (2.16) looks
like labels that are used in the supervised learning, Yt is not fixed before starting
the learning process. Moreover, it changes at the same rate as that of the target
network's parameters, which can make the training unstable. To that end, instead of
updating the parameters of Q̂ at every time step, ϕ− is only updated after a fixed
number of time steps by copying the Q-network's parameters, i.e., ϕ− ← ϕ.
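The following Python sketch ties together the double Q-learning target (2.15), the loss (2.16), and the periodic target-network synchronization described above. The `q_net` and `target_net` callables and the synchronization snippet are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

# Sketch of the D3QL target and loss computation: action selection uses the
# online Q-network, evaluation uses the target network (double Q-learning,
# (2.15)), and the squared error (2.16) is then minimised by SGD. q_net and
# target_net are assumed callables returning per-action Q-values for a batch.
def d3ql_targets_and_loss(q_net, target_net, batch, eta=0.9):
    states, actions, rewards, next_states = batch
    best_actions = np.argmax(q_net(next_states), axis=1)              # selection: Q-network
    q_eval = target_net(next_states)                                  # evaluation: target network
    targets = rewards + eta * q_eval[np.arange(len(rewards)), best_actions]   # Y_t in (2.15)

    q_pred = q_net(states)[np.arange(len(actions)), actions]
    loss = np.mean((targets - q_pred) ** 2)                           # loss (2.16), minimised by SGD
    return targets, loss

# The target network's parameters phi^- are synchronised with the Q-network's
# parameters phi only once every fixed number of training steps, e.g.:
#   if step % sync_period == 0:
#       target_net_params = q_net_params.copy()
```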
It is also important to note that the value of ϵ in the D3QL controls the environ-
ment exploration. Specifically, the higher the value of ϵ is, the more frequently the
agent takes a random action. The reason for decreasing ϵ stems from the fact that an
RL agent does not have complete information about its environment in advance. As
such, in the beginning, the agent should explore its environment by taking actions
randomly. By doing so, it can obtain information about the environment via the
feedback of its actions, e.g., corresponding rewards and next states. Then, the agent
adjusts its policy according to these experiences, i.e., the state, action, reward, and
the next state. Therefore, to converge to an optimal policy, the agent should take
actions based on its policy more frequently than random actions as learning progresses [99],
which is achieved by gradually decreasing ϵ over time.
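A simple decay schedule suffices to realize this explore-then-exploit behaviour. The sketch below uses an exponential decay with an illustrative rate and floor value, which are assumptions rather than the values used in the thesis.

```python
# A simple epsilon-decay schedule consistent with the discussion above: start
# with pure exploration and gradually shift towards exploiting the learned
# policy. The decay rate and floor are illustrative.
def decayed_epsilon(step, eps_start=1.0, eps_min=0.01, decay=0.9995):
    return max(eps_min, eps_start * (decay ** step))
```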
In RL, if a linear function approximates the value function, the learning process
is guaranteed to converge to the optimal policy. However, when a non-linear function (e.g., a neural network) is used instead, it may not converge to the
optimal one [104]. In our proposed learning algorithms, we adopt two innovative
techniques (i.e., the dueling architecture and double Q-learning) and three trans-
fer learning approaches to stabilize the learning process and improve the learning
quality, thereby improving the convergence rate. Thus, even though the optimal-
ity of our deep reinforcement learning algorithm, i.e., D3QL, could not be proven
theoretically, the intensive simulation results show that D3QL obtains a stable and superior performance compared with conventional approaches.
We now discuss the complexities of D3QL, which mainly depend on the training
process of the Q-network. In the Q-network, there is one input layer Li , one hidden
layer Lh , and two output layers Lv and La corresponding to the state-value and
advantage streams, respectively. The computation of the Q-network is mainly
carried out with matrix multiplications. Thus, the complexity of feeding a mini-batch
with size Sb to the Q-network is O( Sb ( |Li ||Lh | + |Lh ||Lv | + |Lh ||La | ) ), where |·| is
the layer's size, i.e., the number of neurons in a layer. Given that the training process
takes T iterations, the complexity of the proposed algorithm is O( T Sb ( |Li ||Lh | +
|Lh ||Lv | + |Lh ||La | ) ).
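To illustrate the per-mini-batch term of this complexity expression, the small sketch below counts the multiplications of one forward pass. The layer sizes are hypothetical examples, not the architecture used in the thesis.

```python
# Illustrative count of the multiplications in one forward pass of the dueling
# Q-network, following the complexity expression above. Layer sizes are
# hypothetical examples only.
def forward_cost(batch_size, n_in, n_hidden, n_value, n_adv):
    return batch_size * (n_in * n_hidden + n_hidden * n_value + n_hidden * n_adv)

# e.g., a mini-batch of 32, 10 inputs, 64 hidden neurons, 1 value output,
# and 5 advantage outputs:
print(forward_cost(32, 10, 64, 1, 5))   # 32768 multiplications
```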
Note that the training complexity of a DNN with a simple architecture is much lower than that of
one with a complex architecture. However, the DNN in our proposed algorithms
contains only a few (e.g., four) layers, in which only one is fully connected (i.e.,
the hidden layer). Therefore, the decision time (i.e., the inference time of the Q-network)
remains low enough for deployment on resource-constrained platforms. In this context, techniques
like model pruning, model compression, and quantization become essential to navi-
gate the limited computing and energy resources of such devices [105]. A few deep learning applications have been deployed in AVs, such as
Tesla Autopilot [106] and ALVINN [107]. Thus, they are clear evidence of the effec-
tiveness of deep learning in practical systems.

2.3 Transfer Learning and Meta Learning

Although many efforts have been made to reduce the computational complexity of machine learning, such as deep learning and
deep reinforcement learning, these techniques still face several shortcomings. First, the training process of DNN often requires a huge
amount of collected data for the training process to achieve good performance [89].
This makes DL less efficient in practice, especially when data is expensive and
contains noise due to the environment's dynamics and uncertainty, as in 6G networks.
Second, DL models may face the over-fitting problem, i.e., performing excellently on training datasets but
very poorly on test datasets, if the training dataset does not contain enough samples
to represent all possible scenarios. Due to the dynamic nature of the wireless envi-
ronment, the channel conditions may vary significantly over time. For example, a
moving bus may change channel conditions from LoS to NLoS and vice versa. Con-
sequently, real-time data may greatly differ from training data, making these above
pre-trained models perform poorly in practice. In fact, the performance of deployed
DNNs hinges on several key factors that reflect the dynamic nature of wireless environments,
such as deployment and configuration changes, variations in traffic patterns and user mo-
bility, and the resulting performance degradation. Third, wireless channel conditions can also be
very different at different areas due to their landscapes, so the DL model trained
at one area may not perform well at other areas. As such, different sites may need
training different models from scratch, which is time-consuming and costly [90]. In
this context, transfer learning and meta learning emerge as promising solutions to address these challenges.

2.3.1 Transfer Learning

Transfer learning (TL) leverages knowledge obtained from a source task in a source domain to enhance the learning process of target tasks in target domains. In other words, a model trained on a source
task, called the source domain, is adapted or fine-tuned to work on a target task,
known as the target domain.
known as the target domain. The idea behind transfer learning is to transfer the
knowledge gained from the source domain to the target domain, thereby reducing
the need for large amounts of data in the target domain and potentially improving the learning performance.
Typically, a domain contains labelled or unlabelled data given before the considered
training process starts. However, data in RL is obtained via interactions
between the agent and its surrounding environment.
and task can be represented by an MDP, as shown in Figure 2.5. Note that transfer
learning for DRL may look like supervised learning since they both use existing
data, but they are very different. In particular, all DRL data used to train a DNN
are unlabelled and on-the-fly data generated by interactions between an agent and
its surrounding environment. Although the source data are collected in advance for
the agent, they are just observations of the agent about the source environment,
which do not have any label to indicate which action the agent should take. Thus,
the agent in the target domain still needs learning algorithms (e.g., DQN) to learn the optimal policy gradually.
To measure the effectiveness of transfer learning, we can use three metrics, including
jump-start, asymptotic performance, and time-to-threshold. In par-
ticular, jump-start measures how much the agent's performance at the beginning
of the learning process can be improved by applying TL, while the asymptotic per-
formance measures this improvement at the end of the learning process. The third
metric, i.e., time-to-threshold, measures how fast TL can help the agent achieve a
predefined performance level compared with the scenario without TL. It is worth
noting that these metrics can be derived from the agent's learning
curve. Notably, TL may even negatively impact the learning in the target domain
if the transferred knowledge is not carefully chosen. Thus, Chapter 3 will explore
the effectiveness of transfer learning that allows UAVs to “share” and “transfer” their learned knowledge.
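The three TL metrics above can be computed directly from learning curves. The sketch below is a minimal illustration; the averaging window and threshold are illustrative parameters, not values prescribed by the thesis.

```python
import numpy as np

# Sketch of the three TL metrics discussed above, computed from two learning
# curves (average reward per iteration) with and without transfer learning.
def tl_metrics(rewards_tl, rewards_no_tl, threshold, window=100):
    rewards_tl, rewards_no_tl = np.asarray(rewards_tl), np.asarray(rewards_no_tl)
    jump_start = rewards_tl[:window].mean() - rewards_no_tl[:window].mean()
    asymptotic = rewards_tl[-window:].mean() - rewards_no_tl[-window:].mean()

    def time_to(curve):
        hits = np.where(curve >= threshold)[0]
        return int(hits[0]) if len(hits) else None   # None: threshold never reached
    time_to_threshold = (time_to(rewards_tl), time_to(rewards_no_tl))
    return jump_start, asymptotic, time_to_threshold
```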
2.3.2 Meta-Learning
Meta-learning is often referred to as a learning-to-learn algorithm [90]. The main idea of meta-learning is to train the model on a collection of
similar tasks, e.g., image classifications. By doing so, it enables the model to acquire
generalization capabilities. Therefore, the trained model can quickly perform well
in a new task only after a few update iterations, even when provided with a small amount of data. Typically, a meta-learning algorithm consists of two nested optimization loops as follows.
• In the inner loop, a base-learning algorithm (e.g., SGD or Adam) solves a specific task
τ sampled from the task collection using a standard optimization
algorithm, such as stochastic gradient descent (SGD). Let θ
denote the DNN's parameters. After p steps of learning at the inner loop,
the resulting parameters are denoted by θ̄ = Uτp (θ), where Uτp (·) denotes the
update operator.
• Then, at the outer loop, a meta-learning algorithm (e.g., Reptile [89] and related algorithms)
updates the model's parameters towards the task-adapted parameters obtained from the inner loop,
i.e., θ ← θ + ηo (θ̄ − θ),
where ηo is the outer step size controlling how much the model parameters are
moved towards θ̄ at each outer iteration.
Note that if p = 1, i.e., performing a single step of gradient descent in the inner
loop, meta-learning becomes a joint training on the mixture of all tasks, which may
not learn a good initialization for meta-learning [89]. In this thesis, Chapter 6 will
further examine the effectiveness of meta-learning when being used to help a DL-
based signal detector quickly achieve good performance in new environments with
minimal knowledge.
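To make the two-loop structure above concrete, the following is a minimal Reptile-style sketch. The `sample_task` and `inner_update` routines and all hyperparameters are assumed placeholders, not the algorithms used in Chapter 6.

```python
import numpy as np

# Minimal Reptile-style sketch of the inner/outer loop structure described above.
# inner_update is an assumed base-learner performing p gradient steps on one task;
# sample_task and the step sizes are illustrative placeholders.
def reptile(theta, sample_task, inner_update, iterations=1000, p=5, eta_o=0.1):
    for _ in range(iterations):
        task = sample_task()
        theta_bar = inner_update(theta, task, steps=p)   # inner loop: U_tau^p(theta)
        theta = theta + eta_o * (theta_bar - theta)      # outer loop update
    return theta
```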
Given the above, transfer learning and meta-learning are both AI techniques
that aim to improve the performance and generalization of models on new tasks.
However, they differ in their objectives and approaches. Transfer learning focuses
on reusing knowledge learned in a source domain to improve learning in a target domain,
whereas meta-learning aims to improve the model's learning process itself so that it can quickly adapt
to new tasks with limited data. Both techniques have their unique strengths and
applications and are essential tools in advancing machine learning capabilities in the
context of wireless networks. Thus, this thesis will explore their applicability in addressing emerging challenges
of 6G networks, as summarized in Table 2.1.
Table 2.1 : Summary of Machine Learning Types and Their Suitability for Wireless Networks

Deep Learning
  Advantages: Excellent at handling large datasets; can model complex non-linear relationships.
  Disadvantages: Requires significant computational power; prone to over-fitting without sufficient data.
  Suitability for wireless networks: Effective for signal processing, image and speech recognition, and complex decision-making tasks in advanced wireless networks.

Reinforcement Learning
  Advantages: Learns through interaction with the environment; can adapt to changing conditions.
  Disadvantages: Requires a lot of computational resources; learning can be slow.
  Suitability for wireless networks: Well-suited for dynamic resource allocation and network optimization in real time.

Chapter 3
In this chapter, we introduce a novel framework based on deep reinforcement transfer learning that can
jointly optimize the speed and energy replenishment process of the Unmanned aerial
vehicle (UAV). Recently, UAV-assisted data collection has been emerging as a promi-
nent application due to its flexibility, mobility, and low operational cost. However,
under the dynamics and uncertainty of the IoT data collection and energy replenishment processes, optimizing the UAV's operations is highly challenging.
Thus, this work introduces a novel framework that jointly optimizes the flying speed
and energy replenishment for each UAV to significantly improve the overall system
performance (e.g., data collection and energy usage efficiency). Specifically, we first
develop a Markov decision process to help the UAV automatically and dynamically
make optimal decisions under the dynamics and uncertainties of the environment.
Although conventional reinforcement learning algorithms such as Q-learning can help the UAV to obtain the optimal policy, they often take a long
time to converge, which may not be practical for a UAV with limited computing capacity and energy resources. To that end, we develop advanced transfer learning
techniques that allow UAVs to “share” and “transfer” learning knowledge, thereby
Figure 3.1 : System model for UAV-assisted IoT data collection network.
reducing the computational complexity and resource consumption for the UAV as well. Extensive
simulations demonstrate that our proposed solution can improve the average data
collection performance of the system up to 200% and reduce the convergence time by up to 50% compared with conventional approaches.
The rest of this chapter is structured as follows. The system model and operation
control formulation are described in Sections 3.1 and 3.2, respectively. Section 3.3
presents the proposed learning algorithms. Then, the simulation results are analyzed in Section 3.4. Finally, conclusions are drawn in Section 3.5.

3.1 System Model
In this work, we consider a UAV-assisted IoT data collection system where a UAV
is deployed to collect IoT data over a considered area, as illustrated in Figure 3.1.
We assume that the considered area is divided into N zones. The IoT devices
are distributed randomly over these zones to execute various tasks, e.g., sensing
temperature and humidity. In practice, the numbers of IoT nodes in these zones are
often different. The system time is slotted (as in [28, 29]) with an equal duration, and each time slot is split into two
intervals, i.e., the broadcasting interval and the transmission interval. In the broadcasting
interval, the UAV uses a dedicated channel to broadcast a wake-up signal [111] to
all IoT nodes in its communication range. After acquiring this signal, these nodes
will send their data to the UAV during this transmission interval. The use of a
dedicated channel for wake-up signals is critical for reducing energy consumption
among IoT devices, as it allows these devices to remain in a low-power state until
they are activated by the wake-up signal. This strategy is particularly important for
IoT deployments in remote or inaccessible areas where power sources may be limited
or non-renewable. We assume that the communication link from IoT devices to the
UAV adopts the OFDMA technique, while the communication link from the UAV
to IoT devices uses the OFDM technique, as in [112]. In this way, the IoT devices
can simultaneously transmit data to the UAV. This dual approach leverages the
strengths of both OFDMA and OFDM to handle multiple access channels efficiently,
thereby enhancing the performance of the system in dynamic and densely populated IoT environments. Let pn denote the
probability of a data packet successfully collected by the UAV in a time slot in zone
n. Because the IoT nodes are distributed unevenly over N zones, pn may vary over
these zones. In the considered UAV-assisted IoT data collection network, the UAV
flies at a fixed altitude h (similar to [27, 28, 112–114]). Similar to the studies
in [27–29], we assume that the UAV follows a predefined trajectory to sweep through the considered area.
However, unlike [27–29], we consider a more realistic scenario where the UAV
is equipped with a battery that has limited energy storage. It is worth mentioning
that the energy consumption for the wireless data collection process (i.e., broadcast-
ing the wake-up signal and receiving data packets) is much lower than that of the
flying operation [26]. In addition, the decision of the UAV at each time slot (i.e.,
speed selection or returning for energy replenishment) does not influence the power
consumption for the wireless data collection process. Thus, our model can straight-
forwardly incorporate this consumption by adding a constant term to
the UAV's energy consumption at each time slot, similar to that in [24]. To that
end, in this work, we only focus on optimizing the energy consumption of the UAV's flying operation.
In a time slot, we assume that the UAV’s velocity is constant (as in [27, 112]),
but in different time slots the UAV can choose to fly at different speeds, e.g., vw =
{v1 , . . . , vA }. Each speed may cost a different amount of energy. For example, if
the UAV flies faster, it may use more energy per time slot [26]. Note that UAVs in
UAV-assisted IoT data collection systems often fly at a low speed to maintain the
reliability of the data collection process [28].
When the UAV's energy is depleted, it will fly back to the charging station placed
at a fixed location to replace its battery. Alternatively, during the flight, the UAV can decide to go back to the charging station to change
the battery, for example, when it is near the charging station and its energy level
is low. Once the energy replenishment process is accomplished, the UAV will fly
back to its trajectory and continue its task. Here, we consider that the UAV has
a maximum of E energy units for its operation (i.e., flying to collect IoT data)
and a backup energy storage for flying back to the charging station for battery
replacement. Note that during the return flights and battery replacement, the UAV cannot collect data. Suppose that it
takes the UAV tf and tb time slots to fly from its current location to the station
and to replace the battery, respectively. Indeed, the battery replacement time,
i.e., tb , may be known in advance, while the return flight time, i.e., tf , is highly
dynamic depending on the distance between UAV’s current location and the charging
station. In addition, tf also depends on the return speed of the UAV, denoted by
vr . Assuming that the UAV flies with a constant speed when returning to the station,
tf is proportional to the distance between the UAV and the station, and the total
energy replenishment time is te = 2tf + tb (covering the flights to and from the station
and the battery replacement time). Therefore, the energy replenishment process is also dynamic due to the dynamics of the return flying time tf .
In practice, the environment is highly dynamic and uncertain. Specifically, the UAV does not know the probabilities of receiving a packet in different areas
in advance. It is worth noting that the UAV may collect more data when moving in a zone with a high proba-
bility of receiving packets, i.e., a high value of pn . As a result, to maximize the data
collection efficiency, the UAV must gradually learn this knowledge in order to adapt
its operations accordingly, e.g., flying speed and energy level status. Moreover, the
returning flight time depends on the distance between the UAV’s current position
and the station, which is highly dynamic. Therefore, if the UAV appropriately de-
cides when to return for battery replacement (e.g., when it is near the station and
its energy level is low), the energy replenishment time will be reduced significantly,
resulting in high system performance. In contrast, if the UAV goes back to replace
its battery when its energy level is high and it is far from the station, it will waste
both time and energy, leading to low system performance. Thus, optimizing the
UAV's operations (i.e., flying speed selection and battery replacement) under such
dynamics and uncertainties is a challenging task. In the following sections, we will present our proposed learning algorithms
that can effectively and quickly obtain the UAV’s optimal operation policy under
the limited energy of the UAV and the uncertainty of the data collection process.
3.2 Optimal Operation Control Formulation

To overcome the uncertainty and high dynamics of the data collection and energy replenishment processes under the limited energy storage of the UAV, we formulate the UAV's operation control problem as a Markov Decision Process (MDP). The main notations are summarized in Table 3.1.
Table 3.1 : Summary of key notations.

S – State space
A – Action space
r – Immediate reward function
rta – Speed selection reward function at time t
rtb – Battery replacement reward function at time t
E – Maximum energy capacity of the UAV
pn – Probability of data packet collection in zone n
vw – Set of possible flying speeds
vr – Return speed of the UAV to the charging station
tf – Time to fly from the current location to the charging station
tb – Battery replacement time
te – Total energy replenishment time
Ω – Working reward for data collection
w1 , w2 – Weights in the reward function
mat – Energy consumption for action a at time t
dst – Number of collected data packets at state s at time t
Specifically, the MDP is defined by a tuple of the state space S, the action
space A, and the immediate reward function r. Based on the MDP framework, at each
time slot the UAV can dynamically make the best actions (e.g., flying at appropri-
ate speeds or returning for battery replacement) based on its current observations
(i.e., its location and energy level) to maximize its long-term average reward with-
out requiring complete information about data collection and energy replenishment
processes in advance.
In this work, we aim to maximize both the data collection efficiency and the energy
usage efficiency, and thus there are several important factors that we need to take
into consideration. The first important factor is the current location of the UAV.
The main reason is that the UAV’s location can reveal important information about
the expected amount of data that can be collected by the UAV and the time it takes
if the UAV chooses to fly back to the station for battery replacement. Specifically,
the UAV will likely collect more data when moving in a zone with a high probability
of receiving data than in a zone with a low probability of receiving data. In addition,
the farther the distance between the UAV and the station is, the more time it takes
to travel between these two positions. As mentioned above, the UAV always flies
at a fixed altitude so that the UAV’s position can be given by its 2D projection on
the ground, i.e., (x, y) coordinates. The second crucial factor is the current UAV’s
energy level, denoted by e, which affects the decision of the UAV at every time slot.
For example, the UAV should not select the battery replacement action (i.e., return
to the station to replace the battery) unless its energy level is low. Otherwise, it may
waste time and energy for the flying back trip. To that end, this information is
embedded into the state space of the UAV, which can be defined as follows:
S = { (x, y, e) : x ∈ {0, . . . , X}, y ∈ {0, . . . , Y }, e ∈ {0, . . . , E} } ∪ {(−1, −1, −1)},    (3.1)
where X and Y are the maximum x and y coordinates of the UAV’s trajectory, and
E is the maximum energy capacity of the UAV. As a result, the system state can
be fully observed by the UAV based on its current position and energy level. To capture the energy replen-
ishment process, it is necessary to introduce a special state, i.e., s = (−1, −1, −1).
This special state is only visited when UAV’s energy is depleted (i.e., UAV’s state
is (x, y, 0)) or if the UAV selects the battery replacement action. Then, after the
energy replenishment process completes, the UAV will return to the previous po-
sition (where it decided to go back for battery replacement or where its energy is
depleted) with a full battery, i.e., s = (x, y, E). This design ensures that the energy
replenishment process is properly reflected in the system state. To maximize the energy
usage and data collection efficiency, the UAV needs to not only choose the most
suitable flying speed but also decide when to go back to the station to replace the
battery. It is worth mentioning that given different states, the action spaces for these
states may be different. For example, at a non-working state, i.e., s = (−1, −1, −1),
the UAV cannot select a flying speed. Instead, the possible action at this state is
to stay “idle” until the UAV returns to its trajectory with a full battery. In other
words, the UAV will stay at the non-working state after performing an “idle” action
until the energy replenishment process completes. As a result, we can define the
action space as A ≜ {0, 1, . . . , A}, where action a = 0 indicates that the UAV will choose to return to the station for replacing the battery, namely
the battery replacement action. Actions a ∈ {1, . . . , A} represent the speed level
that the UAV selects to fly at the current time slot. In addition, given the state
s ∈ S, the action space at state s, i.e., As , consists of all possible actions that the UAV can take at that state.
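The state and action spaces above can be captured in a few lines of code. The sketch below is illustrative: the number of speed levels and the string `"idle"` action are assumptions used only to mirror the description, not the thesis implementation.

```python
from dataclasses import dataclass

# A compact sketch of the state and action spaces defined above. Action 0 is
# battery replacement, actions 1..A are speed levels, and the special state
# (-1, -1, -1) models the energy replenishment period.
@dataclass(frozen=True)
class UAVState:
    x: int
    y: int
    e: int

REPLENISHING = UAVState(-1, -1, -1)

def available_actions(state, num_speeds=3):
    if state == REPLENISHING:
        return ["idle"]                               # wait until replenishment completes
    return [0] + list(range(1, num_speeds + 1))       # 0: return to station; 1..A: speeds
```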
As discussed above, two main actions (i.e., flying speed selection and battery replacement)
directly affect the system performance. On the one hand, selecting an appropriate flying speed at each time slot can maximize the efficiency of the
collecting data process as well as energy usage. Alternatively, selecting the right
time to return for battery replacement can reduce the energy replenishment time,
thereby improving the overall system performance. For example, when the UAV is
flying near the charging station and its energy is low, it should return to the charging
station for battery replacement. Therefore, our proposed immediate reward function
consists of (i) the speed selection reward function, i.e., rta , and (ii) the battery replacement
reward function, i.e., rtb , as follows:

rt (st , at ) =  rta (st , at ),  if at ∈ A \ {0},
                 rtb (st , at ),  if at = 0,                    (3.4)
                 0,              otherwise.
Since the selected flying speed determines the tradeoff between the data collection efficiency and energy usage efficiency, the speed selection
reward function needs to capture this information. We define data collection effi-
ciency as the number of collected data packets over a time slot and operation status
of the UAV. For example, given a time slot, if the UAV is moving to collect data,
it will receive a working reward, denoted by Ω > 0. Otherwise, it will not receive
the working reward, i.e., Ω = 0. In this way, the working reward will encourage the
UAV to collect data rather than return and then wait at the station for the battery
replacement. The energy usage efficiency can be represented by the cost of choosing
a flying speed, i.e., energy consumed by the UAV to fly at speed a during a time
slot t. Clearly, the selected speed determines the energy consumption per time slot
of the UAV, e.g., a low speed will cost the UAV less energy per time slot than that
of a high speed [26]. At time slot t, the cost of performing action a is denoted by
mat , and the number of collected data packets at the current state s is denoted by
dst . Thus, the speed selection reward function can be expressed by:
where w1 and w2 are the weights to balance between collected data and energy
consumption of the UAV. It is worth noting that these weights can be defined in
advance based on the service provider's requirements. For example, if the
data is more important and valuable than energy, we can set the value of w1 to be
higher than that of w2 . In contrast, if the energy is scarce, we can set the value of
w1 to be lower than that of w2 . Therefore, the speed selection reward function can
capture the UAV’s data collection efficiency and energy consumption efficiency.
Although the speed selection reward function gives the UAV sufficient informa-
tion to learn the optimal speed control, it cannot help the UAV learn when it is good
to return to the station for the battery replacement. First, if the UAV performs the
battery replacement action, the values of Ω and dst in (3.5) will be zero, leading
to a negative value of rta . Consequently, the UAV may consider this action as a
bad choice and will not choose it in the future. Second, the speed selection reward
function is unable to guide the UAV to learn where it is good to return for replacing
its battery, e.g., the further from the station the UAV is, the smaller reward for bat-
tery replacement action it may receive. Therefore, battery replacement action needs
a different reward function, which needs to take into account both the UAV's
current energy level, i.e., e, and the distance between its current position and the
station, denoted by l. However, it is not straightforward to design a reward func-
tion for the battery replacement action because the complex relationship between e and
Figure 3.2 : The battery replacement reward as a function of the UAV's energy level e (energy units) and its distance l (m) to the charging station.
l makes it more difficult for the UAV to decide whether it should go back or keep
flying to collect data. In particular, the UAV may choose the battery replacement
action if both e and l are small, while the UAV should continue its collection task
if any of these factors is large. To that end, in the following, we propose a reward
function for battery replacement action that can address this problem, and to the
best of our knowledge, this is the first work in the literature addressing the battery replacement decision in the reward design for UAV-assisted data collection.
Suppose the UAV decides to return for battery replacement at time t, at which its
energy and the distance to the station are et and lt , respectively. Then, its immediate
reward for this action, given in (3.6), consists of a constant c minus two terms that grow
exponentially with et and lt . Here, c controls the maximum value of rtb , which may affect the
learning policy of the UAV. For example, if c is smaller than the smallest value of
rta , the value of returning reward function is always lower than that of the speed
selection reward function, making the return action always “worse” than those of the
speed selection actions. The second and third terms in (3.6) express the influence of
the energy level et and the distance lt to the UAV’s decision for battery replacement.
The tradeoff between the energy level et and the current distance lt is controlled by
the remaining weights in (3.6), e.g., w6 = 0.08 in our setting. It can be observed that as e and l become large (e.g., greater than 36 and 17,
respectively), the UAV will receive negative rewards, meaning that the UAV is not
encouraged to return to replace its battery if its energy is high or it is far from the
station. In this way, this function will encourage the UAV to return to the station
for replacing the battery when its current energy and distance to the station are
small.
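The qualitative behaviour of the two reward components can be illustrated with a short sketch. The exact functional forms and weight values below are assumptions made only for illustration (the thesis defines the actual forms in (3.5) and (3.6)); what matters here is the linear data-versus-energy tradeoff and the exponential penalties on e and l.

```python
import math

# Hedged sketch of the two reward components described above; forms and weights
# are illustrative assumptions, not the exact thesis definitions.
def speed_reward(packets, energy_cost, working=True, w1=1.0, w2=0.3226, omega=1.0):
    # Linear tradeoff between collected data (plus the working reward) and energy cost.
    return w1 * (packets + (omega if working else 0.0)) - w2 * energy_cost

def battery_replacement_reward(e, l, c=5.0, w3=0.5, w4=0.08, w5=0.5, w6=0.08):
    # High reward only when both the energy level e and the distance l are small;
    # the exponential terms penalise returning with high energy or from far away.
    return c - w3 * math.exp(w4 * e) - w5 * math.exp(w6 * l)
```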
In this work, our aim is to maximize the average long-term reward by finding the
optimal operation policy π ∗ for the UAV. Given the UAV's current energy and location, π ∗ determines an action that maximizes the long-term average reward as follows:
max_π R(π) = lim_{T→∞} (1/T) E [ Σ_{t=1}^{T} rt (st , π(st )) ],    (3.7)
where R(π) is the long-term average reward obtained by the UAV according to
the policy π, π(st ) is the selected action at state s at time slot t based on policy
π, and rt (st , π(st )) is the immediate reward by following policy π at time slot t.
Thus, the optimal policy π ∗ will assist the UAV in dynamically making the optimal
action according to its current observation, i.e., its position and energy level. More-
over, Proposition 1 shows the optimality of the proposed immediate reward function
rt (st , at ).
Proposition 1. “There always exists a maximum value of the proposed immediate reward function.”
Proof: We first prove the optimality for each component of the immediate
reward function, i.e., rta and rtb . From (3.5), the speed selection reward function is a
linear function of energy consumption mat and an amount of collected data packet
dst . Since the values of mat and dst are always positive and finite, there always exists
a maximum value of the speed selection reward function rta (st , at ). For the second
component, i.e., rtb (st , at ), let f (et , lt ) denote its penalty term that depends on the
energy level et and the distance lt . The Hessian matrix of f (et , lt ) is given by:

H = diag( w3 w4² exp(w4 et ),  w5 w6² exp(w6 lt ) ).    (3.9)
Then, the determinant of H is computed as det H = w3 w4² exp(w4 et ) w5 w6² exp(w6 lt ).
Since the weights in the immediate reward function are always positive, we have det H > 0.
Therefore, H is positive definite and f (et , lt ) is convex, meaning that there always
exists a minimum value of f (et , lt ). As a result, rtb (st , at ) always has a global optimal
value. Given the above, there always exists a maximum value of the immediate reward function rt (st , at ), which completes the proof.
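The convexity argument can also be checked symbolically. The sketch below assumes the penalty term takes the exponential form implied by the Hessian above (an assumption for illustration only) and verifies that the Hessian is diagonal with a strictly positive determinant.

```python
from sympy import symbols, exp, hessian

# Symbolic check of the convexity argument, assuming the exponential penalty
# form f(e, l) = w3*exp(w4*e) + w5*exp(w6*l) implied by the Hessian above.
e, l, w3, w4, w5, w6 = symbols('e l w3 w4 w5 w6', positive=True)
f = w3 * exp(w4 * e) + w5 * exp(w6 * l)

H = hessian(f, (e, l))
print(H)          # diagonal entries w3*w4**2*exp(w4*e) and w5*w6**2*exp(w6*l)
print(H.det())    # strictly positive product, so f is convex in (e, l)
```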
3.3 Optimal Operation Policy for UAVs with Deep Reinforcement Transfer Learning

In practice, the UAV does not have complete information about the surrounding environment in advance (e.g., the data arrival probabilities and the
charging process) to obtain the optimal policy. In addition, the limited resource of the
UAV requires a low computational complexity solution. Thus, Q-learning and D3QL
algorithms, discussed in Section 2.2.2, can be used to learn the optimal policy for
the UAV.
However, D3QL still poses some drawbacks inherited from conventional DRL when addressing scenarios
with high sample complexity, such as the problem considered in this work where the
state space is very large. Firstly, it often takes a lot of time to train the DNN, e.g., DQN's training time is up to 38 days for each
Atari game [33]. If the environment dynamics (such as the data arrival process) or
the trajectory of the UAV changes, the DNN may need to be fine-tuned or even
retrained from scratch. Such a long training process is clearly infeasible to deploy on a UAV that has very limited energy and computing resources.
Secondly, since the UAV should only return for replacing its battery when it is close
to the station or its energy level is low, it needs sufficient experiences in this region,
especially when flying over the station. However, as the UAV flies over its fixed
trajectory, experiences obtained from this region are often very few compared
with all experiences obtained over the entire considered area. Therefore, the UAV may not
learn a good battery replacement policy with conventional DRL alone. To address these
issues, we leverage transfer learning, which reuses knowledge obtained from
a source task in a source domain to enhance the learning process of target tasks in
target domains. Recall that a domain typically contains labelled or unlabelled
data given before the considered training process starts. However, data in RL is
obtained via interactions between the agent (i.e., the UAV) and its surrounding
environment. As a result, both the domain and the task in RL can be represented by an MDP.
Definition 3.1. Transfer learning in RL: Suppose the source and target MDPs are
denoted by MS and MT , respectively. Transfer learning aims to improve the learning
process of the target MDP MT by leveraging knowledge obtained
from the source MDP MS , i.e., the policy, the environment dynamics, and the data.
Note that transfer learning for DRL may look like supervised learning since they
both use existing data, but they are very different. In particular, all DRL data
used to train a DNN are unlabelled and on-the-fly data generated by interactions
between an agent (e.g., the UAV) and its surrounding environment. Although the
source data are collected in advance for the target UAV, they are just observations
of the source UAV about the source environment, which do not have any label
to indicate which action the UAV should take. Thus, the target UAV still needs
learning algorithms (e.g., D3QL) to learn the optimal policy gradually. Our proposed
TL can help the UAV utilize the knowledge from the source domain to avoid bad
decisions at the beginning of the learning process when the UAV is exploring the
environment by taking a random action, thereby improving the learning rate and
learning quality. Recall that to measure the effectiveness of TL, we can use three
metrics, i.e., jump-start, asymptotic performance, and time-to-threshold, which respectively capture the improvement at the beginning of, at the end of, and during the learning
Figure 3.3 : The proposed transfer learning framework, in which the experiences (st , at , rt , st+1 ) and/or the policy of the source UAV are transferred to the target UAV, whose Q-network is trained with mini-batches (s, a, r, s′ ) ∼ U(B) sampled from its experience memory B.
process. It may even negatively impact the learning in the target MDP if the transferred
knowledge is not carefully chosen. Thus, in the following, we propose a transfer learning
framework that can reduce the learning time and improve the learning quality for D3QL.
Specifically, the knowledge obtained by a source UAV, i.e., US , in a source MDP
can be leveraged to help a new UAV UT effectively learn the optimal policy for working in a target MDP. The transferred knowledge
can be in the form of the policy and/or experiences of the source UAV, i.e., US .
In particular, we consider the following three transfer schemes.

• Experience Transfer (ET): In this approach, experiences collected by US in
the source MDP, i.e., MS , are used to improve the learning process of the target UAV,
UT . Specifically, the source experiences are added
to the memory buffer of the target UAV. Then, these transferred experiences
and target UAV’s new experiences are used to train the Q-network. In this
manner, the target UAV can quickly obtain adequate information, thereby
improving its learning rate and learning quality. Note that the value of the transferred
experiences also affects the learning process. For example, an experience does
not have much value if it can easily be obtained by the target UAV. In contrast,
an experience that is rarely obtained by the target UAV but significantly
impacts the system performance is highly valuable. For example, experiences obtained when the
UAV is near the station may have high values because they not only contain
information about the data collection process but also may reveal valuable information about the right time to take the battery
replacement action.
• Policy Transfer (PT): This approach directly transfers the policy of a source
UAV to a target UAV. In particular, UT starts the learning process with the
policy learned by US , which is then gradually updated based on the experiences
of UT obtained in the target MDP MT . Thus, starting with the source policy
can help UT to avoid poor decisions caused by the randomness of action selection at the beginning of the learning process.
• Hybrid : This approach aims to leverage the benefits of both experience and
policy transfer types. Particularly, the hybrid scheme can improve not only the performance at the beginning of the learning process (thanks to the transferred policy) but also the learning quality over time (thanks to the transferred experiences).
Note that the efficiency of each transferring technique depends on the relation-
ship between the source and the target MDPs. For example, if the source and target
MDPs are very similar, policy transfer may yield a better result in terms of con-
vergence rate than that of the experience transfer technique. In contrast, when the
source and target MDP are not similar, e.g., differences in environment dynamics, the experience or hybrid transfer may be preferable, as will be demonstrated in Section 3.4.
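The three transfer schemes above can be realized with a few simple operations on the replay buffer and the Q-network, as in the sketch below. The `source_buffer`, `near_station` selection rule, and the `set_weights`/`get_weights` methods are assumed interfaces for illustration, not the thesis implementation.

```python
# Minimal sketch of the three transfer schemes described above; all object
# interfaces are assumed placeholders.
def experience_transfer(source_buffer, target_buffer, near_station, k=1000):
    # ET: seed the target UAV's replay memory with valuable source experiences,
    # e.g., those collected near the charging station.
    transferred = [exp for exp in source_buffer if near_station(exp)][:k]
    for exp in transferred:
        target_buffer.store(*exp)

def policy_transfer(source_q_net, target_q_net):
    # PT: start the target UAV's learning from the source UAV's policy.
    target_q_net.set_weights(source_q_net.get_weights())

def hybrid_transfer(source_buffer, target_buffer, source_q_net, target_q_net, near_station):
    # Hybrid: combine both forms of transferred knowledge.
    policy_transfer(source_q_net, target_q_net)
    experience_transfer(source_buffer, target_buffer, near_station)
```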
Table 3.2 : Simulation parameters: Ω = 1, w1 = 1, w2 = 0.3226, c1 = 5, c2 = 0.5, c3 = 0.5, c4 = 0.022, c5 = 0.2, E = 300, p = [0.1, 0.25, 0.6, 0.15], tb = 10, and vr = 1.
3.4 Performance Evaluation

We first evaluate our proposed approaches in an IoT system, in which a UAV flies
over a given area following a predefined trajectory, as illustrated in Figure 3.4 (a).
This area is divided into four zones such that the UAV's travel distance in each zone
is the same, and the charging station is placed at a fixed location,
i.e., (0, 0). The probabilities of packet arrival in these zones are given by a vector
p = [p1 , p2 , p3 , p4 ], e.g., p1 is the packet arrival probability when the UAV is flying
over zone 1 and so on. Since the UAV collects data while flying, it often flies at a
low speed (e.g., 5 m/s) to maintain the reliability of the data collection process [28].
Therefore, we consider that the UAV has three speeds: 1, 3, and 5 (m/s). The UAV
consumes 2, 3, and 4 (energy units/time slot) when flying at 1m/s, 3m/s, and 5m/s,
respectively. Note that our proposed MDP framework and learning algorithms can
help the UAV learn the optimal policy according to its observation (i.e., its current
position and energy level) regardless of the UAV’s specifications, i.e., its available
speeds and corresponding energy consumption. Here, the energy unit is used to
quantify an amount of energy (as in [115, 116]), and energy consumption values
are only used to demonstrate how much energy the UAV uses at each time slot for its
flying operation. In practice, a UAV may have different
specifications, but the proposed algorithm can still obtain the optimal policy for the
UAV. The parameters of the reward function are provided in Table 3.2.
Figure 3.4 : (a) Source MDP, (b) the first target MDP, and (c) the second target
MDP.
The settings for our proposed algorithms are set as follows. For the ϵ-greedy
policy, ϵ is first set at 1, then gradually decreased to 0.01. For all the proposed
algorithms, the discount factor is set at 0.9. In Q-learning, the learning rate β is 0.1.
The architectures of Q-network and target Q-network are illustrated in Figure 2.4.
We use typical hyperparameters for training DNN, e.g., the learning rate and the
frequency update of Q̂ are set to 10−4 and 104 , respectively, as those in [33, 95].
For the experience transfer scheme, experiences are selected and transferred based on how valuable their information is. Recall that if the UAV performs the
battery replacement action when it is far from the station, it always receives a very
small reward compared with those of other actions, as in (3.4). Therefore, the UAV’s
experiences in this area can imply that it should not take the battery replacement
action. In contrast, experiences obtained when the UAV is near the station can
contain both types of information, i.e., when it is not worth returning for charging (e.g., the current
energy level is high) and when it is worth taking the battery replacement action
(e.g., the current energy level is low). Thus, we choose experiences obtained when the
UAV is near the station for transferring.
We study a scheme where the UAV does not have complete information about
the surrounding environment in advance, e.g., the data arrival probabilities and the
energy replenishment process. Hence, we compare our approach with three other
deterministic policies, i.e., the UAV always flies at (1) lowest speed (1m/s), (2)
middle speed (3m/s), and (3) highest speed (5m/s). In addition, the Q-learning
and D3QL are selected as the baseline methods to demonstrate the effectiveness of
our proposed TL techniques since they are widely used algorithms and the most
relevant to our approach. Moreover, we consider a variant of our approach,
namely D3QL-NoRA, where the D3QL will still be implemented on the UAV, but
the battery replacement (returning) action is not available, i.e., the UAV only returns to the station when its energy is depleted.
In the simulation, we first evaluate the performance of our proposed learning al-
gorithm, i.e., D3QL-TL, by examining the convergence rate and the obtained policy.
Then, we evaluate the system performance when varying some important param-
eters (e.g., battery replacement time, UAV's energy capacity, return speed, and
packet arrival probability).
For the D3QL-TL, the experience transfer type is chosen because it can leverage
the experiences obtained during the learning phases of other algorithms, i.e., D3QL
and Q-learning. Finally, to gain more insights into the effectiveness of the three transfer
learning techniques, we evaluate them in two additional scenarios at the end of this
section. Figure 3.5 (a) first shows the convergence of the considered approaches in terms
of average rewards. At the beginning of the learning processes, the average rewards
of proposed approaches are close to each other, approximately 0.35. However, only
after 4, 000 iterations, D3QL-TL’s average reward is nearly 170% greater than those
of other approaches. Then, D3QL-TL almost converges to the optimal policy after
Figure 3.5 : (a) Average reward (rewards/time slot) versus learning iterations (×2×10³) of D3QL-TL, D3QL, D3QL-NoRA, and Q-learning; (b) the UAV's energy level over time slots under the policy learned by D3QL-TL.
7.5×104 iterations, and its average reward becomes stable at around 0.78, which is
more than 190% greater than those of other learning algorithms. Interestingly, D3QL
and D3QL-NoRA converge to policies that achieve similar average rewards. This
suggests that D3QL alone is unable to take advantage of the battery replacement action
in this complex situation. This result implies the outperformance of our proposed algorithm, i.e.,
D3QL-TL, compared with other methods when addressing the extremely complex
problem considered in this work. Next, we examine the policy obtained by D3QL-TL, illustrated in
Figure 3.5 (b). In particular, a point indicates the energy level of the UAV at the
beginning of a time slot. The slope of a straight line between two points indicates the
selected action, e.g., the steeper this line is, the higher speed is selected. Generally,
the lowest speed is selected in a zone that has a high probability of successfully
collecting a packet. In contrast, the highest speed is selected in a zone that has
a low probability of successfully collecting a packet. Since p =
[0.1, 0.25, 0.6, 0.15], as in Table 3.2, the D3QL-TL selects the lowest speed in zone
3, and the highest speed in zones 1, 2, and 4. More interestingly, the UAV experiences
the middle and high speeds in zone 1. In particular, it travels at the highest speed
until its energy decreases to 55 energy units at time slot 88, then the middle speed
is selected. When its energy drops to 40 energy units at time slot 93, equivalent
to 13.3% energy level, the UAV takes the battery replacement action. Note that
Figure 3.5 (b) also reveals information about the location where the UAV should take
the battery replacement action. Specifically, given te = 12 time slots, the location at which
the UAV decides to return is only 1 m away from the station. This result demonstrates
the impacts of location and energy level on the UAV's returning decision.

In the following evaluations, the simulation parameters are set to be the same as those in Section 3.4.2.1. The policies of the proposed
learning algorithms, including Q-learning, D3QL, and D3QL-TL, are obtained after
1.5×105 iterations.
In Figure 3.6, we vary the battery replacement time, i.e., tb . Clearly, the av-
erage reward and throughput of all policies decrease as the battery replacement
time increases from 5 to 50 time slots. This stems from the fact that given a
fixed duration, the less time the UAV needs to replace the battery, the more time
it can spend collecting data. As a result, the data collection efficiency of the sys-
tem reduces. It can be observed that D3QL-TL can significantly outperform other
approaches in terms of average reward and throughput, while it still obtains a rea-
sonable energy consumption per time slot. In particular, the average reward and
throughput achieved by D3QL-TL are up to 200% and 185% greater than those of the other policies, respectively.
Figure 3.6 : Varying the battery replacement time tb (time slots): (a) average reward, (b) average throughput, and (c) average energy consumption.
Next, we vary the return speed vr of the UAV to observe the system's performance in terms of average reward, throughput, and energy consumption. Figures 3.7
(a) and (b) show that as the return speed vr increases from 1(m/s) to 10(m/s), all
policies have upward trends in terms of the average reward and throughput, except
that of the D3QL-TL. The reason is that the increase of vr leads to the decrease
of time for returning to the station, i.e., tf , and thus the UAV has more time to
collect data in a fixed duration. More interestingly, when the return speed is low
(e.g., lower than 3(m/s)), the lowest speed policy obtains a higher average reward
than that of the highest speed policy, as observed in Figure 3.7 (a). Nevertheless,
the lowest speed policy obtains the lowest performance (i.e., the average reward is
lowest) when the return speed is large. This emanates from the fact that given a
fixed serving time, the energy consumption of the highest speed is higher than that
of the lowest speed. Therefore, the UAV has to replace its battery more frequently if
it flies at the highest speed rather than if it flies at the lowest speed. Consequently,
as the return speed increases, the highest speed policy eventually performs better
than that of the lowest speed policy. Unlike other policies, D3QL-TL achieves a
stable average reward, approximately 0.8, that is always much higher than those
of other policies, as shown in Figure 3.7 (a). This is because the UAV equipped
with D3QL-TL can learn an excellent policy, e.g., taking battery replacement action
when the UAV is close to the station, making it more adaptable to the changes of the return speed.
Figure 3.7 : Varying the return speed vr (m/time slot): (a) average reward, (b) average throughput, and (c) average energy consumption.
Figure 3.8 : Varying the packet arrival probability in zone 3: (a) average reward, (b) average throughput, and (c) average energy consumption.
We then vary the packet arrival probability p3 of zone 3 while those of other
zones are unchanged, as provided in Table 3.2, and observe the performance of our
proposed approaches. Figures 3.8 (a) and (b) clearly show the increase of average
rewards and throughputs for all policies when p3 increases from 0.1 to 1.0. Interest-
ingly, as shown in Figure 3.8 (a), when p3 is small, e.g., lower than 0.4, the lowest
speed policy obtains lower rewards than those of the highest speed. However, when
p3 becomes larger, the lowest speed achieves a higher average reward than that of
the highest speed. This implies that the UAV should fly at the lowest speed if the
packet arrival probability is high, and vice versa. Figure 3.8 (c) demonstrates that
our proposed algorithm, i.e., D3QL-TL, can learn the environment's dynamics, e.g.,
the packet arrival probabilities, and adapt the UAV's speed accordingly. When the probability of receiving a
packet is low, e.g., less than 0.4, the UAV's average energy consumption is high,
approximately 3.65 energy units/time slot, indicating that the highest speed is se-
Figure 3.9 : Varying the UAV's energy storage capacity E (energy units): (a) average reward, (b) average throughput, and (c) average energy consumption.
lected more frequently than the lowest speed. In contrast, when this probability
becomes higher, e.g., larger than 0.5, the UAV’s energy consumption decreases to
around 2.6, implying that the lowest speed is the most frequently selected speed.
To that end, D3QL-TL can leverage this knowledge to consistently obtain the best performance among all the considered approaches.
Finally, in Figure 3.9, we vary the UAV’s energy storage capacity E to study
its impact on the system performance. In particular, when E is varied from 100
to 1000 energy units, the average rewards and throughputs of all policies increase,
as shown in Figure 3.9 (a) and (b), respectively. It is worth highlighting that if
E is small, e.g., less than 500, the performance of the lowest speed is better than
that of the highest speed, as illustrated in Figure 3.9 (a). However, the highest
speed outperforms the lowest speed when E is larger than 500. This is due to the
fact that when E is small, the UAV’s battery has to be replaced more frequently,
leading to a downgrade of the UAV’s data collection efficiency. Thus, the UAV must
conserve more energy by flying at the lowest speed. By balancing between energy
usage efficiency and data collection efficiency, our proposed D3QL-TL approach can
consistently achieve the best performance regardless of the UAV's energy capacity.

We now evaluate the three transfer learning techniques of
D3QL-TL (i.e., experience transfer (ET), policy transfer (PT), and hybrid transfer)
in different scenarios, as shown in Figure 3.4 (b) and (c). In particular, the source
MDP, i.e., MS is defined as the MDP described in Section 3.4.1, and the simulation
parameters are also provided in Table 3.2. Then, the optimal policy obtained by
D3QL-TL after 1.5×105 iterations and the UAV’s experiences gathered in the source
MDP are leveraged to reduce the learning time and improve the learning quality. To gain an
insight into when and how much these transfer learning techniques can improve the learning process, we consider two scenarios:
• In the first scenario (illustrated in Figure 3.4 (b)), the target MDP (i.e., M1T )
is the same as MS except for the trajectory: the UAV only flies over two of the four zones.
• In the second scenario (illustrated in Figure 3.4 (c)), the difference between
target MDP (i.e., M2T ) and MS is the probabilities of receiving a packet, i.e., the packet arrival probability vector p is different from that of the source MDP.

We first compare the performance of D3QL-TL and D3QL in the first scenario in Figure 3.10 (a). To investigate how
experiences impact on the learning process of D3QL-TL, we select two sizes of expe-
rience sets, which are 50 and 100. As shown in Figure 3.10 (a), after 5×104 learning
iterations, the average rewards obtained by all types of D3QL-TL can achieve up
to 179% greater than that of D3QL, except the ET with the size of 50. For the
ET group, when the experience size is small, e.g., 50, transfer learning does not
improve the learning process of D3QL since the average reward of ET is similar to
that of D3QL. However, when this size is large enough, e.g., 103 , the asymptotic
Figure 3.10 : Average reward (rewards/time slot) of the transfer learning schemes in (a) the first scenario (iterations ×2×10²) and (b) the second scenario (iterations ×2×10³).
performance of D3QL-TL is 171% greater than that of D3QL in terms of average re-
ward. Interestingly, for the hybrid approach, when the experience size decreases, the
obtained performance does not degrade noticeably. Notably, the PT scheme, which can be seen as the
hybrid with zero experience size, consistently outperforms other transfer approaches in this scenario.
Specifically, only after 103 iterations, the PT obtains the optimal policy, and its
average reward is stable at around 0.68. Thus, the PT can help the UAV to reduce
the learning time up to 50% compared with those of other approaches. These results
demonstrate that if the environment’s dynamics in the target MDP, e.g., probabil-
ities of receiving a packet, are similar to those in the source MDP, PT is the best
choice. The reason is that the change of trajectory makes source experiences less
efficient than that of the source policy. It is worth noting that only schemes with
PT can improve the system performance at the beginning of the learning process
because this policy can help the UAV choose valuable actions in this period, e.g.,
selecting battery replacement action when it is near the station and its energy level
is low.
In Figure 3.10 (b), we show the results of the second scenario where the packet
arrival probabilities are different from that of the source MDP. In this scenario,
we set the experience size to 103 for the ET and hybrid schemes. Again, it can
be observed that only schemes with PT can improve the initial performance, e.g.,
during the first 2×103 iterations. Unlike the first scenario, the PT yields the worst
performance among the transfer learning schemes, and its asymptotic performance
is also the lowest among them at the end of the learning process. In contrast, the ET and hybrid schemes achieve
significant improvements, and the hybrid scheme's gain in this
metric, i.e., 192%, is significantly higher than that of ET, i.e., 86%. These results
suggest that when the environment dynamics change, the hybrid scheme should be
selected.
It is worth noting that some transfer schemes may bring little benefit when used
to improve the performance of DRL, e.g., the ET with 50 experiences in the first
scenario and the PT in the second scenario. In the first case, the change in UAV’s
trajectory means that the UAV still works in the same environment, but it follows
another path. Therefore, the PT can quickly achieve the best policy because it
can directly improve the performance of the UAV instead of gradually transferring
the source knowledge in the ET. On the other hand, the probability of successfully
collecting a packet changes in the second scenario, meaning that the UAV is placed
in a different environment. In this context, the hybrid transfer technique can obtain
the highest performance since it can leverage both the policy and the important experiences gathered in the source MDP.
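To make the three transfer modes concrete, the following minimal Python sketch shows how a target-MDP agent could be initialized from source knowledge. It is only an illustration under assumed data structures (plain weight objects and experience lists), not the thesis implementation.

import random
from collections import deque

def transfer_knowledge(source_policy_weights, source_experiences,
                       mode="hybrid", n_experiences=1000, buffer_size=50000):
    """Return (initial_weights, replay_buffer) for the target-MDP agent.

    source_policy_weights: weights of the Q-network trained on the source MDP
                           (e.g., a PyTorch state_dict); used by PT and hybrid.
    source_experiences:    list of (s, a, r, s_next) tuples from the source MDP.
    mode: "PT" (policy transfer), "ET" (experience transfer), or "hybrid".
    """
    initial_weights = None
    replay_buffer = deque(maxlen=buffer_size)

    if mode in ("PT", "hybrid"):
        # Policy transfer: the target agent starts from the source policy
        # instead of a random initialization.
        initial_weights = source_policy_weights

    if mode in ("ET", "hybrid"):
        # Experience transfer: seed the target replay buffer with a subset of
        # transitions gathered in the source MDP.
        k = min(n_experiences, len(source_experiences))
        replay_buffer.extend(random.sample(source_experiences, k))

    return initial_weights, replay_buffer

In this sketch, PT corresponds to copying the learned weights, ET to pre-filling the replay buffer, and the hybrid scheme simply applies both.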
3.5 Conclusions
In this chapter, we develop a novel Deep Dueling Double Q-learning with Transfer
Learning algorithm (D3QL-TL) that jointly optimizes the flying speed and energy
replenishment activities for the UAV to maximize the data collection performance.
The proposed algorithm can effectively handle not only the dynamics and uncertainty of the system but also the high-dimensional state
and action spaces of the underlying MDP problem with hundreds of thousands of
states. In addition, the developed transfer learning techniques (i.e., policy transfer, experience
transfer, and hybrid transfer) allow UAVs to "share" and "transfer" their learned
knowledge, thereby significantly reducing the learning time and improving the learning
quality. The simulation results show that our proposed solution can significantly
improve the system performance (i.e., data collection and energy usage efficiency)
and has a remarkably lower computational complexity compared with other conven-
tional approaches. In the next chapter, we will explore an emerging technology, i.e.,
integrated communication and sensing (ICAS), that enables the ubiquitous sensing
capability of 6G networks.
Chapter 4
This chapter focuses on the Integrated Communications and Sensing (ICAS) tech-
nology that plays a critical role in enabling 6G systems to become ubiquitous sen-
sors [1]. Additionally, ICAS also emerges as a promising solution for Autonomous
Vehicles (AVs), a use case of 6G [2], where sensing and data communications are
two important functions that often operate simultaneously. For ICAS application
to AVs, optimizing the waveform structure is one of the most challenging tasks due
to the inherent tradeoff between the sensing and communication functions. Specifi-
cally, the preamble of a data communication frame is typically leveraged for the
sensing function; thus, the longer the Coherent Processing
Interval (CPI) is, the greater the sensing task's performance is. In contrast, commu-
nication efficiency decreases as more of the CPI is occupied by preambles. Moreover, for AVs, the
surrounding radio environments are usually dynamic with high uncertainties due to
their high mobility, making the ICAS's waveform optimization problem even more
challenging. To that end, this chapter presents our proposed solution that can auto-
matically optimize the waveform configuration for the ICAS of autonomous vehicles.
In particular, we develop a framework established on the Markov decision process and recent advanced tech-
niques in deep reinforcement learning, enabling the vehicle to automatically
optimize its waveform structure (i.e., the number of frames in the CPI) to maximize
the joint communication and sensing performance under the environment's
dynamics and uncertainty. Note that the optimal waveform is not fixed; instead, the
ICAS-AV needs to optimally adapt its waveform configuration (i.e., the number of
frames in the CPI) according to the environment conditions and system requirements. Since the IEEE 802.11ad is leveraged for the ICAS sys-
tem in this work, the optimized waveform has the same characteristics as the IEEE
802.11ad. According to [117], the waveform used in the IEEE 802.11ad system is
show that our proposed approach can improve the joint communication and sensing
The content of this chapter is organized as follows. Sections 4.1 and 4.2 introduce
the ICAS system model and the problem formulation, respectively. Then, the Q-
learning-based and the proposed i-ICS algorithms are presented in Section 4.3. In
Section 4.4, simulation results are analyzed. Finally, we conclude this chapter in
Section 4.5.
4.1 System Model

This work considers a millimeter-Wave Integrated Communication and Sensing (mm-Wave ICAS) system based on the IEEE 802.11ad SC-PHY specification.
Note that this work leverages the preamble in the SC-PHY for the sensing task since it
is similar to those in the Control Physical Layer (C-PHY) and the OFDM Physical Layer
(OFDM-PHY), which are the other two physical layer types in IEEE 802.11ad. Moreover, according to [118], the OFDM-PHY is
obsolete and may be removed in a later revision of the IEEE 802.11ad standard. The
ICAS-AV maintains a data communication link with another vehicle, denoted by AVX and
called the recipient vehicle. Let dX and vX denote the distance and the relative
Figure 4.1 : The ICAS system model in which the ICAS-AV maintains a data
communication with AVX based on IEEE 802.11ad. At the same time, the ICAS-AV
senses its surrounding environment by utilizing echoes of its transmitted waveforms.
speed between ICAS-AV and AVX , respectively. At the same time, the ICAS-AV
gathers echoes of transmitted signals from surrounding targets (e.g., moving vehicles
AV1 , . . . , AVX ) to perform the sensing function, as depicted in Figure 4.1. This work
assumes that time is divided into equal slots. The time slot is small enough so that
the velocities of targets can be considered constant in a time slot [38, 44].
In this work, similar to [38] and [119], the waveform structure is defined by the
number of frames and their locations in the Coherent Processing Interval (CPI). Note
that the maximum target’s relative velocity vmax can only be explicitly estimated
when frames are located at specific locations in the CPI time [38]. Specifically, the
n-th frame is located at nTd (as illustrated in Figure 4.1). Here, n ∈ {0, 1, . . . , N − 1}
and Td is the sub-Doppler Nyquist sampling interval determined by the maximum
Doppler shift ∆fmax = 2vmax/ζ [120]. The rationale of this setting is that preambles of IEEE 802.11ad frames can act analogously to pulses in a pulsed radar system,
in which pulses are repeated after a Pulse Repetition Interval (PRI) [120]. At the
beginning of a time slot, the ICAS-AV decides the mm-Wave ICAS waveform struc-
ture (i.e., the number of frames in the CPI) that will be used to transmit data in
this time slot. Then, it observes feedback from the receiving vehicle AVX (i.e., the
acknowledgment of the frame) and echo signals from its surrounding targets at the end of this time slot.
The ICAS-AV has a data queue with a maximum of Q packets, each with B
Bytes. When a new packet arrives, it will be stored in the data queue if this queue
is not full. Otherwise, this packet will be dropped. The packet arrival is assumed
to follow the Poisson distribution with the mean λ packets per time slot. Note
that the 802.11ad frame has a varying data field; therefore, each frame can contain
one or multiple packets. In the following, we first elaborate the proposed ICAS
transmit and receive signal models, then discuss the sensing processing and ICAS
performance metrics.
varying-length data field. The preamble contains multiple Golay sequences whose
ambiguity function (AF) exhibits a sharp peak along the zero-Doppler axis, making it well suited to the sensing function,
e.g., range estimation and multi-target detection [44]. However, its AF alone is very
limited for velocity estimation within a single frame; therefore, multi-frame processing is proposed to address this problem [38, 44]. By doing so,
the preambles across frames act as radar pulses in a Coherent Processing Interval
(CPI). This work considers an ICAS waveform structure that consists of N IEEE
802.11ad frames in a CPI. The IEEE 802.11ad system can recognize this aggregated waveform.
Notation Description
Q Maximum number of packets in the data queue (packets)
B Size of each packet (Bytes)
N Number of frames in a CPI
n Frame index in the Coherent Processing Interval (CPI)
ζ Wavelength of the communication link (m)
dX Distance to the recipient vehicle AVX (m)
vX Relative speed between the ICAS-AV and AVX (m/s)
vmax Maximum target relative velocity (m/s)
Td Sub-Doppler Nyquist sampling interval (s)
∆fmax Maximum Doppler shift (Hz)
λ Mean packet arrival rate per time slot (packets/time slot)
Gc Large-scale path loss of the data communication link
G_{TX}, G_{RX} Antenna gains of the transmitter and receiver
σ_c^2 Variance of the complex white Gaussian noise
σ_o^b Radar cross-section of a scattering center (m^2)
G_o^b Large-scale channel gain of a scattering center
∆v Velocity resolution (m/s)
δ Velocity measurement accuracy (RMS error) (m/s)
S State space
A Action space
r Immediate reward function
w_1, w_2, w_3 Weights in the reward function r
The maximum target’s relative velocity, denoted by vmax , can only be explic-
itly estimated when frames are located at specific locations in the CPI time [38].
Specifically, the n-th frame is located at nTd (as illustrated in Figure 4.1), where
val with a maximum Doppler shift ∆fmax = 2vmax /ζ [120]. Note that the desired
ICAS system performance can be achieved by optimizing the ICAS waveform pa-
rameters (e.g., the number of frames in the CPI), which will be described in more
details in Section 4.1.3. The transmit signal model is then defined as follows. Let
s_n[k] denote the symbol sequence corresponding to the n-th transmitted frame with K_n symbols. The transmitted signal can then be written as

x(t) = \sum_{n=0}^{N-1} \sum_{k=0}^{K_n-1} s_n[k]\, g_{TX}(t - kT_s - nT_d),
where gT X (t) is the unit energy pulse shaping filter at the transmitter of ICAS-AV
and Ts is the symbol duration. In this study, similar to [38, 123], we consider a
single data stream model where adaptive analog beamforming can be applied at the
antenna arrays to focus energy towards specific directions. This not only improves communication link
quality but also enhances sensing capabilities by increasing the signal-to-noise ratio
(SNR) of received signals from objects of interest. By leveraging the spatial diversity
and increased aperture provided by multiple antenna elements, a system can more
accurately determine the angle of arrival (AoA) and angle of departure (AoD) of sig-
nals, which is useful for tracking or locating objects. Thus, the received signals corresponding to the communication and sensing functions are modeled as follows.

Communication received signal Once the data communication link between AVX and ICAS-AV is established, the large-scale path loss G_c is determined by G_{TX} and G_{RX}, the antenna gains of the transmitter (TX) and the receiver (RX), and the distance d_X. Due to multipath propagation, the received signal consists of multiple reflected
and delayed versions of x(t). Thus, the received communication signal corresponding
y_n^c[k] = \sqrt{G_c} \sum_{p=0}^{P_c-1} \beta_c[p]\, s_n[k-p] + z_n^c[k],    (4.3)
where z_n^c[k] is the complex white Gaussian noise with zero mean and variance σ_c^2, i.e.,
NC(0, σ_c^2), and β_c[p] is the small-scale complex gain of the p-th path. Note that we
assume that β_c[p] is independent and identically distributed (i.i.d.) N(0, σ_p^2), where
\sum_{p=0}^{P_c-1} σ_p^2 = 1, as in [38, 44]. The SNR of the communication channel is defined as
Sensing received signal Similar to [38, 44, 124], this work uses the scattering
center representation to describe the sensing channel. We consider that there are O
range bins, and at the o-th range bin there are Bo scattering centers (i.e., targets).
A scattering center (do , b) can be defined by its distance do , velocity vob , radar cross-
section σob , round-trip delay τo = do /c with c being the speed of light, and Doppler
shift ∆f_o^b = 2v_o^b/ζ [120]. The large-scale channel gain corresponding to a scattering center is given by

G_o^b = \frac{G_{TX} G_{RX} \zeta^2 \sigma_o^b}{64 \pi^3 d_o^4}.    (4.4)
As in [38, 125], only a target whose d_o is large in comparison with the distance change
during the CPI (i.e., d_o ≫ v_o^b T_CPI) is considered, so the small-scale channel gain β_o^b
can be assumed to be constant during the CPI. Thus, the received sensing signal
model corresponding to symbol k in the n-th frame can be given as follows [38]:
y_n[k] = \sum_{o=0}^{O-1} \sum_{b=0}^{B_o-1} E_n^o[k] \sqrt{G_o^b}\, e^{-j2\pi \Delta f_o^b (kT_s + nT_d)} + z_n^s[k],    (4.5)
where z_n^s[k] ∼ NC(0, σ_n^2) is the complex white Gaussian noise of the sensing channel,
and E_n^o[k] is the delayed and sampled Matched Filtering (MF) echo from the o-th
range bin, i.e., E_n^o[k] = \sum_{i=0}^{K_n-1} s_n[i]\, g((k-i)T_s - nT_d - \tau_o), where g(t) = g_{TX}(t) * g_{RX}(t) is the convolution of the transmit and receive pulse shaping filters.
In this study, similar to [38, 119, 125], we assume that the channel is stationary
during the preamble period of a frame due to the small preamble duration. As such,
the received signal model corresponding to the preamble Eto [k] of a frame can be
y_n^t[k] = \sum_{o=0}^{O-1} \sum_{b=0}^{B_o-1} E_t^o[k] \sqrt{G_o^b}\, e^{-j2\pi \Delta f_o^b nT_d} + z_n^s[k].    (4.6)
Note that kT_s can be omitted from the phase shift term in the signal model corresponding to the preamble because the channel is assumed stationary during the preamble duration.
We now discuss sensing signal processing in the ICAS system. Based on the cross-
correlation between the transmitted preamble and the received signal, the ICAS system can detect a target with high probability
(more than 99.99%) and achieve the desired range resolution (i.e., 0.1 m [40]) for
the considered scenarios. Therefore, this work only considers the velocity estimation of the ICAS system, which is more
challenging to obtain a high accuracy than those of target detection and range
estimation processes.
After detecting targets and obtaining the corresponding range bins, the velocity
estimation can be executed as follows. Given the n-th frame received in (4.6),
the sensing channel corresponding to detected targets at the o-th range bin can be
h_o^n = \sum_{b=0}^{B_o-1} u_o^b\, e^{-j2\pi \Delta f_o^b nT_d} + z_o^n,    (4.7)
where u_o^b = \gamma \sqrt{E_s G_o^b} is the signal amplitude, \gamma is the correlation integration gain,
and z_o^n is the complex white Gaussian noise NC(0, σ_n^2). Then, the channel vector cor-
responding to the o-th range bin for the N frames in the CPI, i.e., h_o = [h_o^0, h_o^1, . . . , h_o^{N-1}], can be written as

h_o = D_o u_o + z_o,    (4.8)

where z_o = [z_o^0, z_o^1, . . . , z_o^{N-1}] is the channel noise vector, u_o ≜ [u_o^0, u_o^1, . . . , u_o^{B_o-1}]^T denotes the channel signal amplitude vector, d(v_o^b) ≜ [1, e^{-j2\pi \Delta f_o^b T_d}, . . . , e^{-j2\pi \Delta f_o^b (N-1)T_d}]^T
is the vector of channel Doppler shifts corresponding to the b-th velocity at the o-th range bin, and
D_o ≜ [d(v_o^0), d(v_o^1), . . . , d(v_o^{B_o-1})] is the matrix of channel Doppler shifts.
locity can be estimated based on (4.8) by using Fast Fourier transform (FFT)-based
algorithms that are widely used in the classical radar processing [44, 120].
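The following minimal NumPy sketch illustrates this FFT-based velocity estimation for a single target, using per-frame channel samples of the form in (4.7). The carrier frequency, Td, noise level, and true velocity are illustrative values only, not the thesis settings.

import numpy as np

# Illustrative parameters (not the thesis settings): 60 GHz carrier and a
# sub-Doppler sampling interval Td chosen so that |v| up to 50 m/s is supported.
zeta = 3e8 / 60e9          # wavelength (m)
Td = zeta / (4 * 50.0)     # sampling interval (s)
N = 10                     # frames in the CPI
true_v = 23.0              # target relative velocity (m/s)

# Per-frame sensing channel for one range bin, following the form of (4.7):
# h[n] = u * exp(-j * 2*pi * (2*v/zeta) * n * Td) + noise.
n = np.arange(N)
doppler = 2 * true_v / zeta
h = np.exp(-1j * 2 * np.pi * doppler * n * Td)
h += (np.random.randn(N) + 1j * np.random.randn(N)) * 0.05

# FFT-based estimation: locate the Doppler bin with the largest magnitude.
# Zero-padding refines the peak search; the resolution is still zeta/(2*N*Td).
n_fft = 1024
spectrum = np.fft.fft(np.conj(h), n_fft)   # conjugate flips the phase rotation sign
freqs = np.fft.fftfreq(n_fft, d=Td)        # Doppler grid (Hz)
v_est = freqs[np.argmax(np.abs(spectrum))] * zeta / 2.0
print(f"estimated velocity: {v_est:.1f} m/s (true {true_v} m/s)")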
As discussed in the previous subsection, this work focuses on the target velocity
estimation. Thus, the sensing performance for the ICAS can be determined by the
velocity estimation accuracy (i.e., velocity resolution). For the FFT-based velocity estimation, the velocity resolution is given by

\Delta v = \frac{\zeta}{2 N_f T_d},    (4.9)
where Nf is the number of frames used for the velocity estimation process. Then,
the velocity measurement accuracy can be characterized by the root mean square
error that depends on the SNR of the received sensing signal as follows [40, 126]:

\delta = \frac{\zeta}{2 N_f T_d \sqrt{2\,\mathrm{SNR}_r}}.    (4.10)
Equation (4.10) implies that given a fixed CPI time, a fixed Td , and a constant data
rate, the velocity estimation accuracy increases (i.e., δ decreases) as the value of Nf
increases. Recall that frames are placed at consecutive multiples of Td in the CPI.
Therefore, the last frame's size is larger than those of the others if the total number
of frames is less than the maximum number of frames in the CPI. As such, as the
number of frames in the CPI increases, the size of the last frame decreases since the total CPI duration is fixed.
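As a quick numerical illustration of (4.9) and (4.10), the short Python sketch below evaluates the velocity resolution and RMS accuracy for several values of Nf. The wavelength, Td, and SNR values are placeholders chosen for illustration only.

import numpy as np

def velocity_resolution(zeta, n_frames, Td):
    """Velocity resolution of the FFT-based estimation, Eq. (4.9)."""
    return zeta / (2.0 * n_frames * Td)

def velocity_rms_accuracy(zeta, n_frames, Td, snr_linear):
    """RMS velocity measurement error, Eq. (4.10)."""
    return zeta / (2.0 * n_frames * Td * np.sqrt(2.0 * snr_linear))

# Illustrative values: 60 GHz carrier, Td = 25 us, sensing SNR of 4.5 dB.
zeta = 3e8 / 60e9
Td = 25e-6
snr = 10 ** (4.5 / 10)
for n_frames in (2, 5, 10):
    print(n_frames,
          round(velocity_resolution(zeta, n_frames, Td), 3),
          round(velocity_rms_accuracy(zeta, n_frames, Td, snr), 3))

Both quantities shrink (i.e., improve) as Nf grows, which is exactly the tradeoff discussed above: more frames per CPI yield better sensing but more preamble overhead.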
The communication metrics for the ICAS must represent the data transmission
performance. Two typical communication metrics are the transmission rate and
reliability (e.g., packet loss). Thus, this work considers two metrics, 1) the length of
the data queue representing the efficiency of data transmission, and 2) the number of dropped packets representing the transmission reliability.
Note that unlike the system models in [38, 42, 46] that only consider static environments, this work considers a dynamic environment with
many uncertainties, such as the wireless channel quality and the number of arrival
packets. Compared with [38, 42, 46], such dynamics make the waveform optimiza-
tion problem more challenging. First, the wireless channel is dynamic and uncertain
due to the high mobility of the vehicle, resulting in time-varying successful data
transmission probability. Second, since the transmission demand of the user as well
as the vehicle’s autonomous system may change over times, the number of packet
requests arriving at the system may be different at different times. Therefore, the
data arrival process at the ICAS-AV is also highly dynamic. As a result, due to these dynamics, the optimal waveform configuration cannot be determined in advance.
It is also worth noting that a large frame has a higher drop probability than that
of a smaller frame in the same wireless environment. Thus, using multiple small-size
frames can increase not only the reliability of transmission but also the sensing performance. However, doing so reduces the data transmission efficiency
because it increases the number of preambles that do not contain user data. Con-
sequently, packets pile up at the ICAS-AV, leading to packet drops in the queue. In
other words, the environment conditions (i.e., the wireless channel quality, which can be represented through the packet drop proba-
bility and SNR, and the packet arrival rate) highly influence the ICAS performance
in regard to data transmission reliability and sensing accuracy. Therefore, the ICAS
system needs to obtain an optimal policy that optimizes the ICAS waveform (i.e.,
the number of frames in the CPI) to achieve the desired performance in terms of
both communication and sensing. However, the environment may change significantly from time to time, especially in the ICAS en-
vironment where AVs usually travel. Thus, optimizing the ICAS waveform in each
time slot is an intractable problem. The following sections will describe our pro-
posed MDP framework for the ICAS operation problem that enables the ICAS-AV
to quickly and effectively learn the optimal policy without requiring complete infor-
mation from the surrounding environment, thereby achieving the best performance.

4.2 Problem Formulation

In this section, we formulate the waveform optimization problem as an MDP to deal with the highly dynamic and uncertain nature of the surrounding environment. The
MDP framework can help the ICAS-AV adaptively decide the best action (i.e., the
ICAS waveform structure) based on its current observations (i.e., the current data
queue length and channel quality) at each time slot to maximize the ICAS system
performance without requiring complete knowledge of the packet arrival rate and data
transmission success probability. The proposed MDP is defined by the state space S, the action space A, and the immediate reward function r. The
following sections will discuss more details about the components of our proposed
framework.
As we aim to maximize the performance of the ICAS system with regard to data
transmission efficiency and sensing accuracy, we need to consider the following key
factors. The first one is the current data queue length (i.e., the number of packets
in the data queue) because it reflects the efficiency of the data transmission process.
For example, given a packet arrival rate, the lower the number of packets in the
queue is, the higher the data transmission efficiency is. The second one is the link
quality that can be estimated by using the SNR metric at the ICAS-AV. At the
beginning of a time slot, the link quality is estimated based on the feedback (i.e.,
the recipient vehicle’s ACK frame and targets’ echoes) of the transmitted frame in
the previous time slot. Although it does not represent the instantaneous channel state, this estimate is sufficient for the waveform decision at the current time slot.
In this work, we consider that the channel quality level can be grouped into C
different classes, which are analogous to the Modulation and Coding Scheme (MCS)
levels in IEEE 802.11ad [121]. These classes have different bit error probabilities
[p_1, p_2, . . . , p_C]. Note that given a transmission link with bit error probability p_b,
a longer frame has a higher probability of being dropped. In addition, if a frame drops, all packets in this frame will be lost. To that end, the state space of the system is defined as
S = \{(q, c) : q \in \{0, . . . , Q\};\ c \in \{0, . . . , C\}\},    (4.11)
where q is the current number of packets in the data queue and c is the channel
quality. Here, Q is the maximum number of packets that the data queue can store.
In this way, the system state can be represented by a tuple s = (q, c). By this design,
the ICAS system continuously operates without falling into the terminal state.
As discussed in Section 4.1.3, the ICAS waveform plays a critical role in the
system performance. In particular, given a fixed CPI time T , using a large num-
ber of frames results in a high reliability of data transmission and a high sensing
accuracy. However, it reduces the efficiency of data transfer as there are more over-
head data. At each time slot, the ICAS-AV needs to select the most suitable ICAS
waveform structure (i.e., the number of frames in the CPI) to maximize the system
performance. Thus, the action space is defined as

A = \{1, . . . , N\},    (4.12)
where N is the maximum number of frames in the CPI. Recall that frames begin at consecutive multiples of T_d in the CPI; thus, N = \lfloor T_{CPI}/T_d \rfloor, where T_{CPI} is the CPI time and \lfloor \cdot \rfloor is the floor function. As such,
if the number of frames in the CPI selected by the ICAS-AV is less than N , the
last frame will be longer than others. Note that when the data queue is empty,
the ICAS-AV can still send dummy frames (e.g., frames whose data fields contain no user data) so that the sensing function is maintained.
Since the ICAS system performs two functions simultaneously, i.e., data trans-
mission and sensing, we aim to maximize the ICAS system performance by balancing
the data transmission efficiency and the sensing accuracy. Thus, the reward func-
tion needs to capture both of them. The data transmission efficiency can be defined
according to the number of packets waiting in the queue and the number of dropped
packets. Specifically, the lower the number of packets in the data queue and the
number of dropped packets are, the higher the efficiency of the ICAS system is.
Suppose that at time slot t, the ICAS-AV observes state st and takes action at . Let
qt , δt , and lt denote the current size of the data queue, the sensing accuracy, and the
number of dropped packets that the ICAS-AV observes at the end of t, respectively. The immediate reward is then defined as the negative weighted sum

r_t(s_t, a_t) = -(w_1 q_t + w_2 \delta_t + w_3 l_t),    (4.13)

where w_1, w_2, and w_3 are the weights to trade off between the number of packets
waiting in the queue, the sensing accuracy, and the number of dropped packets due
to the data queue being full. In practice, these weights should be determined carefully
according to the system requirements as well as the wireless channel quality and/or the data arrival rate at the transmitter of an ICAS-
AV. For example, in a system requiring high sensing accuracy, these weights can
be set such that the sensing accuracy contributes the largest share of the reward
function. Note that the value range and unit of each component in the reward
function are different. For example, the number of lost packets is a positive integer,
e.g., 5 packets, while the sensing accuracy can be a fraction, e.g., 0.001 m/s.
Therefore, a higher value of a weight does not guarantee a higher contribution to
the immediate reward. In Section 4.4, experiments are conducted to gain insights
into the impact of weights under various environmental conditions. The negative
function in (4.13) implies that the ICAS-AV should take an action that can quickly
free the data queue, lower the number of dropped packets, and achieve a high sensing
accuracy. Note that the lower the value of δt is, the higher the velocity estimation
accuracy of the system is. Given the above, the immediate reward function (4.13) captures the joint communication and sensing performance of the ICAS.
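The following small Python sketch illustrates a reward of the weighted-sum form in (4.13). The weight values correspond to the W1 setting used later in the evaluation; the function is only an illustrative reading of the formulation, not the exact simulator code.

def immediate_reward(queue_length, sensing_rms_error, dropped_packets,
                     w1=0.05, w2=0.4, w3=0.5):
    """Weighted-sum immediate reward of the form in (4.13).

    All three terms are costs, so the reward is their negated weighted sum:
    an emptier queue, a smaller sensing RMS error, and fewer dropped packets
    all yield a larger (less negative) reward.
    """
    return -(w1 * queue_length + w2 * sensing_rms_error + w3 * dropped_packets)

# Example: 10 queued packets, 0.2 m/s RMS velocity error, 2 dropped packets.
print(immediate_reward(10, 0.2, 2))   # -> -1.58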
The objective of this study is to find an optimal policy for the ICAS-AV that
maximizes the long-term reward function. Let R(π) denote the long-term average
reward function under policy π : S → A, then the problem can be formulated as:
\max_{\pi} \; R(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} r_t\big(s_t, \pi(s_t)\big)\Big],    (4.14)
where π(st ) is the action at time t according to policy π. Thus, given the ICAS-AV’s
current data queue length and wireless channel quality, the optimal policy π ∗ gives
an optimal action that maximizes R(π). In addition, Theorem 4.1 below shows that the average reward function is well defined.
Theorem 4.1. With the proposed MDP framework, the average reward function
R(π) is well defined under any policy π and regardless of a starting state.
Proof. We first prove that the Markov chain of the considered problem is irreducible
as follows. Recall that the state of the ICAS consists of two factors, i.e., the current
queue length q and the wireless channel quality c. For each time slot, the number of packet arrivals
is assumed to follow the Poisson distribution and the channel quality is drawn
according to its probability vector, independently of the current state. Therefore, if the ICAS is at state s at time t, it can move to any other state s′ ∈ S \ {s} after a finite number of
Figure 4.2 : The proposed i-ICS model, in which the ICAS-AV obtains an optimal
policy by gradually updating its policy based on its observations of the surrounding
environment.
time steps. As such, the proposed MDP is irreducible on the state space S,
thereby making the average reward function R(π) well defined under any policy π.

4.3 Reinforcement Learning-based Solutions for ICAS-AV Operation Policy
Due to the highly dynamic and uncertainty of the environment (e.g., packet drop
probability due to the channel quality and the data arrival rate), the ICAS-AV is
unable to obtain this information in advance. In this context, RL can help the
ICAS-AV obtain the optimal policy without requiring complete knowledge about the environment. In particular, we leverage the learning techniques presented in
Section 2.2.4, which can help the ICAS-AV gradually learn an optimal policy through real-time interactions with the environment. The Q-network used in this work consists of
network is built only from traditional simple component, e.g., neurons with Tanh
on AVs that are equipped with sufficient computing resources. It is also important
to mention that the decision of action for the ICAS-AV is an output of the Q-
network after feeding the state to the Q-network. Therefore, the decision time (i.e.,
the inference time of the Q-network) is very marginal. It is worth noting that since
i-ICS leverages DNNs, which are non-linear approximators, its optimality cannot be theoretically guaranteed. Nevertheless, since i-ICS is
based on D3QL, the discussion of its optimality and computational complexity was
provided in Section 2.2.5. In practice, a few deep learning applications have been
deployed in AVs such as Tesla Autopilot [106] and ALVINN [107]. Thus, they are
clear evidence of the effectiveness and efficiency of deploying machine learning on AVs.

4.4 Performance Evaluation

In our simulations, we consider three types of objects: (i) the ICAS-AV, which is an autonomous vehicle equipped with
802.11ad based ICAS, (ii) a receiving vehicle AVX that maintains communication
with the ICAS-AV, and (iii) a target AV1 moving in the vicinity of the ICAS-AV,
as illustrated in Figure 4.1. Specifically, in the simulation, the relative velocity
of the target is selected randomly between −45 m/s and 50 m/s, similar to the settings
in [38]. As such, the maximum relative velocity between the ICAS-AV and a target is
50 m/s (i.e., 111.847 miles/h). Therefore, this setting is appropriate for practical
scenarios. Note that the proposed ICAS system uses mmWave transmission, which typically requires a Line-of-Sight
(LOS) scenario, similar to what is considered in related works [38, 42, 119]. The CPI
time TCP I is set to 10Td , meaning that the maximum number of frames in the CPI
is 10, similar to that in [38]. Note that one frame in the CPI can contain multiple packets.
Recall that the data queue is used to store data packets when the amount of arrival
data at the ICAS transmitter exceeds the ICAS maximum transmission rate. All
packets’ sizes are assumed to be equal to 1500 Bytes, which is the typical value of
the maximum transfer unit (MTU) in WiFi networks and the Internet [128]. Other
parameters of the ICAS system are set based on the IEEE 802.11ad standard. The environment parameters are highly dynamic
and unknown to the ICAS-AV in advance. The detailed settings of the environment are
as follows. First, as the wireless channel is dynamic and uncertain, the successful
data transmission probability varies over time according to the channel condition.
Thus, the packet error ratio PER can represent the Channel Quality Indicator. We
consider that the ICAS system operates at Modulation and Coding Scheme 3 (MCS-
3) of IEEE 802.11ad SC-PHY mode [121]. Note that the IEEE 802.11ad standard
requires the PER of MCS-3 to be less than or equal to 1% and the SNR ≈ 1.5 dB
so that the system can normally perform [121]. In practice, the PER depends on
many factors, such as wireless channel quality, modulation technique, and trans-
mit power [127]. Therefore, for demonstration purposes, we consider three chan-
nel levels with different values of PER and SNR: (i) level-1 with PER = 10% and
SNR = −1.5 dB, (ii) level-2 with PER = 1% and SNR = 1.5 dB, and (iii) level-3
with PER = 0.3% and SNR = 4.5 dB. Based on these quality levels, we assume that
the wireless channel can fall into one of three types. The first one is the poor channel condition, in which the probabilities of the channel quality being at level-1, level-2, and
level-3 at a time slot are given by the probability vector ppc = [0.6, 0.2, 0.2]. Specifi-
cally, for the poor channel condition, the probability of the channel quality at level
1 at a time slot is 60%, and so on. Besides the poor channel, the other types correspond
to normal and strong channels whose probability vectors are pnc = [0.2, 0.6, 0.2]
and pgc = [0.2, 0.2, 0.6], respectively. Second, the packet arrival is also dynamic;
it is assumed to follow the Poisson distribution with the mean λ packets
per time slot. Our proposed framework will be evaluated under these channel con-
ditions (i.e., poor, normal, and strong channel) with different values of the mean
data arrival rate λ. Moreover, we further evaluate our approach with two sets of
weights for the immediate reward function, i.e., {w1 = 0.05, w2 = 0.4, w3 = 0.5} and
{w1 = 0.025, w2 = 0.8, w3 = 0.5}. Note that the above settings are just for simulation purposes. Our proposed learning algorithm (i.e., i-ICS) does not require these
parameters in advance and can adapt to them through real-time interactions with the environment.
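A minimal sketch of how such a simulated environment could draw its random quantities each time slot is given below; it only illustrates the channel-level sampling and Poisson packet arrivals described above, and all variable names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Channel-quality levels as (PER, SNR in dB) and the per-condition
# probability vectors described in the text.
LEVELS = [(0.10, -1.5), (0.01, 1.5), (0.003, 4.5)]
P_CHANNEL = {"poor":   [0.6, 0.2, 0.2],
             "normal": [0.2, 0.6, 0.2],
             "strong": [0.2, 0.2, 0.6]}

def step_environment(condition="normal", lam=14):
    """Draw one time slot of the simulated environment."""
    level = rng.choice(3, p=P_CHANNEL[condition])   # channel quality index
    per, snr_db = LEVELS[level]
    arrivals = rng.poisson(lam)                     # new packets this slot
    return level, per, snr_db, arrivals

print(step_environment("poor", lam=8))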
The parameters for the proposed learning algorithms are set as follows. Specifically, the ϵ-greedy policy is adopted for both
algorithms: the agent takes a random action with probability ϵ and an
action based on its current policy with probability 1 − ϵ. Unlike rule-based scenarios
where agents have predefined policies, the RL agent does not have prior knowledge of the environment; it explores the environment
by taking actions randomly at the beginning. Then it adjusts its policy based on the
selected action's feedback, i.e., the reward for performing this action. Therefore, the
agent can gradually adjust the probability of selecting a random action to obtain an
optimal policy that maximizes the system performance [129]. However, due to the environment's dynamics, the probability of selecting a
random action should not be zero since the RL agent needs to keep updating its policy according
to changes in the environment. Given the above, in this work, ϵ is set to 1 at the
beginning and then linearly decreases to 0.1 after 8 × 10^5 iterations, which is similar to the settings used in related works.
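The linear ε schedule described above can be written as a one-line helper; the following sketch is only illustrative.

def epsilon(step, eps_start=1.0, eps_end=0.1, decay_steps=8e5):
    """Linearly anneal the exploration rate from eps_start to eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# epsilon(0) -> 1.0, epsilon(4e5) -> 0.55, epsilon(8e5) and beyond -> 0.1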
Other parameters for the Q-learning-based and the i-ICS algorithms are set as
typical values, as in [31, 33, 34]. Specifically, the discount factor η is 0.9 for both the
Q-learning-based algorithm and the i-ICS. For the Q-learning based approach, the
learning rate is 0.1. For the proposed i-ICS, the Adam optimizer is used to train
the Q-network with a learning rate of 10^{-4}, and the target Q-network's parameters are periodically cloned from the Q-network. The learning process consists
of two main entities, i.e., the ICAS-AV agent and the environment. At the begin-
ning of time slot t, the agent observes the current state, i.e., st = (qt , ct ), where
qt is the current number of packets in the data queue and ct is the channel qual-
ity. As described in Algorithm 2, the ICAS-AV agent maintains two DNNs (i.e.,
the Q-network and the target Q-network) built based on the Pytorch framework.
Upon receiving the state, the agent feeds the current state to the Q-network to
obtain an action at (i.e., the number of frames in the CPI) at this time slot. After
the agent performs the selected action, the environment calculates an immediate
reward rt given by Eq. (4.13). Based on this information, the environment obtains (i) the number of arrival packets by drawing from the Poisson distribution with
the mean λ and (ii) the channel quality that is randomly generated based on the
probability vector, e.g., ppc = [0.6, 0.2, 0.2] for the poor channel condition. Then,
the environment updates the number of packets in the data queue and constructs
a next state st+1 = (qt+1 , ct+1 ). After that, the environment sends the next state
and the reward to the agent. Upon receiving them, the agent stores the experience
(st, at, rt, st+1) in the experience buffer and performs a learning update as described in Algorithm 2. The target Q-network is updated every 10^4 time steps by cloning
the Q-network's parameters. This process is repeated until the agent converges to an optimal policy.
Recall that the ICAS-AV does not have any prior information
about its surrounding environment’s uncertainties and dynamics, e.g., the packet
drop probabilities and packet arrival rate. Therefore, the proposed solutions are
compared with two baseline policies: 1) a greedy policy where the ICAS-AV selects
an action to maximize the reward function without caring about the uncertainties
and dynamics of the environment and 2) a deterministic policy in which the ICAS-
AV always sends ndp frames in the CPI time. Here, we set ndp to be a half of N
(i.e., 5) to demonstrate the ICAS's average performance when the number of frames
is fixed. We do not consider optimization-based methods (e.g., [38, 119]) as baselines since they require complete information about
the environment in advance. In the following, we first evaluate the convergence rates of our proposed approaches, i.e., the Q-learning based algorithm and i-ICS. We
then study the influences of several key factors (e.g., the packet arrival rate, wireless
channel quality, and weights in the immediate reward function) on the performance
Figure 4.3 illustrates the convergence rates of our proposed algorithms for the
ICAS system, i.e., the Q-learning and the i-ICS. Here, we compare their performance
in the normal channel, and the mean number of arrived packets λ is set to 14. It
can be observed that the i-ICS achieves a superior result in terms of average reward
compared with that of the Q-learning. Specifically, at the beginning of the learning
process, the Q-learning and i-ICS obtain similar results. However, after 2×104
[Figure 4.3: Average reward versus learning iterations (in units of 2×10^3) of the Q-learning based algorithm and the proposed i-ICS.]
iterations, the i-ICS’s average reward is 20% greater than that of the Q-learning.
Then, the i-ICS eventually converges to the optimal policy after 4.5×10^4 iterations, while Q-learning still struggles with a mediocre policy. Under the optimal policy obtained by i-ICS, the ICAS system thus achieves a considerably higher average reward than with the Q-learning based algorithm.
We then evaluate the robustness of our proposed approach, i.e., i-ICS, by varying
the mean number of packets arrived at a time slot λ from 2 to 20. The learned policies
of Q-learning and i-ICS are obtained after 2×105 training iterations. To evaluate
the performance of the considered ICAS system, we consider four metrics, including
(i) the average cost, which is the negation of the average reward, (ii) the average
queue length, (iii) the average velocity estimation accuracy, i.e., sensing accuracy,
and (iv) the average packet drop. The reason for introducing the average cost is
to make the demonstration consistent across system performance metrics (i.e., the
smaller the cost is, the better the system performance is). Note that the average cost jointly reflects the communication and sensing metrics through the weights in (4.13).
We first set the wireless channel to normal quality. The weights of the immediate
reward function are given by the weight vector W1 = [0.05, 0.4, 0.5], i.e., w1 = 0.05, w2 = 0.4, and w3 = 0.5; the results are shown in Figure 4.4.
Figure 4.4 : Varying the data arrival rate λ under normal channel condition, i.e.,
pnc = [0.2, 0.6, 0.2], with the weight vector W1 = [0.05, 0.4, 0.5].
Clearly, the average costs of all policies increase as the packet arrival rate increases, i.e., as λ increases from 2 to
20, as shown in Figure 4.4 (a). This stems from the fact that, given a data queue
with fixed capacity and the ICAS system operates under the same environment’s
characteristics, the higher the value of λ, the higher the number of dropped packets
due to the full packet queue. Indeed, Figures 4.4 (b) and (d) clearly show that when
the packet arrival rate increases, the average queue length and average packet drop
increase for all policies. It can be observed that our proposed algorithm (i.e., i-ICS)
always maintains the lowest average queue length compared with those of the other policies. Similarly, i-ICS has the lowest average packet
drop regardless of the packet arrival rate, and it consistently maintains the average packet drop at a low level even at high arrival rates.
Regarding the sensing metric, i-ICS and Q-learning achieve the highest sensing
Figure 4.5 : Varying the data arrival rate λ under normal channel condition, i.e.,
pnc = [0.2, 0.6, 0.2], with the weight vector W2 = [0.025, 0.8, 0.5].
accuracy (i.e., the lowest value of average velocity estimation accuracy) when λ is
less than 12. Whereas their average sensing accuracy results are not good compared
to other policies if λ is larger than 12. The reasons are as follows. When the packet
arrival rate is low (i.e., λ < 12), the average numbers of packets in the queue of all
policies never pass 30% of the data queue capacity, as shown in Figure 4.4 (b). Thus,
the ICAS-AV can increase the number of frames in the CPI, meaning a decrease in
the number of packets sent in the CPI, to achieve a higher sensing performance without worrying
about packet loss due to a full queue. Figure 4.4 (c) clearly shows that the Q-learning
and i-ICS can learn this strategy to obtain the best performance in terms of average sensing accuracy.
As λ increases from 12 to 20, the average queue lengths of the greedy and deter-
ministic policies quickly reach the maximum number of packets that can be stored
in the data queue, i.e., 50, as shown in Figure 4.4 (b). This can lead to a high
Figure 4.6 : Varying the data arrival rate λ under poor channel condition, i.e.,
ppc = [0.6, 0.2, 0.2], with the weight vector W1 = [0.05, 0.4, 0.5].
possibility of packet drop due to the full data queue. Interestingly, sensing perfor-
mance of Q-learning and i-ICS policies decreases to the worst at λ = 16 and λ = 18,
respectively. Then, they manage to increase the sensing accuracy when the packet
queue is mostly always full at λ = 20. The reason is that the data transmission
efficiency is unable to be improved because of a very high packet arrival rate that
the system cannot handle. Thus, it might be better to improve the sensing ac-
curacy instead of communication efficiency. On the other hand, since the greedy
and deterministic policies do not consider the uncertainty of the environment (e.g.,
the packet drop possibility), they can maintain better sensing accuracy when the
packet arrival rate is high. However, their transmission efficiency is very low. As
can be seen in Figure 4.4 (d), the average numbers of packet drops of the greedy
and deterministic policies are up to 50% higher than that of our proposed learning
algorithm, i.e., i-ICS. Thus, the proposed algorithm can help the ICAS-AV to ob-
Figure 4.7 : Varying the data arrival rate λ under poor channel condition, i.e.,
ppc = [0.6, 0.2, 0.2], with the weight vector W2 = [0.025, 0.8, 0.5].
tain an optimal policy that strikes a balance between sensing and data transmission
metrics, thereby achieving the best overall system’s performance compared with
those of other policies. Although i-ICS and Q-learning experience a similar trend, i-ICS consistently performs better; this
stems from the fact that i-ICS can effectively address the high-dimensional state space of
such a complicated problem.
Interestingly, although i-ICS always achieves the lowest cost when λ increases
from 2 to 20, as shown in Figure 4.4 (a), its sensing performance is not good as
those of greedy and deterministic policies when λ > 12, as shown in Figure 4.4 (c).
The reason is that the proposed reward function defined in Eq. (4.13) is a weighted
sum of communication and sensing metrics. Specifically, the higher the reward
obtained by the ICAS system is, the better performance the ICAS system achieves.
Since the cost is the negation of the reward, the ICAS system performs better if its
Figure 4.8 : Varying the data arrival rate λ under strong channel condition, i.e.,
pgc = [0.2, 0.2, 0.6], with the weight vector W1 = [0.05, 0.4, 0.5].
cost is lower. However, the cost only captures the joint/overall performance of the
communication and sensing functions. Therefore, one of these functions may perform worse individually even though the joint performance remains the best.
Next, we investigate how the immediate reward function’s weights can influence
the system performance by changing the weight vector to W2 = [0.025, 0.8, 0.5] and
varying the packet arrival rate. In Figure 4.5, it can be observed that the results
of deterministic policy are mostly unchanged, except the average cost result, when
changing these weights because the ICAS-AV’s environment is still the same as the
previous experiment, and this policy does not rely on the immediate reward function. As
the weight of sensing metric (i.e., w2 ) is doubled, i-ICS, Q-learning, and greedy
policies achieve sensing accuracy results that are much better than those in the
previous experiment. In addition, except for the Q-learning, they also consistently
Figure 4.9 : Varying the data arrival rate λ under strong channel condition, i.e.,
pgc = [0.2, 0.2, 0.6], with the weight vector W2 = [0.025, 0.8, 0.5].
policies’ data transmission metrics (i.e., the average packet drop and average queue
length) become worse than those in the first experiment, as shown in Figures 4.5
(b) and (d). The reason is that when the ratios w1 /w2 and w3 /w2 become smaller,
the ICAS system pays more attention to the sensing accuracy. Thus, Figure 4.5
clearly shows that, in practice, these weights can be adjusted so that our proposed
learning algorithm can obtain a policy that fulfils different requirements of an ICAS
system at different times. Thanks to the ability to learn without requiring complete
information of the surrounding environment, i-ICS still achieves the best overall performance.
Next, we evaluate the proposed approach under two other channel qualities, i.e., (i) poor quality with the PER probability vector ppc =
[0.6, 0.2, 0.2] and (ii) good quality with the PER probability vector pgc = [0.2, 0.2, 0.6].
To do so, we vary the packet arrival rate λ. For each of these channel qualities, two
sets of results are collected according to W1 and W2 , as shown in Figures 4.6 to 4.9.
Overall, all policies’ results experience similar trends as those in normal channel
quality. It can be observed that the channel quality significantly affects the joint
communication and sensing performance. Specifically, when the channel quality changes from poor to normal and then to good, the overall
system performance increases regardless of the weight vector. The reason is that as
the channel quality becomes better, the packet drop probability decreases, and thus the overall system
performance improves.
As can be observed in Figures 4.6 to 4.9, when the mean packet arrival rate is small, i.e., λ ≤ 12, the performance of
i-ICS increases as the channel changes from poor to good quality for both W1 and
W2. Interestingly, when λ is larger than 12, the sensing performance of i-ICS behaves
differently under the weight vectors W1 and W2. Specifically, with W1, the sensing performance of i-ICS varies consistently with the channel quality as it changes
from poor to good quality. Whereas, with W2, the i-ICS achieves the highest and
lowest sensing performance under the normal channel and strong channel. Thus,
sensing performance depends on not only the channel quality but also the immedi-
ate reward function’s weights and the mean packet arrival rate λ. As such, better
channel quality does not guarantee better sensing performance. Note that the pro-
posed approach in this study aims to maximize the joint communication and sensing
the performance of one function (e.g., sensing) is a bit lower, our proposed i-ICS
In summary, i-ICS achieves the highest overall performance boost among the
considered policies when channel quality changes from poor to good. For instance,
with W1 , i-ICS’s average cost decreases up to 51.7% while those of the greedy and
4.5 Conclusion 114
can effectively adapt its behaviour according to the changes in its surrounding en-
4.5 Conclusion
In this chapter, we have first developed an MDP framework that enables the ICAS-AV to dynamically select its waveform structure
based on the observations to maximize the overall performance of the ICAS system.
Then, we have proposed an advanced learning algorithm, i.e., i-ICS, that can help
the ICAS-AV gradually learn an optimal policy through interactions with the surrounding environment without requiring complete information about it
in advance. As such, our proposed approach can effectively handle the environment's
dynamics and uncertainty as well as the high-dimensional state space of the
underlying MDP framework. The extensive simulation results have clearly shown
that the proposed solution can strike a balance between communication efficiency
and sensing accuracy, thereby achieving the best overall performance in different scenarios.
Chapter 5
This chapter presents MetaSlicing, our proposed resource allocation framework for the Metaverse.
Metaverse applications require enormous resources that have never been seen before, especially
computing resources for intensive data processing to support Extended Reality, enormous storage resources, and massive networking resources for maintaining
ultra high-speed and low-latency connections. Therefore, this work aims to propose a novel framework, namely MetaSlicing, that can provide a highly effective
and comprehensive solution for managing and allocating different types of resources
for Metaverse applications. Observing that applications may have common functions, we first propose grouping applications into clusters so that common functions can be shared among
applications. As such, the same resources can be used by multiple applications simultaneously. Moreover, to address the
real-time characteristics and the resource demands' dynamics and uncertainty in the Metaverse, we formulate the admission control problem as a semi-Markov decision process
and propose an intelligent admission control algorithm that can maximize resource utilization. The simulation
results show that our proposed solution outperforms the Greedy-based policies by up
to 80% and 47% in terms of long-term revenue for Metaverse providers and request acceptance rate, respectively.
This chapter is structured as follows. Section 5.1 introduces the system model and the proposed MetaSlicing framework.
[Figure 5.1 diagram: End Users, Metaverse Tenants, and the Metaverse Infrastructure Service Provider; the Admission Management block (MetaSlice Analyzer, Similarity Analysis, Admission Controller) and the Resource Management block (Resource Allocation, Resource Grouping, Resource Availability) operate over decomposed MetaSlices whose functions are distributed across tiers and partly shared.]
Figure 5.1 : The system model of the proposed MetaSlicing framework. In this
framework, different resource types in different tiers can be used and shared to
create Metaverse applications (i.e., MetaSlices).
The MetaSlicing admission control formulation is presented in Section 5.2. After that, the Metaverse application
analysis and the proposed deep reinforcement learning-based algorithm are discussed
in Section 5.3. In Section 5.4, simulation results are analyzed. Finally, Section 5.5 concludes the chapter.

5.1 System Model
In this work, we consider a system model including three main parties, i.e., (i)
End-users, (ii) Metaverse tenants, and (iii) the Metaverse Infrastructure Service
Provider (MISP). A Metaverse tenant will request a MetaSlice from the MISP according to its sub-
scribed users’ demands. If a request is accepted, the MISP will allocate its resources
to initiate this MetaSlice. In the following, we explain the main components together with the operation of the proposed framework, starting with the multi-tier architecture for the Metaverse.
As discussed in the previous section, although the centralized cloud can be used to provision Metaverse applications, it may cause
congestion in the network and a single point of failure. In the literature, most existing
works related to the Metaverse resource management consider a single-tier edge com-
puting architecture [54, 56, 57]. However, the single-tier edge architecture may not
be appropriate and effective since the edge capacity is often limited while Metaverse
applications demand enormous resources. Moreover, in single-tier edge computing, Metaverse applications are likely created near the user's
location so that the user’s QoE requirements may not be satisfied if the user moves
far away.
To address this, a multi-tier architecture offers several benefits for Metaverse applications. First, this architecture alleviates the extensive resource demands of Metaverse
applications for both the Metaverse tenants and end-users. Second, this architec-
ture enables the distribution of different types of resources, e.g., computing, storage,
networking, and communication capabilities, along the path from end-users to the
cloud. By doing so, Metaverse applications can leverage the resources placed near
end-users, resulting in a low delay and high QoS for users. Third, when a user moves
to a new location, these applications can be migrated to a site near the user’s new
location to maintain the QoE. Thus, distributing resources increases the resilience of the whole
architecture. Although the multi-tier architecture is a promising approach to deploy Metaverse applications, it still faces several challenges. First, recall that Meta-
verse applications often require ultra-low latency. As a result, they are supposed to be created at the edge servers that are
placed near users, i.e., tier-1, thereby possibly causing overload at low tiers. Note
that the resource capacity at low tiers is often much lower than those at high tiers,
e.g., cloud. As such, it may result in high latency in both computing and trans-
mission or even interrupting services, thus reducing users’ QoE since latency is one
of the most important factors to guarantee QoE in Metaverse. Second, a user can
physically move while using MetaSlices, and the latency requirement may not be
guaranteed if they move too far from the place where ongoing applications are cre-
ated. This issue can be mitigated by migrating these MetaSlices to a new place near
the user’s new location. However, the migration may introduce long delay due to
the transmission of user’s data and the application re-initialization. In the following
virtual worlds, and each virtual world can be regarded as an application for a spe-
cific purpose. Importantly, an application may have several functions that can operate independently. For ex-
ample, a tourism application may have a recommendation, digital map, and driving
assistant functions. Based on the user’s location, the recommendation function can
suggest nearby attractive places, and then the user can use the digital map function
to get more information about these places. After that, the user can use the driving
assistant function to get to the chosen place. Since these functions can be used
or unused depending on different users, they can run separately without interfering with each other. Therefore, Metaverse applications can be designed in
a modular way by which their functions can be initialized and operated indepen-
dently. By doing so, these independent functions can be connected to each other
via well-defined interfaces (e.g., APIs), and they can even connect to current online services, e.g., the Google Maps API. Given the above, this
work considers that a MetaSlice, i.e., an application, can be decomposed into multiple independent functions.
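As a purely hypothetical illustration of such a decomposition, the snippet below represents a travel MetaSlice as a set of independent functions annotated with per-type resource demands and preferred tiers; all function names and numbers are invented for illustration.

# A hypothetical decomposed "travel" MetaSlice: each independent function is
# annotated with its resource demand (CPU cores, storage GB, bandwidth Mbps)
# and a preferred tier (1 = edge near users, higher tiers = closer to the cloud).
travel_metaslice = {
    "class_id": 2,
    "functions": {
        "real_time_traffic": {"cpu": 4, "storage": 10,  "bw": 50, "tier": 1},
        "driving_assistant": {"cpu": 8, "storage": 5,   "bw": 80, "tier": 1},
        "digital_map":       {"cpu": 2, "storage": 200, "bw": 10, "tier": 3},
        "weather":           {"cpu": 1, "storage": 2,   "bw": 5,  "tier": 2},
    },
}

def total_demand(metaslice):
    """Aggregate the per-type resource demand of a decomposed MetaSlice."""
    totals = {"cpu": 0, "storage": 0, "bw": 0}
    for spec in metaslice["functions"].values():
        for resource in totals:
            totals[resource] += spec[resource]
    return totals

print(total_demand(travel_metaslice))   # {'cpu': 15, 'storage': 217, 'bw': 145}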
Note that this work does not focus on optimal application decomposition, and we
assume that applications can be decomposed by using existing methods, e.g., [130,
131]. We consider that each function is allocated dedicated resources for the following reasons. First, a function under
the dedicated resource allocation scheme likely executes faster than the same function
under the dynamic scheme. This is because, under the dynamic resource alloca-
tion scheme, resources are only provisioned on demand, and the resulting provisioning delay may degrade the QoS of Metaverse users. In addition, because resources are not
reserved for each function in a dynamic allocation scheme, a function may not have
sufficient resources when its hosting server is overloaded by other functions, leading to a high delay or even service disruption.
This problem can be alleviated by migrating the function to another server with
sufficient resources; however, the migration itself interrupts the function's execution. In this context, dedicated resource allocation can avoid this issue by
reserving resources for each function. Moreover, the study in [132] points out that
under a typical load of business applications, the dedicated resource allocation can serve applications more efficiently than dynamic allocation.
In practice, the MetaSlice decomposition can offer numerous benefits for deploying and managing Metaverse applications. First, since functions are independent
entities and are connected via APIs, they can be developed and upgraded independently. Second, a Metaverse tenant can focus on its core functions
rather than putting its resources on other functions (e.g., driving assistant or
digital map) that can be effectively developed by specialized third parties. Third,
the MetaSlice decomposition can help to manage MetaSlices more conveniently, since their functions can be monitored and maintained separately as independent entities. For example, a MetaSlice for travel may consist of
several major functions, such as a digital map, real-time traffic, real-time weather,
and a driving assistant. In this case, these functions can be placed dynamically in
different tiers according to their latency requirements. Specifically, real-time traffic and driving assistant functions can be placed at
tier-1 near end-users since they require low delay, while a digital map (that does not
need frequent updates) can be located at a high tier, e.g., at the cloud. Finally, this
technique can also alleviate the application migration problem in a multi-tier resource architecture, since only the latency-sensitive functions need to be migrated when a user moves, thereby reducing the migration
delay. Thus, the application decomposition technique offers a flexible and effective way to deploy Metaverse applications. Each MetaSlice is associated with a blueprint (MetaBlueprint) that describes the required functions
and structure for initializing and managing this MetaSlice during its life cycle. From
Table 5.1 : Comparison between MetaSlicing, network slicing, and virtualized net-
work function allocation.
this perspective, a MetaSlice is analogous to the network slice paradigm in the fifth generation of cellular networks
(5G) that consists of multiple network functions [135]. It is worth mentioning that
slicing is a general term describing virtualization methods that can partition and
share physical resources; MetaSlicing and network slicing may thus look similar due to their names, but they are essentially different, as shown in Table 5.1.
MetaSlicing enables multiple “Metaverses” that can be accessed and interacted with by users through a variety of devices, where each MetaSlice is a self-contained virtual environment that can be customized to suit the needs of its users.
MetaSlicing thus aims to provide immersive virtual experiences for users. On the other hand, network slicing refers to the
creation of multiple virtual networks (i.e., network slices) that can coexist on a single
physical network infrastructure. Each network slice can be customized to meet the needs of its users, with its own set of network functions and resources. Network slicing thus enables
the creation of specialized and dedicated network services for different applications and services.
In contrast, MetaSlicing decomposes each Metaverse application into independent functions and then optimally distributes them at different tiers. Then, a
number of functions can be shared among MetaSlices, thus improving resource uti-
lization.
5.1.1.3 MetaInstance
This subsection introduces the concept of MetaInstance that can improve system resource utilization. Suppose that the Metaverse may
consist of different MetaSlice types, e.g., tourism, education, industry, and navi-
gation. In addition, MetaSlices can also be grouped into I classes based on their
QoS requirements. For example, a navigation MetaSlice may have driving assistant functions requiring ultra-
low latency and highly-reliable connections, while security and resilience are among
the top concerns of e-commerce and industry MetaSlices. Moreover, different types
of MetaSlice may share the same functions. For instance, tourism and naviga-
tion MetaSlices can use the same underlying digital map and the real-time traffic
and weather functions whose data are collected by the same perception network,
e.g., IoT. Furthermore, we can observe that different Metaverse tenants can cre-
ate/manage multiple variants from the same type of MetaSlice (e.g., education, in-
dustry, or navigation). Therefore, ongoing MetaSlices may share the same functions.
In this case, a lot of resources can be shared, leading to greater resource utilization
and higher revenue for the Metaverse Infrastructure Service Provider (MISP).
Based on this fact, Metaverse applications can be classified into groups, namely
MetaInstances. A MetaInstance can be defined by two function types, i.e., (i) shared
functions and (ii) dedicated functions, as illustrated in Figure 5.1. In this case, a MetaInstance can maintain a function configuration containing information about its functions and the MetaSlices using them. Note that since
the capability of a function is limited, sharing a function among too many MetaSlices
may lead to a decrease in user experience (e.g., processing delay) or even service
disruption. In some aspects, a MetaInstance can be similar to the Network Slice Instance (NSI), where network functions can be shared among network slices.
It is worth mentioning that even though the MetaSlicing may be similar to the
network slicing [135], they are actually not the same. In particular, network slic-
ing aims to address the diversity (or even conflict) in communication requirements
among various businesses by running multiple logical networks (i.e., Network Slices)
over a physical network. For example, one type of customer (e.g., automotive customers) may require ultra-reliable low-latency connections, while others may require high data rates. In contrast, MetaSlicing focuses on decomposing each Metaverse
application into functions and then optimally distributing them at different tiers.
In addition, the dedicated resource allocation considered in MetaSlicing is different from that of the network slicing, where dynamic resource allocation approaches are typically adopted.
Based on the aforementioned analysis, we can observe that, on the one hand,
our proposed MetaSlicing framework can offer a great solution to the MISP by
maximizing the resource utilization and at the same time minimizing the deployment
cost and initialization time for Metaverse’s applications. On the other hand, this
framework can also benefit end-users by achieving greater user experience, e.g., lower
delay and more reliable services. To achieve these results, the Admission Controller
that share some functions with the ongoing MetaSlices may help the system to save
more resources than accepting those that share fewer or no functions with the
ant sends a MetaSlice request associated with MetaBlueprint to the MISP. Then,
shown in Figure 5.1, the Admission Management block of the MISP consists of a
quest, the MetaSlice Analyzer will analyze the request’s MetaBlueprint to determine
similarities between the functions and configuration of the requested MetaSlice and
consists of at least (i) the function configuration record, including the list of required
functions and the description of interactions among them, and (ii) the MetaSlice con-
figuration (e.g., class ID and required resources). Based on the similarity analysis
(to be presented in Section IV) obtained from the MetaSlice Analyzer and the cur-
rently available resources of the system, the Admission Controller decides whether
cates the accepted MetaSlice to a MetaInstance with the highest similarity index and
dedicated functions, system resources are allocated to initiate these new functions.
If the current MetaInstances do not share any function with the new MetaSlice,
a new MetaInstance is created for the accepted MetaSlice. When a MetaSlice de-
parts/completes, its resources will be released, and the MetaInstance will be updated
accordingly.
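To make the admission workflow above concrete, the following minimal Python sketch illustrates how an accepted MetaSlice could be attached to the most similar MetaInstance or trigger the creation of a new one, and how resources are released upon departure. The class, the similarity measure, and the policy callable are illustrative assumptions, not the exact implementation used in this chapter.

    from dataclasses import dataclass, field

    @dataclass
    class MetaInstance:
        functions: set                                 # functions deployed in this MetaInstance
        members: list = field(default_factory=list)    # ongoing MetaSlices it hosts

    def similarity(request_functions, instance):
        """Fraction of requested functions already deployed in the instance."""
        if not request_functions:
            return 0.0
        return len(request_functions & instance.functions) / len(request_functions)

    def admit(request_functions, required, available, instances, policy):
        """Analyze similarity, let the (learned) policy accept/reject, then attach
        the accepted MetaSlice to the MetaInstance with the highest similarity."""
        best = max(instances, key=lambda ins: similarity(request_functions, ins),
                   default=None)
        j = similarity(request_functions, best) if best else 0.0
        if not policy(available, required, j):      # policy rejects the request
            return None
        if best is None or j == 0.0:                # no shared function: new MetaInstance
            best = MetaInstance(functions=set())
            instances.append(best)
        best.functions |= request_functions - best.functions   # deploy dedicated functions
        best.members.append(required)                           # record occupied resources
        return best

    def release(instance, required, instances):
        """Called when a MetaSlice departs: free its resources and update the instance."""
        instance.members.remove(required)
        if not instance.members:
            instances.remove(instance)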
In this work, we consider D resource types owned by the MISP, e.g., comput-
ing, networking, and storage. Then, the required resources for a MetaSlice m can
the amount of type d resources. In this case, the total resources occupied by all running MetaSlices must satisfy
$$\sum_{m \in \mathcal{M}} n_m^d \leq N^d, \quad \forall d \in \{1, \ldots, D\}, \qquad (5.1)$$
where $N^d$ is the total amount of type d resources of the MISP, and $\mathcal{M}$ is the set of all
running MetaSlices in the system. In our proposed solution, the system’s available
resources and the required resources for the request are two crucial factors for the
process and its required resources are likely unknown in advance. In addition, the
departure process of MetaSlices (i.e., how long a MetaSlice remains in the system) is
also highly dynamic and uncertain. Therefore, in the next section, we will introduce
framework to address the MetaSlice admission control problem due to the follow-
ing reasons. First, the sMDP can enable the MetaSlicing’s Admission Controller
to adaptively make the best decisions (i.e., whether to accept or reject a MetaSlice
request) based on the currently available system resources (i.e., computing, net-
working, and storage) and the MetaSlice request’s blueprint (i.e., resource, class and
ment (e.g., arrival and departure processes of MetaSlices) to maximize the MISP’s
long-term revenue. Second, in practice, MetaSlice requests can arrive at any time,
so the admission decision needs to be made as soon as possible. As such, the de-
cision epoch in the considered problem changes rapidly. However, the conventional
Markov Decision Process (MDP) only takes an action at each time slot with an equal
time period, making it unable to capture real-time events, e.g., request arrival [98].
In contrast, the sMDP makes a decision whenever an event occurs so that it can
capture the real-time dynamics of MetaSlicing. Finally, the MetaSlice's lifetime is
highly uncertain. When a MetaSlice departs from the system, its occupied resources
are released, and the system state transits to a new state immediately. Again, the
fashion.
To that end, the sMDP will be used in our framework to enable the MISP to make
real-time decisions and maximize its long-term revenues. Technically, an sMDP can
be defined by a set of five components, including (i) the decision epoch ti , (ii) the
Notation Description
Q Maximum number of packets in the data queue (Packets)
nm Required Resource Vector of MetaSlice m
ndm The amount of type d resources required by MetaSlice m
nu The vector of available resources
ndu The number of available resources type d
Nd The total amount of type d resources of the MISP
A The action space
S The state space
T The transition probability
r The immediate reward function
i The class ID
j The similarity index
e The event vector
as The action at state s
x The number of ongoing MetaSlices
xi The number of MetaSlices in class i that are running
λi The mean of Poisson distribution
1/µi The mean of the exponential distribution
wd The weight corresponding to resource type d
ndo Number of type d resources occupied by this requested MetaSlice
π(sg ) The action derived by policy π at decision epoch g
τg The interval time between two consecutive decision epochs
Tπ The limiting matrix corresponding to policy π
Fm The function configuration of a MetaSlice m
state space S, (iii) the action space A, (iv) the transition probability T , and (v)
the immediate reward function r. In the following sections, we will explain how
this framework can capture all events in the MetaSlice system and make optimal
The decision epochs are defined as points of time at which decisions are made [98].
In our real-time MetaSlicing system, the Admission Controller must make a decision
once a MetaSlice request arrives. Therefore, we can define the decision epoch as an
Aiming to maximize revenue for the MISP with limited resources, several im-
portant factors need to be considered in the system state space. First, the current
system’s available resources and the resources required by the current MetaSlice
request are the two most important factors for the Admission Controller to decide
whether it accepts this current request or not. Second, since the income for leasing
the similarity between a requested MetaSlice and the ongoing MetaInstances. Re-
call that the higher the value of j is, the higher similarity between the requested
MetaSlice and the running MetaInstances is, leading to lower occupied resources
when deploying this new request. Hence, the similarity index is also an important
where D is the total number of resource types, and $n_u^d$ denotes the number of available resources of type d. Similarly, the request's required resources are denoted by $\mathbf{n}_m = [n_m^1, \ldots, n_m^d, \ldots, n_m^D]$, where $n_m^d$ is the number of requested resources of type d. Given the above, the system state space can be defined as follows:
$$\mathcal{S} \triangleq \Big\{ \big(n_u^1, \ldots, n_u^d, \ldots, n_u^D,\ n_m^1, \ldots, n_m^d, \ldots, n_m^D,\ i,\ j\big) : \ldots \Big\},$$
the MetaSlice Analyzer. By this design, the system state is presented by a tuple,
i.e., s ≜ (nu , nm , i, j), and the system can work continuously without ending at a
Recall that in the sMDP framework, the system only transits from state s to state
s′ if and only if an event occurs (e.g., a new MetaSlice request arrival). We define the
a MetaSlice class i departs from the system, ei = 1 if a new MetaSlice request class-i
$$\mathcal{E} \triangleq \Big\{ \mathbf{e} : e_i \in \{-1, 0, 1\};\ \sum_{i=1}^{I} |e_i| \leq 1 \Big\}. \qquad (5.3)$$
Note that there is a trivial event e∗ ≜ (0, . . . , 0) meaning that no MetaSlice request
whether to accept or reject this request to maximize the long-term revenue for the
i.e., a Metaverse application. As such, we assume that users’ requests are queued
at the corresponding Metaverse tenant before being forwarded to the MISP. If the
MISP rejects a user’s request, it will return to the queue at the Metaverse tenant
subscribed by this user until it is served or the user decides to stop the request. Since
the lifetime of a MetaSlice is limited, and resources are released whenever a MetaSlice
this work does not focus on managing the queue, and we assume that the queue
a Metaverse tenant can rent resources from several MISPs to provide alternative
access for their subscribed users if the current MISP runs out of resources. Again,
this problem is also out of the scope of this work. Thus, this is a potential direction
Markov chain's state transition probabilities. Since the sMDP is built on an underlying Continuous-Time Markov Chain (CTMC) $\{X(t)\}$, the uniformization method can be used to derive the state transition probabilities $\mathcal{T}$ [98, 139, 140]. Specifically, the uniformization transforms the CTMC into a corresponding stochastic process $\{\bar{X}(t)\}$ whose transition epochs are derived from a Poisson process at a uniform rate, whereas state
transitions follow a discrete-time Markov chain {Xn }. These two processes, i.e.,
we do not know when a user request comes and leaves the system. Thus, we can con-
sider that the arrival process of class-i requests follows a Poisson distribution with mean rate $\lambda_i$, while a MetaSlice's lifetime follows an exponential distribution with mean $1/\mu_i$, as in [139]. In this way, the parameters of the uniformization
$$z = \max_{\mathbf{x} \in \mathcal{X}} \sum_{i=1}^{I} (\lambda_i + x_i \mu_i), \qquad (5.5)$$
$$z_{\mathbf{x}} = \sum_{i=1}^{I} (\lambda_i + x_i \mu_i), \qquad (5.6)$$
class i that are running simultaneously in the system), and X is the set containing
all possible values of $\mathbf{x}$. Since the total resources of on-going MetaSlices cannot exceed the MISP's total resources, we have
$$\sum_{i=1}^{I} \sum_{m=1}^{x_i} n_m^d \leq N^d, \quad \forall d \in \{1, \ldots, D\}. \qquad (5.7)$$
• The probability of trivial event (i.e., no MetaSlice request of any class arrives
Then, we can obtain the state transition probabilities $\mathcal{T} = \{P_{ss'}(a_s)\}$ with $s, s' \in \mathcal{S}$ and $a_s \in \mathcal{A}_s$, i.e., the probability that the system moves between states by taking
actions.
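As a small illustration of the uniformization parameters in (5.5) and (5.6), the sketch below computes the per-state rate and the uniform rate over an enumerated set of occupancy vectors. The arrival and departure rates reuse the simulation settings given later in this chapter, while the per-class occupancy bounds are placeholders for illustration.

    import itertools
    import numpy as np

    lam = np.array([60.0, 40.0, 25.0])   # arrival rates lambda_i (requests/hour)
    mu = np.array([2.0, 2.0, 2.0])       # departure rates mu_i (1 / mean lifetime)
    x_max = np.array([5, 5, 5])          # assumed per-class occupancy bounds (toy values)

    def rate(x):
        """z_x in (5.6): total event rate when x_i class-i MetaSlices are running."""
        return float(np.sum(lam + np.asarray(x) * mu))

    # Enumerate a toy feasible set X and take the maximum rate, as in (5.5).
    X = itertools.product(*(range(m + 1) for m in x_max))
    z = max(rate(x) for x in X)
    print("uniform rate z =", z)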
To maximize the MISP’s long-term revenue, the income from leasing resources to
consider that the revenues for leasing resources for different classes are different since
different classes may have different requirements such as reliability and delay. Recall
that in our proposed Metaverse system, MetaSlices can share some functions with
the same class. As such, even if two MetaSlices give the same revenue, accepting a
MetaSlice that requires fewer resources will benefit the provider in the long term.
where ri is the revenue from leasing resources for a MetaSlice class i, and ndo is the
that if slices have the same income (i.e., they are in the same class), accepting
requests with fewer resource demands can help the provider to maximize the long-
term revenue.
Note that a weight wd reflects the price for renting one unit of resources type
dressed by various methods such as optimization theory [141] or game theory [142].
i.e., the greater the amount of remaining type d resources is, the lower its weight is. In
practice, the resources are priced based on MISPs’ business strategies, e.g., [143,144].
Note that optimizing weights for resource types is out of scope in this work. In ad-
dition, the simulations in Section 5 are performed to study the impact of immediate
Since the statistical characteristics (e.g., arrival rate and departure rate of a
MetaSlice) of the proposed system are stationary (i.e., time-invariant), the policy
π for the meta-controller can be described as the time-invariant mapping from the
state space to the action space, i.e., π : S → As . This study aims to find an optimal
policy for the MetaSlicing’s Admission Controller that maximizes a long-term aver-
age reward function Rπ (s), which is defined as an average expected reward obtained
$$R_\pi(s) = \lim_{G \to \infty} \frac{\mathbb{E}\Big[\sum_{g=0}^{G} r(s_g, \pi(s_g)) \,\big|\, s_0 = s\Big]}{\mathbb{E}\Big[\sum_{g=0}^{G} \tau_g \,\big|\, s_0 = s\Big]}, \quad \forall s \in \mathcal{S}, \qquad (5.9)$$
where G is the total number of decision epochs, π(sg ) is the action derived by π
at decision epoch g, and τg is the interval time between two consecutive decision
epochs. The existence of the limit in Rπ (s) is proven in Theorem 5.1. Thus, given the
currently available resources and the MetaSlice request’s information, the optimal
policy π ∗ can give optimal actions to maximize Rπ (s), thereby maximizing the long-
Theorem 5.1. Given that the state space S and the number of decision epochs in
$$R_\pi(s) = \lim_{G \to \infty} \frac{\mathbb{E}\Big[\sum_{g=0}^{G} r(s_g, \pi(s_g)) \,\big|\, s_0 = s\Big]}{\mathbb{E}\Big[\sum_{g=0}^{G} \tau_g \,\big|\, s_0 = s\Big]} \qquad (5.10)$$
$$= \frac{\overline{\mathcal{T}}_\pi\, r(s, \pi(s))}{\overline{\mathcal{T}}_\pi\, y(s, \pi(s))}, \quad \forall s \in \mathcal{S}, \qquad (5.11)$$
where r(s, π(s)) is the expected immediate reward and y(s, π(s)) is the expected in-
terval between two successive decision epochs when performing action π(s) at state
policy π, which is given based on the transition probability matrix of this chain, i.e.,
$\mathcal{T}_\pi$, as follows:
$$\overline{\mathcal{T}}_\pi = \lim_{G \to \infty} \frac{1}{G} \sum_{g=0}^{G-1} \mathcal{T}_\pi^{\,g}. \qquad (5.12)$$
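Numerically, the limiting matrix in (5.12) can be approximated by averaging successive powers of the transition matrix, as in the minimal NumPy sketch below; the 3-state transition matrix is a made-up example, not taken from the MetaSlicing model.

    import numpy as np

    def limiting_matrix(T, num_terms=10_000):
        """Cesaro average (1/G) * sum_{g=0}^{G-1} T^g, approximating (5.12)."""
        acc = np.zeros_like(T)
        power = np.eye(T.shape[0])
        for _ in range(num_terms):
            acc += power
            power = power @ T
        return acc / num_terms

    T_pi = np.array([[0.9, 0.1, 0.0],
                     [0.2, 0.7, 0.1],
                     [0.0, 0.3, 0.7]])
    print(limiting_matrix(T_pi))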
Note that the underlying Markov chain of our sMDP model is irreducible (i.e.,
the long-term average reward is independent of the starting state), which is proven
in Theorem 5.2.
Theorem 5.2. The long-term average reward Rπ (s) for any policy π is well-defined
Since the limiting matrix $\overline{\mathcal{T}}_\pi$ exists and the sum of probabilities that the system moves from one state to the others is one, we have $\sum_{s' \in \mathcal{S}} \overline{\mathcal{T}}_\pi(s'|s) = 1$. Given the above,
$$\max_{\pi} R_\pi = \frac{\overline{\mathcal{T}}_\pi\, r(s, \pi(s))}{\overline{\mathcal{T}}_\pi\, y(s, \pi(s))}, \quad \text{subject to: } \sum_{s' \in \mathcal{S}} \overline{\mathcal{T}}_\pi(s'|s) = 1, \ \forall s \in \mathcal{S}. \qquad (5.13)$$
Note that while the problem studied in this work involves resource management
for Metaverse, it may seem at first glance that our problem formulation is similar
(VNF) resource allocation or network slicing [145–147]. However, they are funda-
tions when allocating resources [54, 56, 57]. Therefore, it makes our mathematical
different functions that can operate and be initialized independently. Then, we in-
groups according to their function similarities in which some functions can be shared
available resources and similarity between the MetaSlice request and the ongoing
It is worth mentioning that due to the discrete action space, (5.9) and (5.13)
are not convex, making them challenging to solve by conventional methods. In the
following section, we will discuss our proposed solution that can help the Admission
5.3 AI-based Solution with MetaSlice Analysis for MetaSlice Admission Management
This section presents our proposed approach for the MetaSlice admission man-
agement. We first discuss the main steps in the MetaSlice analysis to determine
Controller to address the real-time decision making requirement and the high un-
certainty and dynamics of the request’s arrival and MetaSlice departure processes,
which are, in practice, often unknown in advance. Thanks to the self-learning abil-
ity of DRL, the Admission Controller can gradually obtain an optimal admission
policy via interactions with its surrounding environment without requiring complete
Recall that the major role of the MetaSlice Analyzer is to analyze the request’s
MetaBlueprint to determine the similarity between the requested MetaSlice and the
ongoing MetaInstances. Then, the similarity report is used to assist the Admis-
sion Controller in deciding whether to accept or reject a request. This work uses
the function configuration to decide the similarity index since it clearly shows the
ing that a function is not required by the MetaSlice. Note that a MetaInstance
also maintains a function configuration set and updates it whenever the MetaSlice
is initialized or released.
Given two function configuration sets F1 and F2 , the similarity index can be
given as follows:
$$j(\mathcal{F}_1, \mathcal{F}_2) = \frac{1}{F} \sum_{f=1}^{F} b(\mathbf{f}_f^1, \mathbf{f}_f^2), \qquad (5.15)$$
where b can be any similarity function used for vectors such as Jaccard and Cosine
similarity functions [148]. Here, we use Jaccard similarity that is defined as follows:
$$b_{\mathrm{Jaccard}}(\mathbf{f}_f^1, \mathbf{f}_f^2) = \frac{\mathbf{f}_f^1 \cdot \mathbf{f}_f^2}{\|\mathbf{f}_f^1\|^2 + \|\mathbf{f}_f^2\|^2 - \mathbf{f}_f^1 \cdot \mathbf{f}_f^2}, \qquad (5.16)$$
where the numerator is the dot product of two vectors and || · || is the Euclidean
norm of the vector, which is calculated as the square root of the sum of the squared
vector’s elements.
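For illustration, the sketch below implements the similarity index of (5.15) with the Jaccard (Tanimoto) similarity of (5.16) for function-configuration vectors; the example binary vectors are made up.

    import numpy as np

    def jaccard(f1, f2):
        """Tanimoto/Jaccard similarity of two vectors, as in (5.16)."""
        dot = float(np.dot(f1, f2))
        denom = float(np.dot(f1, f1) + np.dot(f2, f2) - dot)
        return dot / denom if denom > 0 else 0.0

    def similarity_index(F1, F2):
        """Average per-function similarity, as in (5.15); F1 and F2 are F x L arrays."""
        return float(np.mean([jaccard(a, b) for a, b in zip(F1, F2)]))

    # Toy example: F = 3 functions, each described by a binary configuration vector.
    F1 = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
    F2 = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 0]])
    print(similarity_index(F1, F2))   # per-function values 1.0, 0.5, 0.0 average to 0.5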
Recall that the similarity index j is one of the four elements of the state, and in
the state. Thus, the similarity index plays two important roles in the MetaSlicing
learning algorithm with precious information for making decisions. Second, based
a new MetaInstance.
convergence guarantee after the learning phase [149]. Nevertheless, using a table to
estimate the optimal values of all state-action pairs Q∗ (s, a), i.e., Q-values, hinders
the Q-learning from being applied to a high-dimensional state space such as the problem considered in this work, which has hundreds of thousands of states [101]. In addition, the usage of the Q-table is only feasible when state values are discrete, but the similarity score
in the considered state can be a real number. These challenges are addressed by deep
a Q-table, is used to approximate Q∗ (s, a) for all state-action pairs [150]. However,
both Q-learning and DQN have the same problem of overestimation when estimating
Q-values [31]. This issue makes the learning process unstable or even results in a
sub-optimal policy if overestimations are not evenly distributed across states [102].
To that end, this thesis proposes to leverage D3QL, discussed in Section 2.2.4,
for the MetaSlicing Admission Controller's algorithm, namely iMSAC, which can
address the above issues effectively by leveraging three innovative techniques: (i)
the memory replay mechanism, (ii) the Dueling neural network architecture, and (iii)
the double deep Q-network (DDQN) algorithm. The iMSAC model is illustrated in
Figure 5.2.
Note that most recently proposed DRL algorithms are on-policy methods, e.g.,
(PPO), and aim to overcome the limitation of DQN in problems with continu-
ous action space [151, 152]. As such, they may not perform well in problems with
discrete action space, such as the considered problem in this work. For instance,
the study in [153] shows that the Dueling DDQN outperforms the A3C in most of
57 Atari games with discrete action spaces. Similarly, results in [154] show that the
performances of PPO and Dueling DDQN are similar in a discrete-action problem,
while the run time of PPO is almost twice that of Dueling DDQN.
Moreover, the authors in [155] demonstrate that PPO’s performance is heavily in-
fluenced by its code-level implementation. Hence, the PPO may require significant
effort to optimize hyper-parameters, which is not the focus of our work. In contrast,
[Figure 5.2 diagram: an interaction loop (environment, observation/state, Q-network, epsilon-greedy action selection) and a learning loop (replay buffer B, mini-batches, Q-target, loss function, parameter updates).]
Figure 5.2 : The proposed iMSAC-based Admission Controller for the MetaSlicing
framework.
those in [34, 150]. More importantly, since A3C and PPO are on-policy methods,
they must use up-to-date experiences collected by their current policies to learn
an optimal policy [99]. As such, old experiences are dropped after each learning
iteration, making the sample efficiency of these methods low. Here, sample efficiency reflects how well an RL algorithm can make use of every experience. In contrast, the Dueling DDQN is an off-policy method that can leverage experiences collected from any other poli-
cies [99]. Given the above observations, the Dueling DDQN is the most appropriate
learning approach for the considered problem as it can perform well with typical
the training process of a deep neural network, i.e., Q-network that represents the
admission policy. As iMSAC is based on D3QL, its optimality and computational complexity were discussed in Section 2.2.5. Since the Q-network only
consists of conventional components, e.g., fully connected layers and tanh activation,
the decision latency is marginal since decisions can be made almost instantly
by feeding the state to the Q-network. In this work, simulations are performed
on a typical laptop with the AMD Ryzen 3550H (4 cores at 2.1 GHz) and 16GB
RAM. We record that the average decision latency is 6.2 µs. In practice, several low
and ALVINN [107], and thus they clearly demonstrate the applicability of iMSAC
The parameters for our simulation are set as follows, unless otherwise stated. We
consider that the system supports up to nine types of functions, i.e., F = 9. Each
MetaSlice consists of three different functions and belongs to one of three classes,
i.e., class-1, class-2, and class-3. In the configuration set F , we set K = 1. Here, we
set λ1, λ2, and λ3 to 60, 40, and 25 requests/hour, respectively, i.e., the arrival rate vector is λ = [60, 40, 25]. The average MetaSlice session time is 30 minutes, i.e., µi = 2, ∀i ∈ {1, 2, 3}. The revenues ri for accepting a MetaSlice request from class
does not require the above information in advance. It can adjust the admission
policy according to the practical requirements and demands (e.g., rental fees, arrival
rate, and total resources of the system) to maximize a long-term average reward.
Therefore, without loss of generality, the system has three types of resources, i.e.,
for computing.
In our proposed algorithm, i.e., iMSAC, the settings are as follows. For the ϵ-
greedy policy, the value of ϵ is gradually decreased from 1 to 0.01. The discount
factor γ is set to 0.9. We use Pytorch to build the Q-network and the target Q-
network. They have the same architecture as shown in Figure 2.4. During the
and [31], e.g., the learning rate of the Q-network is set at $10^{-3}$ and the target Q-network's parameters are copied from those of the Q-network every $10^4$ steps.
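For concreteness, a minimal PyTorch sketch of a dueling Q-network of the kind used in this D3QL setup is given below. The layer width, the state dimension, and the two-action output (accept/reject) are assumptions for illustration and do not reproduce the exact architecture of Figure 2.4.

    import copy
    import torch
    import torch.nn as nn

    class DuelingQNetwork(nn.Module):
        def __init__(self, state_dim, num_actions=2, hidden=128):
            super().__init__()
            self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
            self.value = nn.Linear(hidden, 1)                # state-value stream V(s)
            self.advantage = nn.Linear(hidden, num_actions)  # advantage stream A(s, a)

        def forward(self, s):
            h = self.feature(s)
            v, a = self.value(h), self.advantage(h)
            # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
            return v + a - a.mean(dim=1, keepdim=True)

    state_dim = 8                      # e.g., available/requested resources, class ID, similarity
    q_net = DuelingQNetwork(state_dim)
    target_net = copy.deepcopy(q_net)  # target Q-network, periodically synchronized
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    def sync_target(step, period=10_000):
        """Copy the Q-network parameters into the target network every `period` steps."""
        if step % period == 0:
            target_net.load_state_dict(q_net.state_dict())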
Recall that our proposed solution consists of two important elements, i.e., the
intelligent algorithm iMSAC and MetaSlice analysis with the MetaInstance tech-
improvement in resource utilization. Meanwhile, the iMSAC can help the Admission
our proposed solution, i.e., iMSAC+MiT, with three counterpart approaches: (i) iM-
SAC, (ii) Greedy policy [99] where the MetaSlicing’s Admission Controller accepts
a request if the system has enough resources for the request, and (iii) Greedy policy
with the MetaInstance technique, i.e., Greedy+MiT. Recall that since the iMSAC
Simulations are conducted to gain insights into our proposed solution, i.e., iM-
SAC+MiT. First, we will investigate the convergence rate of our proposed algorithm
[Figure 5.3 plot: average reward versus learning iteration (x7.5x10^3) for iMSAC+MiT, iMSAC, Greedy+MiT, and Greedy.]
impacts of important system parameters, e.g., the available system resources, im-
mediate rewards reflecting the revenue of the MISP, and the maximum number of
MetaSlices sharing one function that is one of the most important parameters of the
MISP.
Figure 5.3 shows the convergence rates of our proposed iMSAC algorithm in two
scenarios with and without the MetaInstance technique. In this experiment, we set
storage, radio bandwidth, and computing resources to 1200 GB, 1200 MHz, and
functions in total. The average rewards obtained by Greedy+MiT and Greedy are
also presented for comparisons. Specifically, the learning curves of iMSAC+MiT and
iMSAC have a very similar trend. As shown in Figure 5.3, both of them gradually
converge to the optimal policy after $6 \times 10^4$ iterations. However, the iMSAC+MiT's
average reward is stable at 0.9, which is 2.25 times greater than that of the iMSAC.
Similarly, the Greedy+MiT’s average reward is much greater (i.e., 3.5 times) than
that of the Greedy. Thus, these results clearly show the benefits of the iMSAC and
the MetaInstance technique. In particular, while the MetaInstance can help to max-
imize the resource utilization for the system, the iMSAC can make the Admission
Controller learn the optimal policy to maximize the long-term average reward.
in different scenarios. First, we vary the storage, radio, and computing resources
from 400 GB, 400 MHz, and 400 GFLOPS/s to 2200 GB, 2200 MHz, and 2200
by the system is varied from 10 to 55. The policies of iMSAC+MiT and iMSAC
are obtained after $3.75 \times 10^5$ learning iterations. In this scenario, two metrics for
evaluating the Admission Controller’s performance are average reward and accep-
tance probability since they clearly show the effectiveness of the admission policy
in terms of the income for MISP (i.e., average reward) and the service availabil-
ity for end-users. Figure 5.4(a) clearly shows that as the total amount of system
resources increases, the average rewards of all approaches increase. This is due to the fact that the higher the total resources are, the more MetaSlices the system can host, and thus the greater revenue the system can achieve. It is observed
that the proposed algorithm iMSAC+MiT always obtains the highest average re-
ward, up to 80% greater than that of the second-best policy in this scenario, i.e.,
Greedy+MiT. Similarly, Figure 5.4(b) shows that iMSAC+MiT achieves the highest
that of the Greedy+MiT, i.e., the second-best policy. In addition, Figures 5.4(a)
and (b) demonstrate the benefit of MetaInstance. In particular, it helps the system
to increase the average rewards and acceptance rates of both iMSAC and Greedy
To gain more insights, we look further at the acceptance probability for each class
[Figure 5.4 plots: (a) average reward and (b) acceptance probability versus total system resources for iMSAC+MiT, iMSAC, Greedy+MiT, and Greedy; (c)-(f) acceptance probabilities of Class-1, Class-2, and Class-3 for the individual approaches.]
of MetaSlice. As shown in Figures 5.4(c) and (d), for the iMSAC+MiT and iMSAC,
the acceptance probabilities of class-3 are always higher than those of other classes
accept requests from all the classes at almost the same probability, as depicted in
Figures 5.4(e), and (f). Recall that the arrival rate of class-3 is the lowest value (i.e.,
λ3 = 25), while the immediate reward for accepting requests class-3 is the greatest
value, i.e., r3 = 4. Thus, the proposed algorithm iMSAC can learn and adjust its
policy to achieve the best result. More interestingly, for the iMSAC+MiT results,
the acceptance probability of class-3 has significant gaps (up to 50% greater than
those of other classes) when the total resources are small (i.e., less than 20), as shown
in Figure 5.4. This stems from the fact that if the available resources are low, the
Admission Controller should reserve resources for future requests from class-3 with
[Figure 5.5 plots versus N_L: (a) average rewards, (b) acceptance probability, and (c) average number of MetaInstances for iMSAC+MiT and Greedy+MiT.]
the highest reward. In contrast, if the system has more available resources, the
Admission Controller should accept requests from all classes more frequently. This
observation is also shown in Figure 5.4(d), where the MetaInstance is not employed.
Next, we investigate one of the most important factors in the MetaSlicing frame-
work, which is the maximum number of MetaSlices that share the same function,
denoted by NL . In this experiment, we set the resources the same as those in Fig-
ure 5.3, and other settings are set the same as those in Section 5.4.1. Figure 5.5(a)
shows that as the value of NL increases from 1 to 15, the average rewards obtained
by our proposed solution iMSAC+MiT and Greedy+MiT first increase and then
stabilize at around 0.88. Remarkably, when the value of NL is small (i.e., less than
8), the iMSAC+MiT’s average reward is always greater than that of Greedy+MiT,
up to 122%. The reason is that as NL increases, meaning that more MetaSlices can
share the same function, the number of MetaSlices that can be deployed in the sys-
tem increases. In other words, the MetaSlicing system capacity increases according
to the increase of NL . As such, the Admission Controller can accept more requests
rewards for both approaches originate from the fact that the arrival and departure processes of class i follow the Poisson and exponential distributions with fixed means $\lambda_i$ and $1/\mu_i$, respectively.
[Figure 5.6 plots versus r3: (a) average rewards, (b) acceptance probability, and (c) average number of running MetaSlices for iMSAC+MiT, iMSAC, Greedy+MiT, and Greedy.]
tions can be made in Figure 5.5(b). Specifically, our proposed solution maintains
higher request acceptance probabilities (up to 34%) than those of the Greedy+MiT
proaches increase, then they stabilize at around 0.46 when $N_L > 4$ for iMSAC+MiT
and $N_L > 7$ for Greedy+MiT. The reasons are similar to those in Figure 5.5(a). In
particular, the higher the system capacity is, the higher the request acceptance
probability is. Unlike the above metrics, the average numbers of MetaInstances
decrease for both approaches as the value of NL increases from 1 to 15, as shown
in Figure 5.5(c). The reason is that an increase of NL can result in increasing the
stances in a system with a fixed capacity. The above observations in Figure 5.4 and
Figure 5.5 show the superiority of our proposed approach compared with others,
We continue evaluating our proposed solution in the case where the immediate
reward of class-3, i.e., r3 , is varied from 1 to 10. In this experiment, we set the
storage, radio, and computing resources to 400 GB, 400 MHz, and 400 GFLOPS/s,
respectively. The arrival rate vector of MetaSlice is set to λ = [60, 50, 40] to explore
the robustness of our proposed solution. In Figure 5.6(a), as r3 increases, the average
Figure 5.7 : The acceptance probability per class when varying the immediate reward
of class-3.
that our proposed solution, i.e., iMSAC+MiT, consistently achieves the highest
average reward, up to 111% greater than that of the second-best, i.e., Greedy+MiT
when r3 = 1. Interestingly, when r3 is small (i.e., less than 4), the iMSAC’s average
rewards are lower than those of the Greedy+MiT. However, when r3 becomes larger
than or equal to 4, the average rewards obtained by iMSAC are higher than those
of the Greedy+MiT.
Similarly, Figures 5.6(b) and (c) show that our proposed solution always obtains
the highest values compared to those of other approaches in terms of the acceptance
10. Interestingly, even with a decrease in the acceptance probability and average
increase as r3 is varied from 1 to 10, as shown in Figure 5.6. The reason is that
when the immediate reward of class-3 is very high (e.g., r3 = 10) compared to those
of class-1 and class-2 (i.e., 1 and 2, respectively), the iMSAC+MiT reserves more
We now further investigate the above observations when varying the immedi-
ate reward of requests class-3 by looking deeper at the acceptance probability per
class for each approach. Figures 5.7(a)-(d) illustrate the acceptance rate per class
Figure 5.7(c), the Greedy+MiT’s acceptance probabilities for all classes are almost
the same, at around 0.06, when the immediate reward of class-3 increases from 1 to
10. A similar trend is observed for the Greedy but at a lower value, i.e., 0.04, in
probability for class-3 increases while those of other classes decrease as r3 increases
from one to 10, as shown in Figure 5.7(a). More interestingly, when the immediate
reward of class-3 requests is small (i.e., r3 < 2), class-3 requests have the lowest
acceptance probability compared to those of other classes. However, when the im-
mediate reward of class-3 requests is larger than or equal to 2, class-3 requests will
achieve the highest acceptance probability compared with those of other classes.
Moreover, when r3 > 4, the acceptance probability for class-3 requests obtained by
requests also increases until reaching a plateau at around 0.14, which is lower
than those of the Greedy-based solutions. Thus, the iMSAC+MiT and iMSAC
can obtain a good policy in which the acceptance probability of a class increases
if its reward increases compared with the rewards of other classes, and vice versa.
Note that our proposed solution does not need complete information about the
proposed solution can always achieve the best results in all scenarios when we vary
The above findings underscore the efficacy of the proposed MetaSlicing frame-
applications. The system must balance between accepting more requests (increasing
rewards. The integration of MiT with the intelligent iMSAC algorithm enables a
5.5 Conclusion
In this chapter, we have proposed two innovative techniques, i.e., the appli-
niques, we have developed a novel framework for the Metaverse, i.e., MetaSlicing,
that can smartly allocate resources for MetaSlices, i.e., Metaverse applications, to
find the optimal admission policy for the Admission Controller under the high dy-
namics and uncertainty of resource demands. The extensive simulation results have
clearly demonstrated the robustness and superiority of our proposed solution com-
pared with the counterpart methods as well as revealed key factors determining the
enormous potential research topics that we can explore. For example, elastic re-
namically according to the resource demand of this MetaSlice during operation, thus
further improving resource utilization. Another topic that we can further consider
verse. This could include developing standards and protocols for communication
between different MetaSlices and mechanisms for managing conflicts and ensuring
Chapter 6
The previous chapter has addressed the resource management challenges in 6G net-
framework divides an original message into two parts. The first part, i.e., the active-
ambient backscatter tag that backscatters upon the active signals emitted by the
transmitter. Notably, the backscatter tag does not generate its own signal, making
it difficult for an eavesdropper to detect the backscattered signals unless they have
the backscatter message, the eavesdropper is unable to decode the original message.
Even in scenarios where the eavesdropper can capture both messages, reconstruct-
ing the original message is a complex task without understanding the intricacies of
fectively decode the backscattered signals at the receiver, often accomplished using
the maximum likelihood (MLK) approach. However, such a method may require a
complex mathematical model together with perfect channel state information (CSI).
that can not only effectively decode the weak backscattered signals without requir-
[Figure 6.1 diagram: the transmitter sends direct signals over channel d_mn to the receiver while the AmB tag backscatters the transmitter's signals over channel b_mn; an eavesdropper overhears the direct signals.]
ing perfect CSI but also quickly adapt to a new wireless environment with very little
knowledge. Simulation results show that our proposed learning approach, without
requiring perfect CSI and complex mathematical model, can achieve a bit error ratio
close to that of the MLK-based approach. They also clearly show the efficiency of
the proposed approach in dealing with eavesdropping attacks and the lack of training
and the channel model are discussed in Sections 6.1 and 6.2. Then, Sections 6.3
and 6.4 present the MLK-based detector and our proposed DL-based detector for
presented in Section 6.5, and Section 6.6 discusses our simulation results. Finally,
shown in Figure 6.1. Here, the transmitter has a single antenna, while the receiver
signals to gather the information in this channel. To cope with the eavesdropper, this
work deploys a low-cost and low-complexity tag equipped with the AmB technology,
has two operation states, including (i) the absorbing state, where the tag does not
reflect incoming signals, and (ii) the reflecting state, where the tag reflects incoming
signals. In this way, the AmB tag can transmit data without using any active RF
components. Before sending a message to the receiver, the transmitter first encodes
it into two messages: (i) an active message for the transmitter and (ii) an AmB
message forwarded to the AmB tag via the wired channel. Note that the use of a
single-antenna transmitter stems from our study’s aim to develop a lightweight anti-
The active message is then directly transmitted by the transmitter to the re-
ceiver using the conventional RF transmission method. At the same time, when the
transmitter transmits the active message, the AmB tag will leverage its RF signals
to backscatter the AmB message [69, 157]. Thus, the receiver will receive two data
streams simultaneously, one over the conventional channel dmn and another over the
backscatter channel bmn , as depicted in Figure 6.1. Note that instead of producing
active signals, the AmB tag only backscatters/reflects signals. Therefore, wiretap-
ping AmB signals is intractable unless the eavesdropper has prior knowledge about
the system configuration, i.e., the usage of AmB and the exact backscatter rate.
As a result, our system can provide a new deception strategy for data transmis-
the transmitter, so it pays less attention to (or is even unaware of) the presence of
the AmB message, the eavesdropper cannot decode the information from the origi-
nal message. Moreover, even in the worst case in which the eavesdropper knows the
presence of AmB communication in the system, they cannot obtain the original mes-
[Figure 6.2 diagram: bits o(1), o(4), o(7), ... of the original message form the backscatter message, while the remaining bits o(2), o(3), o(5), o(6), ..., o(O) form the active message; the backscatter message is then divided into multiple I-bit backscatter frames f1, f2, f3, ...]
mechanism. As a result, our proposed approach can deal with eavesdropping attacks
in wireless systems.
The detail of our proposed encoding mechanism is depicted in Figure 6.2. Specif-
ically, at the beginning, an encoding mechanism is used to split the original message
into two parts, i.e., a backscatter message and an active transmit message. Note that our framework and the following analysis can adopt any encoding mechanism, and the design of the encoding mechanism is out of the scope of this work. Since the AmB rate
is usually lower than the active transmission rate, the AmB message’s size can be
designed to be smaller than the active message’s size. As such, the AmB message is
constructed by taking one bit from every K bits of the original message. By doing so, the
system security under the presence of the eavesdropper can be significantly improved
because the eavesdropper is unable to derive the message splitting mechanism and
the backscatter message. Note that to improve the detection performance, we in-
corporate F pilot bits into the AmB frame, as depicted in Figure 6.2. The specific
utilization and significance of these pilot bits will be discussed in detail
Notation Description
M The number of antennas
I The number of bits in each backscatter frame
dmn The conventional channel
bmn The backscatter channel
F The number of pilot bits
ymn The signal received at the m-th antenna at the n-th time
y The receiver’s total received signals
σmn The noise following CSCG
Pt The transmitter’s transmit power
Ptr The average power received by the receiver
λ The wave length
Lr , L b The transmitter-receiver and transmitter-tag distances
Gt, Gr The antenna gains of the transmitter and the receiver
stn The signal transmitted by the transmitter at time instant n
fdm , fbm The fading of the transmitter-receiver and transmitter-tag-receiver links
Pb The average power received by the AmB tag
Gb The antenna gain of the AmB tag
gr The Rayleigh fading of the transmitter-tag link
ln The active signals from the transmitter at the AmB tag
e The state of the AmB tag
sbn The backscattered signals
γ The reflection coefficient
αdt The average SNR of the transmitter-to-receiver channel
αbt The average SNR of the transmitter-tag-receiver channel
fd The channel response vector of the transmitter-to-receiver channel
fb The channel response vector of the transmitter-tag-receiver channel
Z(·) The binary entropy function
θ0 and θ1 The probability of backscattering bit 0 and bit 1, respectively
V A realization of y
p(V|e) The conditional probability density function
Y(i) The sequence of received signal corresponding to the period of the i-th AmB symbol
in Section 6.4.1.
This section presents the channel model of our considered system in detail. In
particular, the AmB rate should be significantly lower than the sampling rate of the
transmitter’s signals so that the receiver can decode the backscattered signals with
low BERs [69, 70, 88]. Formally, the transmit signals’ sampling rate is assumed to
be N times higher than the AmB rate. In other words, each bit of the backscatter
the receiver has M antennas (M ≥ 1). Let ymn denote the signal that the m-th
antenna of the receiver receives at the n-th time. As illustrated in Figure 6.1,
ymn comprises (i) the active signal transmitted on the direct link, (ii) the signals
backscattered by the AmB tag on the backscatter link, and (iii) noise from the
where $\sigma_{mn}$ is the noise following the unit-variance and zero-mean circularly symmetric complex Gaussian (CSCG) distribution, denoted by $\sigma_{mn} \sim \mathcal{CN}(0, 1)$. In the following, the
On the direct channel, i.e., the transmitter-receiver link, the average power received by the receiver can be given by
$$P_{tr} = \frac{\kappa P_t G_t G_r}{L_r^{\upsilon}}, \qquad (6.2)$$
where $P_t$ is the transmitter's transmit power, and $\kappa = \big(\frac{\lambda}{4\pi}\big)^2$, in which $\lambda$ is the wavelength.
The antenna gains of the transmitter and the receiver are denoted by Gt and Gr ,
as $s_{tn}$. Then, at the receiver's m-th antenna, the direct link signal is given by
$$d_{mn} = f_{dm} \sqrt{P_{tr}}\, s_{tn}, \qquad (6.3)$$
where $f_{dm}$ represents the Rayleigh fading such that $\mathbb{E}[|f_{dm}|^2] = 1$ [88]. Note that
the proposed AmB tag uses the AmB communication technology to backscatter stn
without using any dedicated energy sources. Since the AmB tag does not know the
active message, stn appears as random signals [82,88,158]. Therefore, we can assume
As mentioned, the AmB tag backscatters the transmitter’s active signals to trans-
mit AmB messages. In the following, we present the formal formulation of the AmB
signals at the receiver. Firstly, the average power received by the AmB tag can be
defined by
$$P_b = \frac{\kappa P_t G_t G_b}{L_b^{\upsilon}}, \qquad (6.4)$$
where Gb denotes the antenna gain of the AmB tag, and Lb denotes the transmitter-
tag distance. Let gr denote the Rayleigh fading of the link from the transmitter to
the AmB tag; then the active signal from the transmitter observed at the AmB tag is given by
$$l_n = \sqrt{P_b}\, g_r s_{tn}. \qquad (6.5)$$
As discussed above, the key idea of the AmB communication is to absorb or reflect
signals. As such, we denote e as the state of the AmB tag. Specifically, e = 1 when
the tag reflects the transmitter’s signals, i.e., transmitting bits 1, and e = 0 when
the tag absorbs the transmitter’s signals, i.e., transmitting bits 0. Because each
tag remains unchanged during this period. The backscattered signals then can be
given by
where γ is the reflection coefficient. It is worth noting that γ captures all properties
of the AmB tag such as load impedance and antenna impedance [158]. Let fbm
represent the Rayleigh fading of the AmB channel, and Le denotes the AmB tag-to-
receiver distance. Here, we can assume that E[|fbm |2 ] = 1 and E[|gr |2 ] = 1 without
loss of generality [88]. Then, the signal in the AmB link received by the m-th antenna is given by
$$b_{mn} = f_{bm} \sqrt{\frac{G_b G_r \kappa}{L_e^{\upsilon}}}\, s_{bn} = f_{bm} \sqrt{\frac{G_b G_r \kappa}{L_e^{\upsilon}}}\, \gamma\, e\, g_r \sqrt{P_b}\, s_{tn} = f_{bm}\, e\, g_r \sqrt{\frac{\kappa |\gamma|^2 G_b G_r P_b}{L_e^{\upsilon}}}\, s_{tn} \qquad (6.7)$$
$$= f_{bm}\, e\, g_r \sqrt{\frac{\kappa |\gamma|^2 G_b G_r}{L_e^{\upsilon}} \cdot \frac{\kappa P_t G_t G_b}{L_b^{\upsilon}}}\, s_{tn} = f_{bm}\, e\, g_r \sqrt{\frac{\kappa |\gamma|^2 G_b^2 L_r^{\upsilon}}{L_b^{\upsilon} L_e^{\upsilon}}\, P_{tr}}\, s_{tn}.$$
By denoting $\tilde{\alpha}_r = \frac{\kappa |\gamma|^2 G_b^2 L_r^{\upsilon}}{L_b^{\upsilon} L_e^{\upsilon}}$, (6.7) can be rewritten as
$$b_{mn} = f_{bm}\, e\, g_r \sqrt{\tilde{\alpha}_r P_{tr}}\, s_{tn}. \qquad (6.8)$$
Now, we can obtain the received signals at the receiver’s m-th antenna by sub-
Let $\alpha_{dt} \triangleq P_{tr}$ and $\alpha_{bt} \triangleq \tilde{\alpha}_r P_{tr}$ denote the average signal-to-noise ratios (SNRs) of the
$$y_{mn} = \underbrace{f_{dm} \sqrt{\alpha_{dt}}\, s_{tn}}_{\text{direct link}} + \underbrace{f_{bm}\, e\, g_r \sqrt{\alpha_{bt}}\, s_{tn}}_{\text{backscatter link}} + \sigma_{mn}. \qquad (6.10)$$
Since the receiver has M antennas, the channel response vectors corresponding to the backscatter and the direct channels are respectively given by
$$\mathbf{y}_n = \underbrace{\mathbf{f}_d \sqrt{\alpha_{dt}}\, s_{tn}}_{\text{direct link}} + \underbrace{\mathbf{f}_b\, e\, g_r \sqrt{\alpha_{bt}}\, s_{tn}}_{\text{backscatter link}} + \boldsymbol{\sigma}_n, \qquad (6.12)$$
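A minimal NumPy sketch of the received-signal model in (6.10)-(6.12) is given below; the SNR values, the number of antennas, and the random seed are placeholders for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def crandn(*shape):
        """Zero-mean, unit-variance circularly symmetric complex Gaussian samples."""
        return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

    M, N = 4, 50                     # antennas, transmit symbols per backscatter bit
    alpha_dt, alpha_bt = 5.0, 0.5    # average SNRs of the direct and backscatter links

    f_d, f_b, g_r = crandn(M), crandn(M), crandn()   # Rayleigh fading coefficients
    e = 1                                            # tag state: 1 = reflect, 0 = absorb

    s_t = crandn(N)                                  # transmitter's (unknown) RF signals
    noise = crandn(M, N)
    # y_n = f_d*sqrt(alpha_dt)*s_tn + f_b*e*g_r*sqrt(alpha_bt)*s_tn + sigma_n, cf. (6.12)
    Y = (np.sqrt(alpha_dt) * np.outer(f_d, s_t)
         + e * g_r * np.sqrt(alpha_bt) * np.outer(f_b, s_t)
         + noise)
    print(Y.shape)   # (M, N): received samples over one backscatter bit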
This work considers that there are I bits in each backscatter frame, i.e.,
To enhance the receiver’s decoding process, F pilot bits are placed in each backscat-
ter frame, so there are I −F information bits in each backscatter frame. We consider
that the AmB tag and the receiver know these bits in advance and use them to es-
timate the AmB tag-receiver channel coefficients. More details about the use of
pilots will be discussed in Section 6.4. We assume that during one AmB frame, the
AmB tag-receiver channel is invariant [82]. Since each backscatter bit is backscat-
tered over N symbols transmitted by the transmitter, we can express the receiver’s
$$\mathbf{y}_n^{(i)} = \mathbf{f}_d \sqrt{\alpha_{dt}}\, s_{tn}^{(i)} + \mathbf{f}_b\, e^{(i)} g_r \sqrt{\alpha_{bt}}\, s_{tn}^{(i)} + \boldsymbol{\sigma}_n^{(i)}, \qquad (6.14)$$
where the backscatter state $e^{(i)}$ equals the backscatter bit $x^{(i)}$, i.e., $e^{(i)} = x^{(i)}$, $\forall i = 1, 2, \ldots, I$, and $n = 1, 2, \ldots, N$.
Note that for the AmB tag, the transmitter’s signals appear as random sig-
nals and are unknown in advance. Therefore, the AmB rate’s closed-form (denoted
RAmB ) cannot be obtained [82, 157]. For that, in Theorem 6.1, we provide an alter-
native method to derive the AmB system’s idealized throughput model when N = 1.
Let θ0 and θ1 denote the probability of backscattering bit 0 and bit 1, respectively.
Theorem 6.1. The maximum achievable rate of the AmB tag is given as
$$R^{*}_{\mathrm{AmB}} = Z(\theta_0) - \int_{\mathbf{V}} \big[\theta_1\, p(\mathbf{V}|e = 1) + \theta_0\, p(\mathbf{V}|e = 0)\big]\, Z(\mu_0)\, d\mathbf{V}, \qquad (6.15)$$
where $\mu_0 = p(\mathbf{V}|e = 0)$, and $Z(\cdot)$ represents the binary entropy function.
It can be observed from (6.15) that the maximum achievable rate $R^{*}_{\mathrm{AmB}}$ de-
quence of received signal corresponding to the period of the i-th AmB symbol. We
describe our proposed AmB detectors that use MLK and DL to recover the original
bits x(i) from these sequences of received signals in the following sections.
6.3 AmB Signal Detector based on Maximum Likelihood
Recall that signals backscattered by the AmB tag can be regarded as the active
signals’ background noise, making them very challenging to be detected [63, 80].
This section presents the AmB signal decoding based on MLK. Note the MLK-
The aim of the MLK-based detector is to derive the received signals' likelihood func-
tions. In particular, if the AmB tag transmits bits “0” (corresponding to e(i) = 0),
the receiver only receives signals solely from the transmitter-to-receiver channel.
Whereas, if the AmB tag transmits bits “1” (corresponding to e(i) = 1), the receiver
will receive signals from both direct and AmB channels. Consequently, the channel
statistical covariance matrices for these scenarios are expressed as [82], [158]
IM is the identity matrix with size M × M . It is important to note that the es-
timation of CSI can be performed by various techniques which are well studied in
the literature [159, 160]. Similar to [82], the noise and the transmitter’s RF signals
are assumed to follow the CSCG distribution*. As a result, $\mathbf{y}_n^{(i)}$ also follows the
*Note that this information is not required by our proposed DL-based AmB signal detector.
Now, the conditional probability density functions (PDFs) of the received signal $\mathbf{y}_n^{(i)}$ are given by
$$p(\mathbf{y}_n^{(i)}|e^{(i)} = 0) = \frac{1}{\pi^M |\mathbf{R}_0|}\, e^{-\mathbf{y}_n^{(i)H} \mathbf{R}_0^{-1} \mathbf{y}_n^{(i)}}, \qquad
p(\mathbf{y}_n^{(i)}|e^{(i)} = 1) = \frac{1}{\pi^M |\mathbf{R}_1|}\, e^{-\mathbf{y}_n^{(i)H} \mathbf{R}_1^{-1} \mathbf{y}_n^{(i)}}, \qquad (6.18)$$
where | · | and (·)−1 represent the determinant and inverse of a matrix, respectively.
From (6.18), we now can express the likelihood functions H(·) for the sequence of
$$H(\mathbf{Y}^{(i)}|e^{(i)} = 1) = \prod_{n=1}^{N} \frac{1}{\pi^M |\mathbf{R}_1|}\, e^{-\mathbf{y}_n^{(i)H} \mathbf{R}_1^{-1} \mathbf{y}_n^{(i)}},$$
$$H(\mathbf{Y}^{(i)}|e^{(i)} = 0) = \prod_{n=1}^{N} \frac{1}{\pi^M |\mathbf{R}_0|}\, e^{-\mathbf{y}_n^{(i)H} \mathbf{R}_0^{-1} \mathbf{y}_n^{(i)}}. \qquad (6.19)$$
Let $\tilde{e}^{(i)}$ be the estimation of $e^{(i)}$. From (6.19), the MLK hypothesis in (6.20) can be expressed as
$$\tilde{e}^{(i)} = \begin{cases} 1, & H(\mathbf{Y}^{(i)}|e^{(i)} = 0) < H(\mathbf{Y}^{(i)}|e^{(i)} = 1), \\ 0, & H(\mathbf{Y}^{(i)}|e^{(i)} = 0) > H(\mathbf{Y}^{(i)}|e^{(i)} = 1). \end{cases} \qquad (6.20)$$
$$\tilde{e}^{(i)} = \begin{cases} 1, & \prod_{n=1}^{N} \frac{1}{\pi^M |\mathbf{R}_0|}\, e^{-\mathbf{y}_n^{(i)H} \mathbf{R}_0^{-1} \mathbf{y}_n^{(i)}} < \prod_{n=1}^{N} \frac{1}{\pi^M |\mathbf{R}_1|}\, e^{-\mathbf{y}_n^{(i)H} \mathbf{R}_1^{-1} \mathbf{y}_n^{(i)}}, \\ 0, & \prod_{n=1}^{N} \frac{1}{\pi^M |\mathbf{R}_0|}\, e^{-\mathbf{y}_n^{(i)H} \mathbf{R}_0^{-1} \mathbf{y}_n^{(i)}} > \prod_{n=1}^{N} \frac{1}{\pi^M |\mathbf{R}_1|}\, e^{-\mathbf{y}_n^{(i)H} \mathbf{R}_1^{-1} \mathbf{y}_n^{(i)}}. \end{cases} \qquad (6.21)$$
$$\tilde{e}^{(i)} = \begin{cases} 1, & \sum_{n=1}^{N} \mathbf{y}_n^{(i)H} (\mathbf{R}_0^{-1} - \mathbf{R}_1^{-1})\, \mathbf{y}_n^{(i)} > N \ln \frac{|\mathbf{R}_1|}{|\mathbf{R}_0|}, \\ 0, & \sum_{n=1}^{N} \mathbf{y}_n^{(i)H} (\mathbf{R}_0^{-1} - \mathbf{R}_1^{-1})\, \mathbf{y}_n^{(i)} < N \ln \frac{|\mathbf{R}_1|}{|\mathbf{R}_0|}. \end{cases} \qquad (6.22)$$
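The test in (6.22) can be implemented directly once the covariance matrices are known, as in the minimal NumPy sketch below; perfect knowledge of R0 and R1 is assumed here, which is exactly the requirement discussed next.

    import numpy as np

    def mlk_detect(Y, R0, R1):
        """Decide the backscattered bit via the test statistic in (6.22).

        Y  : M x N received samples for one backscatter bit,
        R0 : covariance matrix of the received signal when e = 0,
        R1 : covariance matrix when e = 1 (perfect CSI is assumed)."""
        M, N = Y.shape
        A = np.linalg.inv(R0) - np.linalg.inv(R1)
        stat = np.real(np.einsum('mn,mk,kn->', Y.conj(), A, Y))  # sum_n y_n^H A y_n
        threshold = N * np.log(np.linalg.det(R1).real / np.linalg.det(R0).real)
        return 1 if stat > threshold else 0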
Then, the backscattered bit x(i) can be derived based on ẽ(i) . However, for that,
MLK requires perfect CSI for both conventional and AmB transmissions, making its
for the system to estimate these matrices accurately. To that end, we develop a
DL-based signal detector that can overcome this challenge. We first discuss the
data preprocessing procedure, aiming to construct a dataset for the training phase
of the DL detector, and then our proposed DNN architecture is presented in detail.
In practice, the raw data (e.g., received signals) may not be suitable for training a DL model directly, as doing so can lead to a poor or even useless trained model.
period. Instead of directly using Y(i) for model training, the sample covariance
matrix S(i) of the received signals, defined by (6.23), can be used as the training
[Figure 6.3 diagram: construction of the DNN input vector d and the network's input layer.]
The main reason is that the sample covariance matrix can capture the relationship
more insight into the received signals than the signals themselves. In addition, the
responding to an AmB symbol, i.e., N. In other words, the higher the value of N is, the
better the performance the detector can achieve. Typically, N is often set to a higher
value than the number of antennas [82, 83, 88], and thus the size of S(i) is smaller
than that of Y(i) . Therefore, training the DNN model with the sample covariance
matrix S(i) can not only preserve all crucial information of the received signals but
also reduce the training data size without affecting the learning efficiency [82, 88].
especially for a system with multiple antennas [161]. To alleviate this challenge, pilot
bits are leveraged to indirectly capture the channel coefficients, thus improving the
pilot bits where the first F/2 bits are “0”, and the rest are “1”. The detector knows
these bits in advance so that the channel coefficients can be implicitly obtained by
inspecting the received signals corresponding to them. In this work, the averages
of S(i) of received signals corresponding to pilot bits “1” and “0” are leveraged to
$$\bar{\mathbf{R}}_1 = \frac{2}{FN} \sum_{i=1}^{F/2} \sum_{n=1}^{N} \mathbf{y}_n^{(i)} \mathbf{y}_n^{(i)H}, \quad \text{when } e^{(i)} = 1,$$
$$\bar{\mathbf{R}}_0 = \frac{2}{FN} \sum_{i=1}^{F/2} \sum_{n=1}^{N} \mathbf{y}_n^{(i)} \mathbf{y}_n^{(i)H}, \quad \text{when } e^{(i)} = 0. \qquad (6.24)$$
In this way, the detector can use R̄0 and R̄1 to effectively detect AmB sig-
nals without requiring channel coefficients explicitly. In particular, if the AmB tag
backscatters bit “0”, the sample covariance matrix S(i) would be similar to R̄0 but
different from R̄1 . In contrast, if bit “1” is backscattered, the sample covariance
matrix S(i) would be similar to R̄1 but different from R̄0 . It is worth noting that
these similarities and differences between S(i) and the averaged covariance matrices
(i.e., R̄0 and R̄1 ) can be better represented by multiplying S(i) with the inverse of
these matrices. As such, the data for training our DL model is given as follows:
$$\mathbf{D}_1^{(i)} = \mathbf{S}^{(i)} \bar{\mathbf{R}}_1^{-1}, \qquad \mathbf{D}_0^{(i)} = \mathbf{S}^{(i)} \bar{\mathbf{R}}_0^{-1}. \qquad (6.25)$$
By doing so, the proposed AmB signal detector based on DL can obtain adequate
information about the channel coefficients via S(i) , R̄0 , and R̄1 to learn and detect
backscattered bits.
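The preprocessing can be sketched as follows: the pilot-based averages follow (6.24), the sample covariance is the usual (1/N) sum of outer products (the displayed form of (6.23) is not reproduced here), the products follow (6.25), and the final real/imaginary/absolute-value flattening follows the construction described next. The function names and array layouts are illustrative assumptions.

    import numpy as np

    def sample_covariance(Y):
        """S^(i) = (1/N) * sum_n y_n^(i) y_n^(i)H for the M x N samples of one bit."""
        N = Y.shape[1]
        return Y @ Y.conj().T / N

    def pilot_covariances(pilot_frames_0, pilot_frames_1):
        """Averaged covariances R_bar_0 / R_bar_1 from the known pilot bits, cf. (6.24)."""
        R0 = np.mean([sample_covariance(Y) for Y in pilot_frames_0], axis=0)
        R1 = np.mean([sample_covariance(Y) for Y in pilot_frames_1], axis=0)
        return R0, R1

    def features(Y, R0, R1):
        """Build the DNN input from D0 = S R0^-1 and D1 = S R1^-1, cf. (6.25)."""
        S = sample_covariance(Y)
        D0, D1 = S @ np.linalg.inv(R0), S @ np.linalg.inv(R1)
        d = np.concatenate([D0.ravel(), D1.ravel()])        # 2*M*M complex elements
        return np.concatenate([d.real, d.imag, np.abs(d)])  # length 2*M*M*3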
To create input data for the DL model, the matrices $\mathbf{D}_0^{(i)}$ and $\mathbf{D}_1^{(i)}$ are processed as follows. Recall that since the receiver has M antennas, $\mathbf{D}_0^{(i)}$ and $\mathbf{D}_1^{(i)}$ are $M \times M$ matrices. Let $d_{0ij}$ and $d_{1ij}$ be the elements of $\mathbf{D}_0^{(i)}$ and $\mathbf{D}_1^{(i)}$, respectively, with $i, j \in \{1, 2, \ldots, M\}$. First, the matrices $\mathbf{D}_0^{(i)}$ and $\mathbf{D}_1^{(i)}$ are flattened into two vectors, i.e., $\mathbf{d}_0$ and $\mathbf{d}_1$, each having $M \times M$ elements, as shown in Figure 6.3. Then, $\mathbf{d}_1$ is concatenated at the end of $\mathbf{d}_0$ to form vector $\mathbf{d}$ with $2M^2$ elements. Finally, the input data for the DL model is constructed by taking (i) the real part, (ii) the imaginary part, and (iii) the absolute value of each element of the combined vector $\mathbf{d}$, as illustrated
in Figure 6.3. Thus, the input data can allow the DL model to effectively leverage all
This work designs a DNN at the signal detector to decide whether the signals
received by the receiver correspond to bit “1” or bit “0” of an AmB frame. Aiming to
DNN with a simple and typical architecture. In particular, our DNN consists of an
input layer, a small number (e.g., three) of fully connected (FC) hidden layers, and
an output layer, as illustrated in Figure 6.3. Since the input vector's size is $M \times M \times 2 \times 3$, the input layer consists of $M \times M \times 2 \times 3$ neurons. After the input
layer receives the input data, it will be sequentially forwarded to FC layers, and
each consists of multiple neurons. In FC layers, each neuron connects to all neurons
in the previous layer. For example, each neuron in the first FC layer connects to
all neurons in the input layer. The output of the last FC layer is applied to the
soft-max function at the output layer to obtain the category probabilities p that the received signals correspond to bit "0" or bit "1". Finally, the output layer
uses these probabilities to determine the input’s class, i.e., “0” or “1”. In DNNs,
linear unit (ReLU), to the sum of its input and bias to get its output. Here, the
neuron is calculated as the weighted sum of the outputs of neurons in the previous
layer that are connected to it. The training process aims to optimize the DNN
parameters θ (i.e., weights and bias) to minimize the loss L(f (d; θ), ψ), which is the
error between the DNN output f (d; θ) and the ground truth ψ, as follows:
$$\min_{\theta}\ \mathbb{E}\big[L(f(\mathbf{d}; \theta), \psi)\big], \qquad (6.26)$$
where the cross-entropy loss is used at the output layer. To minimize the loss
L(f (d; θ), ψ), this thesis proposes to use SGD (discussed in Section 2.1) to optimize
the DNN's parameters. Thanks to this design, the proposed DL-based receiver can
accurately predict transmitted bits in a backscatter frame even for weak AmB signals
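A minimal PyTorch sketch of such a detector and one SGD training step is given below. The use of three fully connected hidden layers and the cross-entropy objective follow the description above, while the hidden width, the number of antennas, and the placeholder data are assumptions for illustration.

    import torch
    import torch.nn as nn

    def make_detector(M, hidden=256):
        """DNN with an input layer of M*M*2*3 neurons, three FC hidden layers, and a
        2-way output whose softmax gives the probabilities of bit 0 and bit 1."""
        in_dim = M * M * 2 * 3
        return nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),        # logits; CrossEntropyLoss applies the softmax
        )

    M = 4
    model = make_detector(M)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = nn.CrossEntropyLoss()    # cross-entropy loss at the output, cf. (6.26)

    def train_step(d_batch, bits_batch):
        """One SGD step minimizing L(f(d; theta), psi) on a mini-batch."""
        optimizer.zero_grad()
        loss = criterion(model(d_batch), bits_batch)
        loss.backward()
        optimizer.step()
        return loss.item()

    # Example usage with random placeholder data:
    d = torch.randn(32, M * M * 2 * 3)   # preprocessed input vectors
    bits = torch.randint(0, 2, (32,))    # ground-truth backscattered bits
    print(train_step(d, bits))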
accurate CSI, it faces several shortcomings. First, the training process of DL often
requires a huge amount of received signals to achieve good performance [89]. This makes DL less efficient in practice, especially when data is expensive and contains noise due to the environment's dynamics and uncertainty,
over-fitting problem, i.e., performing excellently on training datasets but very poorly
on test datasets, if the training dataset does not contain enough samples to represent
all possible scenarios. Due to the dynamic nature of the wireless environment, the
channel conditions may vary significantly over time. For example, a moving bus
may change channel conditions from LoS to NLoS and vice versa. Consequently,
real-time data may greatly differ from training data, making these above problems
more severe. Third, wireless channel conditions can also be very different at different
areas due to their landscapes, so the DL model trained at one area may not perform
well at other areas. As such, different sites may need training different models
from scratch, which is time-consuming and costly [90]. For that, the next section
will discuss our approach based on meta-learning that can effectively alleviate these
challenges.
learning has an ability of learning to learn, i.e., self-improving the learning algo-
rithm [89–91, 162]. The main idea of meta-learning is to train the model on a
the trained model can quickly perform well in a new task only after a few update
iterations, even when provided with a small dataset specific to the new task [91].
The meta-learning algorithms can be classified into two groups [89]. The first group
aims to encode knowledge in special and complex DNN architectures, such as Con-
networks [165], and Memory-Augmented Neural Networks [162]. As such, this group
requires more overhead for operations [91], and thus it may not be suitable for our
lightweight framework. In the second group, learning algorithms aim to learn the
model’s initialization. Thus, this approach does not require any additional elements
A classical approach of the second group (i.e., learning the model’s initialization)
is that the model is first trained with a large dataset from existing tasks and then
fine-tuned with the new task’s data. However, this pre-training method could not
guarantee that the learned initialization is good for fine-tuning, and many other
techniques need to be leveraged to achieve good performance [89]. Recently, the work
model’s parameters during the learning process with a collection of similar tasks,
high complexity. To overcome this problem, the Reptile algorithm is proposed in [89].
descent (SGD) and Adam, making it less computationally demanding while still
maintaining a performance level similar to that of MAML [89]. For that reason, this work
adopts the Reptile algorithm. The detail of the meta-learning-based AmB signal
parameters θ) is trained with a set T of similar tasks [89, 91]. As an example, let
us consider a set of image classification tasks. The first task involves classifying
images into different animal categories, specifically the tiger, dog, and cat classes.
On the other hand, the second task focuses on classifying images into another set
of animal categories, namely the elephant, bear, and lion classes. Similarly, this
work considers a set of AmB signal detection tasks, each under a particular channel condition, e.g., Rician, Rayleigh, or WINNER II fading.
Generally, meta-learning comprises two nested loops. In the inner loop, a base-
learning algorithm (e.g., SGD or Adam) solves task τ , e.g., detecting the AmB
signal under the Rician channel. The objective of the inner loop's learning is to minimize the loss of the considered task by using a gradient-based algorithm, such as stochastic gradient descent (SGD). Thus, the inner loop is similar to the training process of the proposed DL-based AmB signal detector discussed in the previous section. After p inner updates on task τ, the adapted parameters are θ̄ = Uτ^p(θ), where Uτ^p(·) denotes the update operator. Then, in the outer loop, the meta-model parameters are updated toward the adapted parameters, i.e., θ ← θ + ηo(θ̄ − θ), where ηo is the outer step size controlling how much the model parameters are moved toward θ̄ at each outer iteration. Note that if p = 1, i.e., performing a single step of gradient descent in the inner loop, Algorithm 6.1 becomes a joint training on the mixture of all tasks, which may not learn a good initialization for fast adaptation to new tasks [89].
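The sketch below illustrates these two nested loops in the spirit of the adopted Reptile algorithm; the task loaders, the number of inner steps p, and the step sizes are illustrative assumptions rather than the exact configuration of Algorithm 6.1.

```python
import copy
import itertools
import random
import torch

def reptile_train(meta_model, task_loaders, outer_iters=1000, inner_steps=5,
                  inner_lr=1e-3, outer_lr=0.1):
    """Two nested loops: SGD on a sampled task inside, Reptile update outside."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(outer_iters):
        loader = random.choice(task_loaders)       # sample a training task tau
        adapted = copy.deepcopy(meta_model)        # inner loop starts from theta
        opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        batches = itertools.cycle(loader)
        for _ in range(inner_steps):               # p inner SGD steps yield theta_bar
            x, y = next(batches)
            opt.zero_grad()
            loss_fn(adapted(x), y).backward()
            opt.step()
        with torch.no_grad():                      # outer update: theta <- theta + eta_o (theta_bar - theta)
            for p_meta, p_task in zip(meta_model.parameters(), adapted.parameters()):
                p_meta.add_(outer_lr * (p_task - p_meta))
    return meta_model
```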
Our proposed meta-learning algorithm comprises two nested loops. The inner loop is essentially the SGD-based training algorithm and thus shares the same computational complexity. Therefore, assuming that the outer loop consists of K iterations, the computational complexity of our meta-learning approach would be K times that of SGD. However, our simulation results, presented next, show that this additional cost is justified by the significant reduction in the amount of data required for a new task. In the next section, the proposed framework will be evaluated in various scenarios to get insights into its performance.
6.6 Simulation Results
The proposed framework is evaluated under the following settings unless otherwise stated. For the transmission aspect, the
AmB rate is 50 times lower than the transmitter-to-receiver link rate, i.e., N = 50,
and the AmB message consists of I = 100 bits. Since the transmitter-receiver link
SNR αdt significantly depends on many factors, such as transmit power, antenna
gain, and path loss, we vary it from 1 dB to 9 dB in the simulations to examine our
proposed solution in different scenarios. The tag-receiver link SNR αbt is often quite low since the backscattered signal is attenuated over both the transmitter-tag and tag-receiver links. In this work, the system is simulated using Python, and the DL model is built using the PyTorch library. Specifically, the Rayleigh fading coefficients follow the zero-mean, unit-variance circularly symmetric complex Gaussian (CSCG) distribution. Note that the AmB tag does not know the transmitter's signals (i.e., RF resource signals) sent from the transmitter to the receiver. In addition, this work focuses on signal detection for the AmB link. As such, the RF resource signals can be assumed to follow the zero-mean and unit-variance CSCG distribution, similar to those in [82, 83, 158]. For evaluating different aspects of our proposed solution, we use three metrics, i.e., the maximum achievable AmB rate and BER for the transmission performance, and the guessing entropy for the security performance.
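For illustration, the zero-mean, unit-variance CSCG samples and Rayleigh fading coefficients mentioned above could be drawn as in the sketch below; the shapes are placeholders, and the full received-signal model of the considered system is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def cscg(shape):
    # Zero-mean, unit-variance circularly symmetric complex Gaussian samples.
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

M, N, I = 10, 50, 100         # antennas, resource symbols per AmB bit, AmB bits per frame
s = cscg(I * N)               # RF resource signals from the transmitter (unknown to the tag)
h_d = cscg(M)                 # transmitter-to-receiver Rayleigh fading coefficients
h_b = cscg(M)                 # tag-related Rayleigh fading coefficients
e = rng.integers(0, 2, I)     # backscattered bits with Pr(e = 0) = theta_0
```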
To get insights on the maximum achievable AmB rate R*_AmB of the proposed
system, we perform simulations in three settings, where the receiver has 1, 3 and 10
antennas, as shown in Figure 6.4. First, in Figure 6.4(a), the prior probability of
backscattering bit “0”, i.e., θ0, is varied to observe the maximum achievable AmB rate, i.e., the idealized throughput R*_AmB, which is given by Theorem 6.1. The results are obtained over 10^6 Monte Carlo runs.

Figure 6.4 : The maximum achievable AmB rate R*_AmB (bits/resource symbol) for receivers with 1, 3, and 10 antennas when varying (a) θ0 and (b) the transmitter-to-receiver link SNR, i.e., αdt.

Figure 6.4(a) shows that the maximum achievable AmB rate R*_AmB increases when the receiver has more antennas. This is
because the receiver can achieve a higher gain with more antennas. As such, the
received backscatter signals are enhanced, thereby reducing the impacts from fading
and interference originating from the direct link and the surrounding environment.
Importantly, R*_AmB is maximized at θ0 = 0.5. Therefore, we set θ0 = 0.5 in the rest
of simulations. Next, we vary the transmitter-to-receiver link SNR, i.e., αdt , from
1 dB to 9 dB, as shown in Figure 6.4(b). It can be observed that as the active link
SNR αdt increases, the maximum achievable backscatter rate R*_AmB increases. The
reason is that when the transmitter increases its transmit power, the signal arriving at the AmB tag becomes stronger, and hence the signal backscattered to the receiver is also stronger.
It is worth noting that our proposed framework can counter eavesdropping attacks in two ways. Firstly, a part of the original message is hidden in the AmB transmission, whose existence is difficult for an eavesdropper to detect. Secondly, the proposed message encoding mechanism makes reconstructing the original message non-trivial unless the eavesdropper knows this mechanism and, at the same time, can capture both the active and AmB messages. In particular, without knowledge about the system in advance (i.e., the settings of the AmB transmission), the eavesdropper is not even aware of the existence of the AmB transmission. It is worth emphasizing that even if the eavesdropper is aware of the AmB message, it is still
challenging to capture the AmB message. Suppose that the eavesdropper leverages
the AmB signal detector circuit proposed in [70], which requires different values of
resistors and capacitors in the circuit for different backscatter rates. As such, if the
eavesdropper deploys the AmB circuit but does not know the exact backscatter rate,
it still cannot decode the backscatter signals [70]. Iteratively testing every possible
rate is impractical. Alternatively, the eavesdropper may use MLK- or DL-based detectors to decode the AmB signals. However, MLK-based detectors require complete information on the signal distribution and perfect CSI, while DL-based
methods need a large amount of data and time to detect AmB signals properly.
Given the above, the probability that the eavesdropper can successfully capture the AmB message is very low.
To further evaluate the security of our proposed system, we consider the worst-
case in which the eavesdropper knows the exact backscatter rate in advance, and thus
they can capture the AmB message as well as the active message. However, since the
message encoding technique is unknown, the eavesdropper still does not know how
to construct the original message based on the active and AmB messages. They only
know that combining these two messages can decode the original message. As such,
they must determine the positions of all I bits of the AmB message in the original message. To quantify the security level in this case, this study considers the guessing entropy metric [81].
Figure 6.5 : The upper bound of the expected number of guesses, i.e., E[G(X)], vs. the message splitting ratio β.
Suppose that the original message has P bits. The probability that the eavesdropper successfully finds the positions of the I bits of the AmB message in the original message by guessing is given by
\[
p_I = \frac{1}{C_I^P} = \frac{I!\,(P-I)!}{P!}. \tag{6.28}
\]
To correctly identify the positions of the I bits, the eavesdropper needs to ask a number of questions of the kind “is this correct?”. Suppose that the eavesdropper follows an optimal guessing sequence; then, the average number of such questions defines the guessing entropy [81]. We can consider that the positions of the I bits are analogous to a key k_I for reconstructing the original message. Let X denote the random variable representing the key drawn from the key set K, characterized by the probability distribution P_k. Note that in this work, the key set consists of all possible keys, and thus the size of K is C_I^P. One of the optimal brute-force attacks is the scheme in which the eavesdropper knows P_k and sorts the key set K in descending order of key probability, obtaining the sorted key set K̄. Then, it iteratively tries each key in K̄. The guessing entropy is accordingly defined as follows:
\[
G(X) = \sum_{1 \le i \le |\bar{K}|} i\, p(x_i), \tag{6.29}
\]
where i is the index in K̄, and p(x_i) corresponds to the selection probability of the key
at index i. In [81], the upper bound of the expected number of guesses, i.e., E[G],
for the eavesdropper with the optimal guessing sequence is given as follows:
\[
\mathbb{E}[G(X)] \le \frac{1}{2}\left( \sum_{i=1}^{|\bar{K}|} \sqrt{p(x_i)} \right)^{2} + \frac{1}{2}. \tag{6.30}
\]
In Figure 6.5, we vary the message splitting ratio β = I/P to observe the guessing
entropy of our framework. Here, β = 0.1 means that 10% of original message bits
are transmitted by the AmB tag, and the remaining bits are transmitted by the
transmitter. Since the AmB rate is significantly lower than the transmitter rate [82,
158], the message splitting ratio β is less than 0.5 in practice. As shown in Figure 6.5,
when the ratio β increases from 0.01 to 0.5, the guessing entropy also increases. This stems from the fact that as the size of the AmB message increases, the probability of guessing successfully in one trial decreases, as implied by (6.28) when β < 0.5. It is worth mentioning that the guessing entropy is the average number of questions asked by an eavesdropper that follows an optimal guessing sequence before finding the correct key; thus, the greater the value of E[G(X)], the higher the security level is.
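As a quick numerical illustration of (6.28) and (6.30), the sketch below evaluates the single-guess success probability and the corresponding upper bound on E[G(X)] under a uniform key distribution; the original-message length P = 1,000 (i.e., β = 0.1 with I = 100) is a hypothetical value chosen only for this example.

```python
from math import comb, log10

P, I = 1000, 100            # hypothetical original-message length and the AmB-message length
K = comb(P, I)              # number of possible keys, |K| = C_I^P
p_I = 1 / K                 # Eq. (6.28): probability of guessing all I positions in one trial

# With a uniform key distribution, (sum_i sqrt(p(x_i)))^2 = |K|, so (6.30) becomes |K|/2 + 1/2.
upper_bound = K / 2 + 0.5
print(f"p_I ~ 10^{log10(p_I):.0f}, E[G(X)] <= 10^{log10(upper_bound):.0f}")
```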
The settings for the DL-based approaches are as follows. Note that since the
architecture of DNN can significantly affect the performance of the DL-based AmB
signal detector, it must be designed thoughtfully. For example, a DNN with more
layers may perform better but demands more resources and takes a longer time to train. Thus, this work designs a simple and lightweight DNN while still achieving good performance in detecting AmB signals. Particularly, our DNN has an input
layer, four FC layers, and an output layer. The number of neurons in the input layer is determined by the dimension of the received signal samples and thus depends on the number of the receiver's antennas. The first, second, and third FC layers have 600, 1000, and 600 neurons, respectively. The training dataset consists of 10^4 data points.

Figure 6.6 : The convergence of the DL-based AmB signal detector when αdt = 1 dB and the receiver has 10 antennas.
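A minimal PyTorch sketch of this DNN is given below; the input dimension, the size of the fourth FC layer, and the two-class output layer are assumptions, as they are not fully specified above.

```python
import torch.nn as nn

class AmbDetector(nn.Module):
    """Sketch: input layer, four FC layers, and an output layer for bit 0/1 decisions."""
    def __init__(self, input_dim=220, num_classes=2):     # input_dim is a placeholder
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 600), nn.ReLU(),
            nn.Linear(600, 1000), nn.ReLU(),
            nn.Linear(1000, 600), nn.ReLU(),
            nn.Linear(600, 600), nn.ReLU(),                # fourth FC layer (size assumed)
            nn.Linear(600, num_classes),                   # output layer
        )

    def forward(self, x):
        return self.net(x)
```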
Figure 6.6 shows the convergence rate of our proposed DL-based AmB signal
detector when the direct link SNR αdt = 1 dB and the receiver has 10 antennas. To
train the DNN, we use SGD with a learning rate of 0.001 [95]. The batch
size is set to 1, 000 data points, and thus each training epoch consists of 10 learning
iterations. In Figure 6.6, the accuracy of the MLK-based detector is also presented
as a baseline. It can be observed that after 30 learning epochs, our proposed DL-
based detector converges into a reliable model that can achieve an accuracy close
to that of the MLK-based detector. Note that the MLK-based detector is considered the optimal signal detection scheme, but it is impractical due to its high computational complexity and its requirement of perfect CSI and complete knowledge of the signal distribution.
Now, we investigate the system's BER performance when varying the direct link SNR αdt. The training of our proposed DL-based detector is conducted in the same way as described in Section 6.6.4
with two settings, i.e., providing (i) the estimated CSI based on pilot bits, namely
DL-eCSI, and (ii) the perfect CSI, namely DL-pCSI. We perform simulations with
three antenna configurations for the receiver, i.e., 1, 3, and 10 antennas. To obtain
reliable results, both the MLK-based and the DL-based detectors are evaluated in a Monte Carlo fashion with 10^6 runs. Note that this work only focuses on the AmB link (i.e., the tag-receiver link), and thus we only obtain the BER of the AmB link, since the BER of the direct (transmitter-receiver) link can be close to zero.
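Such a Monte Carlo BER evaluation could be sketched as follows; generate_batch is a hypothetical helper that returns received signals together with the true backscattered bits, and the batch size is arbitrary.

```python
import torch

def estimate_ber(detector, generate_batch, n_runs=10**6, batch=10**4):
    # Monte Carlo BER: generate received AmB signals with known bits, detect, count errors.
    errors, total = 0, 0
    detector.eval()
    with torch.no_grad():
        for _ in range(n_runs // batch):
            x, bits = generate_batch(batch)     # hypothetical signal generator
            pred = detector(x).argmax(dim=1)    # detected bits
            errors += (pred != bits).sum().item()
            total += bits.numel()
    return errors / total
```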
Figure 6.7(a) shows that when the direct link SNR, i.e., αdt , increases from 1 dB
to 9 dB, the BERs of all approaches decrease. The rationale behind is that the
stronger the transmitter’s active signals are, the stronger the signals backscattered
by the AmB tag (i.e., AmB signals) are, so the lower the system BER is. It is also
observed that the system performance improves (i.e., a decrease in BER) as the
number of receiving antennas increases. This is because the received signals at the
receiver are enhanced by antenna gain, which is typically proportional to the number
of antennas. Similar to the observation in Section 6.6.4, the BERs of our DL-based
detector are close to those of the MLK-based detector, an optimal signal detector, in all cases with 1, 3, and 10 antennas. These results clearly show the effectiveness of our proposed DL-based detector.
Next, we vary the tag-receiver SNR, i.e., αbt, from −25 dB to −5 dB, while keeping the other settings unchanged, as shown in Figure 6.7(b). Similar to the observations in Figure 6.7(a), when αbt and the number of antennas increase, the
BERs of all approaches reduce. The rationale behind this is that when the received
signal is stronger and the antenna gain is higher, the detector can attain a lower
BER, indicating better performance. Again, the gap observed in BERs between
our proposed DL-based detector and the MLK method is marginal. Interestingly,
the DL-eCSI can achieve BER results comparable to those of DL-pCSI, as shown in Figures 6.7(a) and (b). Thus, these results show that our proposed DL-based detector can perform well in practice with only estimated CSI.

Figure 6.7 : BER performance when varying (a) the transmitter-receiver SNR αdt and (b) the tag-receiver SNR αbt.
Now, we evaluate the proposed meta-learning approach for the AmB-signal de-
tector with the following setups. We consider that the target task is detecting
AmB signals under Rayleigh fading to better demonstrate the comparison between
meta-learning-, DL-, and MLK- based approaches. The set of tasks for the learning
process, i.e., training tasks, consists of (i) detecting AmB signals under Rician fad-
ing and (ii) detecting AmB signals under WINNER II fading [167]. Algorithm 6.1
is used to train the DNN model, i.e., meta-model, with the training tasks’ datasets.
Here, each training task’s dataset contains 500 data points, and the outer step size η
is set to 0.1, similar to [89]. Then, the trained meta-model is trained with the dataset
collected from the target task (i.e., detecting AmB signals under Rayleigh fading)
by using SGD. For these simulations, we select two types of baselines: (i) DL- and
(ii) MLK-based approaches. Three DL models are trained with different numbers
of data points, including 50, 10^3, and 10^4, following a procedure similar to that in Section 6.6.4.
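The adaptation of the trained meta-model to the target task could be sketched as follows; the data loader, number of epochs, and learning rate are illustrative assumptions.

```python
import copy
import torch

def adapt_to_target(meta_model, target_loader, epochs=5, lr=1e-3):
    # Fine-tune a copy of the trained meta-model on the small target-task dataset
    # (e.g., 50 data points collected under Rayleigh fading) using plain SGD.
    model = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in target_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```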
Figure 6.8 : BER of the proposed meta-learning-based approach and the baselines when varying the transmitter-receiver SNR αdt (dB).
We first compare the BER performance of our proposed meta-learning-based approach and other baselines when varying the transmitter-receiver SNR αdt,
as shown in Figure 6.8. Similar to the observations in Figures 6.7(a) and (b), as the
signal's strength increases (i.e., an increase of SNR), all approaches achieve a lower BER. Among the DL-based approaches, DL with 10^4 data points achieves the lowest BER, and DL with 50 data points gets the highest BER. The reason is that more data points provide more knowledge to the DL model. Thus, it can reduce the over-fitting issue, i.e., performing well on a training dataset but poorly on a test dataset, thereby improv-
ing the system performance [95]. Interestingly, when the transmitter-receiver SNR
αdt ≥ 3 dB, meta-learning with only 50 data points can achieve the lowest BER compared with the DL-based approaches that require up to 10^4 data points. In contrast, when αdt is low (less than 3 dB), meta-learning's BER is slightly higher than that of the DL with 10^4 data points but still lower than those of DL with 10^3 and 50 data points. These observations stem from the fact that meta-learning has a generalization ability that helps a DNN model learn a new task quickly with only a small amount of task-specific data.
Figure 6.9 : Reliability of the learning process with different training dataset sizes: the standard deviation of BER vs. αdt (dB) for the MLK detector, DL with dataset sizes 50, 10^3, and 10^4, and meta-learning with dataset size 50, all with 10 antennas.
To evaluate the reliability of the training process, we run each learning approach 20 times to obtain the standard deviation of its BER results, as
shown in Figure 6.9. The simulation settings are the same as those in Figure 6.8. Generally, the standard deviations of all approaches decrease as the SNR αdt increases from 1 dB to 9 dB. This is because the received signals contain
more noise at a lower SNR, making the learning more challenging and leading to
high deviation of results, i.e., more unstable results. In particular, at αdt = 1 dB, the
standard deviation of DL with 50 training data points is about 10 times higher com-
pared with that of the best learning-based approach, i.e., DL with 10^4 data points.
Again, when αdt ≥ 3 dB, the proposed meta-learning-based approach achieves results comparable to those of DL with 10^4 data points. Notably, when αdt ≥ 8 dB, the results of meta-learning and DL with 10^4 data points are close to those of MLK. When αdt decreases to less than 3 dB, meta-learning still obtains very good results considering that it uses only 50 data points.
Finally, to get insight into how the amount of training data can impact meta-
learning, we set the size of each dataset collected from a training task to Dt and
then vary Dt from 100 to 500 data points. Here, the transmitter-receiver SNR αdt
is set to 5 dB. Figure 6.10 shows the BER performance of models trained by meta-
learning with 50, 100, 200, 400, and 10^3 data points from the target task, namely Meta-50, Meta-100, Meta-200, Meta-400, and Meta-10^3.

Figure 6.10 : BER of the models trained by meta-learning with different numbers of target-task data points (Meta-50, Meta-100, Meta-200, Meta-400, and Meta-10^3), compared with DL-10^4, at αdt = 5 dB.
For comparison with the DL-based approach, this figure also presents the result of the model trained from scratch by SGD (as in Section 6.6.4) at αdt = 5 dB with 10^4 data points of the target task, namely DL-10^4. Similar to the observations in Figure 6.8, all approaches achieve lower BERs as the training data size increases, as shown in Figure 6.10. Interestingly, when the meta-learning training data points are adequate (e.g., Dt ≤ 500), even with only 50 data points from the target task, the meta-trained model can achieve a BER close to that of DL training with 10^4 target-task data points. Given the above, these results clearly show the effectiveness and data efficiency of the proposed meta-learning approach.
6.7 Conclusion
This chapter has introduced a novel anti-eavesdropping framework that can enhance the security of wireless communications by leveraging the AmB technology. In particular, an original message was divided into two messages, and they are transmitted over two channels, i.e., the direct transmission channel and the AmB communication channel. Since the AmB tag only backscatters the RF signals to transmit data, instead of generating active signals, the eavesdropper is unlikely to be aware of the existence of the AmB transmissions. Even if eavesdroppers can capture AmB signals, our proposed message splitting introduces much more difficulties for
the eavesdroppers to derive the original message. To effectively decode AmB signals
at the receiver, we have developed a signal detector based on DL. Unlike MLK-based
solutions, our detector does not rely on a complex mathematical model nor require perfect CSI. To enable the DL-based detector to quickly achieve good performance in new environments, we have applied the meta-learning technique to the training process. The simulation results have shown that our proposed approach can not only ensure the security of the data communications in the presence of eavesdroppers but also achieve a BER performance comparable to that of the MLK-based detector, which requires complete knowledge of the signal distribution and perfect CSI.
Chapter 7
7.1 Conclusion
This thesis has investigated several key enabling technologies of 6G networks, including UAV-assisted data collection, integrated communications and sensing for autonomous vehicles, resource management for the Metaverse, and security, with the goal of achieving efficient, intelligent, and secure 6G networks.

In the first study, we have tackled the data collection problem in UAV-based systems, which are subject to the dynamics and uncertainty of the environment. Specifically, we have proposed a
Deep Dueling Double Q-learning with Transfer Learning algorithm (D3QL-TL) that
jointly optimizes the flying speed and energy replenishment activities for the UAV to maximize the data collection efficiency. The proposed algorithm effectively addresses not only the dynamics and uncertainty of
the system but also the high dimensional state and action spaces of the underly-
ing MDP problem with hundreds of thousands of states. In addition, the proposed
TL techniques (i.e., experience transfer, policy transfer, and hybrid transfer) allow the UAV to reuse knowledge learned in advance, thereby significantly accelerating the learning process. Extensive simulations have shown that our proposed solution can significantly improve the system performance (i.e., data collection and energy usage efficiency) and has a remarkably lower complexity than conventional approaches.
In the second study, we have addressed the challenge in the Integrated Communications and Sensing (ICAS) technology that plays a critical role in AVs, a use case of 6G networks. Specifically, we have first formulated the joint design of the communication and sensing functions as an MDP, in which the ICAS-AV selects its actions based on the observations to maximize the overall performance of the ICAS system.
Then, we have proposed an advanced learning algorithm, i.e., i-ICS, that can help the
ICAS-AV gradually learn an optimal policy through interactions with the surrounding environment without requiring complete information about the environment in advance. As such, our proposed approach can effectively handle the environment's
dynamic and uncertainty as well as the high dimensional state space problem of the
underlying MDP framework. The extensive simulation results have clearly shown
that the proposed solution can strike a balance between communication efficiency and sensing performance in different scenarios.
In the third study, we have addressed the resource management for Metaverse,
a new service supported by 6G. We have proposed two innovative techniques for effective resource management and allocation for the Metaverse built on a multi-tier computing architecture. Based on these
techniques, we have developed a novel framework for the Metaverse, i.e., MetaSlic-
ing, that can smartly allocate resources for MetaSlices, i.e., Metaverse applications. In addition, we have developed an effective framework based on sMDP together with an intelligent algorithm, i.e., iMSAC,
to find the optimal admission policy for the Admission Controller under the high dynamics and uncertainty of MetaSlice requests. Extensive simulation results have clearly demonstrated the robustness and superiority of our proposed solution
compared with the counterpart methods as well as revealed key factors determining
the system performance. As Metaverse continues to evolve and expand, there will
be enormous potential research topics that we can explore. For example, elastic resource allocation can adjust the resources of an admitted MetaSlice dynamically according to the resource demand of this MetaSlice during operation, thus
further improving resource utilization. Another topic that we can further consider is the interoperability among MetaSlices and platforms in the Metaverse. This could include developing standards and protocols for communication
between different MetaSlices and mechanisms for managing conflicts and ensuring consistency across them.
Our fourth study has introduced a novel anti-eavesdropping framework that can secure data communications by leveraging the AmB technology. In particular, an original message was divided into two messages, and they are transmitted over two chan-
nels, i.e., direct transmission channel and AmB communication channel. Since the
AmB tag only backscatters the RF signals to transmit data, instead of generating active signals, the eavesdropper is unlikely to be aware of the existence of AmB transmissions. Even if eavesdroppers can capture AmB signals, our proposed message splitting introduces much more difficulties for the eavesdroppers to derive the origi-
nal message. To effectively decode AmB signals at the receiver, we have developed
a signal detector based on DL. Unlike MLK-based solutions, our detector did not
rely on a complex mathematical model nor require perfect CSI. To enable the DL-based detector to quickly achieve good performance in new environments, we have applied the meta-learning technique to the training process. The simulation re-
sults have shown that our proposed approach can not only ensure the security of
the data communications in the presence of eavesdroppers but also can achieve a BER performance comparable to that of the MLK-based detector, which requires complete knowledge of the signal distribution and perfect CSI.
In summary, this thesis has demonstrated how different components of 6G technology, ranging from aerial data collection, integrated communications and sensing, and Metaverse resource management to secure data transmission, can be optimized using advanced machine learning techniques. Each study, while distinct in its focus, contributes to the overarching goal of creating a robust, efficient, and secure 6G network. They complement and reinforce one another in advancing next-generation network tech-
nology.
It is worth noting that our proposed solutions are based on deep neural networks,
and thus they cannot guarantee optimality. Additionally, since the thesis explores
emerging topics in 6G networks, which are in their infancy, datasets and testbeds are
not publicly available. Also, due to our limited resources, our proposed approaches
were evaluated in simulated environments, which may not completely reflect real-
world conditions. However, the obtained results have still provided valuable insights into the considered problems and the potential of the proposed solutions for future 6G networks.
7.2 Future Works

Our results show great potential for leveraging advanced machine learning to
ten requires complete information about the system) may not be effective in
6G. Thus, there is an urgent need for an effective solution to facilitate the
mance. This could improve signal quality and overall network performance
significant challenge, as they can disguise their true nature. In this context,
predictive AI, which involves forecasting future states based on historical data,
(IoE) [1, 2], 6G networks indeed consist of a massive number of devices, most
input data of AI models, they can severely poison AI models. Thus, there needs
that can directly operate with encrypted data. By doing so, the need for
data decryption and key distribution is eliminated, thus reducing the required
investigated.
Appendix A
Proofs in Chapter 5
A.1 The Proof of Theorem 1

First, we need to prove the existence of the limiting matrix T^π given in (5.12).
In [168], it is proven that for an aperiodic irreducible chain (as that of our proposed
sMDP), the limiting matrix (which is the Cesaro limit of order zero, named C-lim) exists and is given by
\[
T^{\pi} = \text{C-lim}\, T_{\pi} = \lim_{G \to \infty} \frac{1}{G} \sum_{g=0}^{G-1} T_{\pi}^{\,g}. \tag{A.1}
\]
Next, because the total probability that a given state transitions to other states is one, we have
\[
T^{\pi} r(s, \pi(s)) = \lim_{G \to \infty} \frac{1}{G+1}\, \mathbb{E}\!\left[ \sum_{g=0}^{G} r(s_g, \pi(s_g)) \right], \quad \forall s \in \mathcal{S}, \tag{A.2}
\]
\[
T^{\pi} y(s, \pi(s)) = \lim_{G \to \infty} \frac{1}{G+1}\, \mathbb{E}\!\left[ \sum_{g=0}^{G} \tau_g \right], \quad \forall s \in \mathcal{S}. \tag{A.3}
\]
It is observed that the long-term average reward Rπ (s) in (5.9) is the ratio between
T π r(s, π(s)) and T π y(s, π(s)). Note that, by the quotient law for limits, the limit of a quotient of two functions equals the quotient of their limits, provided that the limit of the denominator is nonzero. Since τg , i.e., the
interval time between two consecutive decision epochs, is always larger than zero,
T π y(s, π(s)) is always larger than zero. Therefore, the long-term average reward
Rπ (s) exists.
A.2 The Proof of Theorem 2
This proof is based on the irreducibility of the underlying Markov chain, which can be shown as follows. Note that the state space S consists of the currently available resources in the system, the required re-
sources, the class ID, and the similarity of the MetaSlice request. Recall that the
request arrival and MetaSlice departure follow the Poisson and exponential distri-
butions, respectively. In addition, the requested MetaSlice can have any function
supported by the MetaSlicing system, and thus its required resources are arbitrary.
Given the above, suppose that the Admission Controller observes state s at time t,
then the system state can move to any other state s′ ∈ S after a finite time step.
Therefore, the proposed sMDP is irreducible, and thus the long-term average reward
function Rπ (s) is well-defined regardless of the initial state and under any policy.
Appendix B
Proofs in Chapter 6
B.1 The Proof of Theorem 6.1

In this appendix, we prove Theorem 6.1 in a similar way as in [82]. Since the mutual information between the AmB tag's state e and the received signals y, i.e., M(e; y), defines the achievable AmB rate, we can obtain the maximum achievable AmB rate from the expression of M(e; y) given in (B.1) and (B.2). In (B.2), Z(θ0) is the binary entropy function defined by (B.3), and C(e|V) is the conditional entropy of e given the received signal statistic V.
Note that since Z(θ0) does not depend on the channel coefficients, the maximum achievable AmB rate R*_AmB is given by
\[
R^{*}_{\mathrm{AmB}} = \mathbb{E}[\mathcal{M}(e; y)] = Z(\theta_0) - \mathbb{E}_{V}\big[C(e|V)\big]. \tag{B.4}
\]
The posterior probability of the tag's state given V is
\[
p(e = j \mid V) = \frac{\theta_j\, p(V \mid e = j)}{\theta_1\, p(V \mid e = 1) + \theta_0\, p(V \mid e = 0)}, \tag{B.5}
\]
with j ∈ {0, 1}. By letting µj = p(e = j|V), the conditional entropy C(e|V) is
expressed by
\[
C(e|V) = -\sum_{j=0}^{1} \mu_j \log_2 \mu_j = Z(\mu_0). \tag{B.6}
\]
Substituting (B.5) and (B.6) into (B.4), we obtain
\[
\begin{aligned}
R^{*}_{\mathrm{AmB}} &= Z(\theta_0) - \mathbb{E}_{V}\big[Z(\mu_0)\big] \\
&= Z(\theta_0) - \int_{V} \big( \theta_1\, p(V \mid e = 1) + \theta_0\, p(V \mid e = 0) \big)\, Z(\mu_0)\, \mathrm{d}V,
\end{aligned} \tag{B.7}
\]
which completes the proof.
Bibliography
[2] I. F. Akyildiz, A. Kak, and S. Nie, “6G and beyond: The future of wireless
communications systems,” IEEE Access, vol. 8, pp. 133 995–134 030, Jul. 2020.
[3] L. U. Khan, I. Yaqoob, M. Imran, Z. Han, and C. S. Hong, “6G wireless sys-
[5] “Connecting the world from the sky,” Mar. 2014. [Online]. Available:
https://about.fb.com/news/2014/03/connecting-the-world-from-the-sky/
[7] J. A. Zhang et al., “An overview of signal processing techniques for joint communica-
tion and radar sensing,” IEEE Journal of Selected Topics in Signal Processing,
[9] “Mesh for Microsoft Teams aims to make collaboration in the ‘Meta-
arXiv:2203.05471, 2022.
[11] Meta Quest. Horizon Worlds. (Dec. 10, 2021). Accessed: May 01, 2022.
t=64s
[12] Y. Wu, K. Zhang, and Y. Zhang, “Digital twin networks: A survey,” IEEE
Internet of Things Journal, vol. 8, no. 18, pp. 13 789–13 804, May 2021.
pp. 4–12.
[14] C. Kwon, “Smart city-based Metaverse a study on the solution of urban prob-
lems,” Journal of the Chosun Natural Science, vol. 14, no. 1, pp. 21–26, Mar.
2021.
[15] H. Jeong, Y. Yi, and D. Kim, “An innovative e-commerce platform incor-
Computing, Information and Control, vol. 18, no. 1, pp. 221–229, Feb. 2022.
[16] J. Hu, S. Yan, X. Zhou, F. Shu, J. Li, and J. Wang, “Covert communica-
2023).
Communications Surveys & Tutorials, vol. 16, no. 3, pp. 1550–1573, Feb. 2014.
[21] X. Zeng, F. Ma, T. Chen, X. Chen, and X. Wang, “Age-optimal UAV tra-
[23] Y. Zhang, Z. Mou, F. Gao, L. Xing, J. Jiang, and Z. Han, “Hierarchical deep
IEEE Internet of Things Journal, vol. 8, no. 5, pp. 3786–3800, Mar. 2021.
data collection system for UAV-assisted IoT,” IEEE Internet of Things Jour-
[26] F. Shan, J. Luo, R. Xiong, W. Wu, and J. Li, “Looking before crossing:
[27] J. Gong, T. Chang, C. Shen, and X. Chen, “Flight time minimization of UAV
for data collection over wireless sensor networks,” IEEE Journal on Selected
[28] Q. Pan, X. Wen, Z. Lu, L. Li, and W. Jing, “Dynamic speed control of un-
manned aerial vehicles for data collection under internet of things,” Sensors,
[29] X. Lin, G. Su, B. Chen, H. Wang, and M. Dai, “Striking a balance between
system throughput and energy efficiency for UAV-IoT systems,” IEEE Internet
[30] K. Li, W. Ni, E. Tovar, and A. Jamalipour, “On-board deep Q-network for
on Vehicular Technology, vol. 68, no. 12, pp. 12 215–12 226, Dec. 2019.
[31] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with
[32] M. E. Taylor and P. Stone, “Transfer learning for reinforcement learning do-
pp. 1995–2003.
[35] G. Naik, B. Choudhury, and J. Park, “IEEE 802.11bd & 5G NR V2X: Evo-
the United States,” Proceedings of the IEEE, vol. 99, no. 7, pp. 1162–1182,
Jul. 2011.
sensing,” IEEE Communication Magazine, vol. 54, no. 12, pp. 160–167, Dec.
2016.
[39] X. Cheng, D. Duan, S. Gao, and L. Yang, “Integrated sensing and communi-
Nov. 2017.
[48] T. Xu, T. Zhou, J. Tian, J. Sang, and H. Hu, “Intelligent spectrum sensing:
IEEE Open Journal of the Communications Society, vol. 2, pp. 775–784, Feb.
2021.
2022).
com/media/assets/corporate/docs/about-us/media/media-release/2022/03/
[52] X. Wang et al., “Characterizing the gaming traffic of World of Warcraft: From
game scenarios to network access technologies,” IEEE Network, vol. 26, no. 1,
[54] Y. Jiang, J. Kang, D. Niyato, X. Ge, Z. Xiong, and C. Miao, “Reliable coded
allocation framework for synchronizing Metaverse with IoT service and data,”
[58] K. Cao, Y. Liu, G. Meng, and Q. Sun, “An overview on edge computing
google-maps-will-now-power-location-aware-augmented-reality-games/,
01, 2022).
[63] Y. Zhang, Y. Shen, H. Wang, J. Yong, and X. Jiang, “On secure wireless
on Automation Science and Engineering, vol. 13, no. 3, pp. 1281–1293, Jul.
2016.
Wireless Communications, vol. 17, no. 11, pp. 7252–7267, Nov. 2018.
Transactions on Information Forensics and Security, vol. 14, no. 3, pp. 621–
Computer Communication Review, vol. 43, no. 4, pp. 39–50, Aug. 2013.
pp. 151–164.
1–6.
[77] X. Li, M. Zhao, Y. Liu, L. Li, Z. Ding, and A. Nallanathan, “Secrecy analysis
actions on Vehicular Technology, vol. 69, no. 10, pp. 12 286–12 290, Oct. 2020.
[79] X. Li, Y. Zheng, W. U. Khan, M. Zeng, D. Li, G. Ragesh, and L. Li, “Phys-
[82] H. Guo, Q. Zhang, D. Li, and Y.-C. Liang, “Noncoherent multiantenna re-
[83] H. Guo, Q. Zhang, S. Xiao, and Y.-C. Liang, “Exploiting multiple antennas
[84] H. Ye, G. Y. Li, and B.-H. Juang, “Power of deep learning for channel estima-
[85] C. Fan, X. Yuan, and Y.-J. Zhang, “CNN-based signal detection for banded
learning models for wireless signal classification with distributed low-cost spec-
mathworks.com/campaigns/offers/next/deep-learning-ebook.html, (accessed:
[93] S. Peng, S. Sun, and Y.-D. Yao, “A survey of modulation classification using
tions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. 7020–7038,
Dec. 2022.
ings of the 2001 IEEE International Conference on Data Mining, 2001, pp.
641–642.
[95] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
[96] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global min-
press, 2018.
[102] S. Thrun and A. Schwartz, “Issues in using function approximation for rein-
[103] S. Halkjær and O. Winther, “The effect of correlated input data on the dynam-
755.
Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, Oct. 2009.
[109] Z. Zhu, K. Lin, and J. Zhou, “Transfer learning in deep reinforcement learning:
survey,” Proceedings of the IEEE, vol. 110, pp. 1073 – 1115, Jun. 2022.
efficient IoT sensor with RF wake-up and addressing capability,” IEEE Sen-
learning,” IEEE Transactions on Mobile Computing, vol. 19, no. 6, pp. 1274–
[115] C. H. Liu, Z. Chen, J. Tang, J. Xu, and C. Piao, “Energy-efficient UAV control
Oct. 2021.
[119] P. Kumari, W. Heath, and S. A. Vorobyov, “Virtual pulse design for IEEE
pp. 3315–3319.
McGraw-Hill, 2014.
[121] “Wireless LAN medium access control (MAC) and physical layer (PHY) spec-
2016.
radars,” The European Physical Journal - Applied Physics, vol. 57, Nov. 2011.
[125] H. Rohling and M.-M. Meinecke, “Waveform design principles for automotive
[126] G. R. Curry, Radar System Performance Modeling, 2nd ed. Artech House,
2004.
packet error rate in wireless networks,” in Proceedings of the 3rd Annual Com-
333–338.
[130] S. Sahhaf et al., “Network service chaining with optimized network function
[132] A. Wolke, M. Bichler, and T. Setzer, “Planning vs. dynamic control: Re-
[133] S. Abbasloo, Y. Xu, and H. J. Chao, “C2TCP: A flexible cellular TCP to meet
Challenges & Call for Action,” ESTI. Accessed: Apr. 24, 2022. [Online].
solutions,” IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp.
[138] H. Zhang et al., “Network slicing based 5G and future mobile networks: Mobil-
nl/%7Ekallenberg/Lecture-notes-MDP.pdf
[141] G. Wang, G. Feng, W. Tan, S. Qin, R. Wen, and S. Sun, “Resource allocation
[142] H. Roh, C. Jung, W. Lee, and D.-Z. Du, “Resource pricing game in geo-
1527.
[145] W. Wu, N. Chen, C. Zhou, M. Li, X. Shen, W. Zhuang, and X. Li, “Dynamic
IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 2076–
[146] X. Shen, J. Gao, W. Wu, K. Lyu, M. Li, W. Zhuang, X. Li, and J. Rao,
[148] P. Jaccard, “The distribution of the flora in the alpine zone,” The New Phy-
[154] X. Wang, L. Zhang, Y. Liu, C. Zhao, and K. Wang, “Solving task scheduling
ment learning,” Journal of Manufacturing Systems, vol. 65, pp. 452–468, Oct.
2022.
munications surveys & tutorials, vol. 20, no. 4, pp. 2889–2922, May. 2018.
nal on Selected Areas in Communications, vol. 37, no. 2, pp. 452–463, Feb.
2019.
[159] S. Ma, G. Wang, R. Fan, and C. Tellambura, “Blind channel estimation for
[160] W. Zhao, G. Wang, S. Atapattu, R. He, and Y.-C. Liang, “Channel estima-
reader,” IEEE Transactions on Vehicular Technology, vol. 68, no. 8, pp. 8254–
2015.
[165] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learn-
[167] P. Kyosti, “WINNER II channel models,” IST, Tech. Rep. IST-4-027756 WIN-