Zhao et al. - 2022 - Multi-Agent Deep Reinforcement Learning for Task Offloading in UAV-Assisted Mobile Edge Computing
Abstract— Mobile edge computing can effectively reduce service latency and improve service quality by offloading computation-intensive tasks to the edges of wireless networks. Due to the characteristics of flexible deployment, wide coverage and reliable wireless communication, unmanned aerial vehicles (UAVs) have been employed as assisted edge clouds (ECs) for large-scale sparsely-distributed user equipment. Considering the limited computation and energy capacities of UAVs, a collaborative mobile edge computing system with multiple UAVs and multiple ECs is investigated in this paper. The task offloading issue is addressed to minimize the sum of execution delays and energy consumptions by jointly designing the trajectories, computation task allocation, and communication resource management of UAVs. Moreover, to solve the above non-convex optimization problem, a Markov decision process is formulated for the multi-UAV assisted mobile edge computing system. To obtain the joint strategy of trajectory design, task allocation, and power management, a cooperative multi-agent deep reinforcement learning framework is investigated. Considering the high-dimensional continuous action space, the twin delayed deep deterministic policy gradient algorithm is exploited. The evaluation results demonstrate that our multi-UAV multi-EC task offloading method can achieve better performance compared with the other optimization approaches.

Index Terms— Mobile edge computing, UAV networks, task offloading, cooperative offloading, deep reinforcement learning.

I. INTRODUCTION

WITH the development of mobile applications (i.e., automatic navigation, infrastructure monitoring, online games, etc.), more and more mobile application tasks become computation-intensive and delay-sensitive, especially in Internet-of-Things [1], [2]. However, these tasks may impose a great challenge on user equipment (UE), which has limited computation and battery capabilities. To address these challenges, multi-access edge computing (MEC) [3] is considered to be an extension of cloud computing for data computation and communication in mobile networks. Instead of transmitting the computation requests to the central computing stations, MEC places servers at the mobile network edges (i.e., cellular base stations or WiFi access points) with computation and storage resources. It will be more convenient for servers to offer computing services to deal with intensive computation tasks of UEs, leading to lower service latency and better service quality.
Nevertheless, there is still a challenging issue for UEs to obtain reliable computation services. On one hand, many UEs execute computation-intensive applications in remote or mountainous areas, where communication infrastructures are always distributed sparsely with poor communication conditions and uncertain MEC environments [4]. On the other hand, there may be massive users requiring computation-intensive services simultaneously. With limited storage and computation resources, it will be difficult for MEC servers to offer their computation services, especially in hotspot areas [5]. Fortunately, due to the advantages of flexible deployment and large coverage, unmanned aerial vehicles (UAVs) have been applied to assist MEC systems to execute the computation-intensive tasks [6], [7]. By establishing LoS links with ground UEs, the UAVs can act as the "flying MEC servers" to offer considerable offloading services with low network overhead and execution latency.

Although prior works in the UAV-assisted networks mainly focus on communication aspects [8], [9], there is still some research on UAV-assisted MEC systems, such as trajectory design [10]–[12], resource management [13]–[15], and computation offloading [16]–[18]. However, most existing works considered the scenario of a single UAV for computation offloading.

Manuscript received 3 June 2021; revised 14 December 2021; accepted 18 February 2022. Date of publication 2 March 2022; date of current version 12 September 2022. This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB1801105; in part by the National Natural Science Foundation of China under Grant U1801261 and Grant 61801101; in part by the Key Areas of Research and Development Program of Guangdong Province, China, under Grant 2018B010114001; in part by the Science and Technology Development Fund, Macau SAR, under Grant 0009/2020/A1; in part by the Key Research and Development Plan of Hubei Province under Grant 2021BGD013; in part by the Program of Introducing Talents of Discipline to Universities under Grant B20064; and in part by the National Research Foundation, Singapore, under its AI Singapore Program, under Grant AISG2-RP-2020-019. The associate editor coordinating the review of this article and approving it for publication was K. Tourki. (Corresponding author: Yiyang Pei.)

Nan Zhao and Zhiyang Ye are with the Hubei Collaborative Innovation Center for High-Efficiency Utilization of Solar Energy, Hubei University of Technology, Wuhan 430068, China (e-mail: [email protected]; [email protected]).

Yiyang Pei is with the Singapore Institute of Technology, Singapore 138683 (e-mail: [email protected]).

Ying-Chang Liang is with the Center for Intelligent Networking and Communications (CINC), University of Electronic Science and Technology of China (UESTC), Chengdu 610056, China, and also with the Peng Cheng Laboratory, Shenzhen, Guangdong 518066, China (e-mail: [email protected]).

Dusit Niyato is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TWC.2022.3153316.

Digital Object Identifier 10.1109/TWC.2022.3153316
1536-1276 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: KTH Royal Institute of Technology. Downloaded on August 30,2024 at 11:16:54 UTC from IEEE Xplore. Restrictions apply.
6950 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 9, SEPTEMBER 2022
Similarly, the G2A transmission energy consumption between UE $m$ and UAV $n$ can be defined as

$$E_{mn}^{\mathrm{G2A}}(t) = P_n^r T_{mn}^{\mathrm{G2A}}(t) = \frac{D_m P_n^r}{R_{mn}(t)}, \tag{15}$$

where $P_n^r$ is the receiving power of UAV $n$.

Similarly, the A2G transmission energy consumption between UE $m$ and EC $k$ through UAV $n$ can be obtained as

$$E_{mnk}^{\mathrm{A2G}}(t) = P_n^t T_{mnk}^{\mathrm{A2G}}(t) = \frac{\gamma_{mk}^{n}(t) D_m P_n^t}{R_{nk}(t)}. \tag{22}$$
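Both (15) and (22) have the form power × transmission time, with the time given by data size over rate. A minimal sketch, with hypothetical variable names and units:

```python
def g2a_energy(D_m, P_r, R_mn):
    """G2A energy, Eq. (15): receiving power of UAV n times the
    upload time D_m / R_mn of UE m's input data."""
    return P_r * D_m / R_mn

def a2g_energy(gamma, D_m, P_t, R_nk):
    """A2G energy, Eq. (22): transmit power of UAV n times the time to
    forward the fraction gamma of UE m's task to EC k."""
    return P_t * gamma * D_m / R_nk

# Example (hypothetical numbers): an 8-Mbit task received at 0.1 W
# over a 10 Mbit/s G2A link costs 0.08 J.
e_g2a = g2a_energy(8.0, 0.1, 10.0)
```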
$$\gamma_{m0}^{n}(t) + \sum_{k} \gamma_{mk}^{n}(t) = 1, \quad \forall n \tag{27d}$$
$$0 \le P_n^t(t) \le P_{\max}, \tag{27e}$$
$$(4)\text{--}(10), \tag{27f}$$

where (27b), (27c), and (27d) denote the offloading task constraints of UEs, (27e) is the constraint on the transmit power of UAVs, and (4)-(10) describe the movement constraints of UAVs.

Generally, it is challenging to solve the non-convex optimization problem (27). Certain unknown variables (i.e., UEs' locations and channel conditions) may influence the energy consumption and execution delay, especially in the dynamic network induced by UAVs' mobility. Moreover, considering the decision with the large solution space, it will be intractable to obtain the optimal strategy by traditional optimization schemes. To address these challenges, an RL method will be investigated in the next section to learn the near-optimal policy with little environment information.

III. MADRL FOR TASK OFFLOADING OPTIMIZATION PROBLEM

Here, we first re-model the above problem as a multi-agent extension of the MDP, which is then solved by an MADRL method.

A. MDP Formulation

In UAV-assisted MEC systems, UAVs determine their positions, transmit powers and task partition ratios to obtain the minimum total system cost. Considering that UAVs' actions (i.e., UAVs' movements) may influence the environmental state, the total system cost is determined by the current state of the system environment and the joint actions of all UAVs. Moreover, the former state and previous actions jointly trigger the system environment into a new stochastic state [31]. In this case, the task offloading optimization issue (27) can be formulated as a multi-agent Markov decision process (MDP) $\langle N, S, \{A_n\}_{n\in N}, P, \{R_n\}_{n\in N}, \delta \rangle$. $N$ is the agent set, $S$ is the state set of all agents, $A_n$ is the action space of agent $n$, $P$ represents the state transition probability, $R_n$ is the reward function of agent $n$, and $\delta \in [0, 1]$ denotes the discount factor.

1) Agent Set $N$: Each UAV acts as an agent to learn its scheme of position, transmission power and task partition ratios and obtain the minimum total system cost. Thus, $N = \{1, \ldots, N\}$.

2) State Space $S$: According to the task offloading optimization problem, the state $s(t)$ is composed of the 3D coordinate positions of UAVs, that is,

$$s(t) = \{\omega_1(t), \omega_2(t), \ldots, \omega_N(t)\}. \tag{28}$$

3) Action Space $A_n$: Since each UAV is required to determine its movements (horizontal fly distance $l_n(t)$, horizontal direction angle $\vartheta_n(t)$, and vertical fly distance $\Delta z_n(t)$), transmission power $P_n^t(t)$ and task partition ratios $\gamma_{mk}^{n}(t)$, the action $a_n(t)$ of UAV $n$ can be given by

$$a_n(t) = \{l_n(t), \vartheta_n(t), \Delta z_n(t), P_n^t(t), \gamma_{mk}^{n}(t), \forall k\}. \tag{29}$$

According to the constraints of the minimum optimization problem (27), we can have the value ranges of each element in $a_n(t)$, that is, $l_n(t) \in [0, L_{\max}^h]$, $\vartheta_n(t) \in [0, 2\pi)$, $\Delta z_n(t) \in [-L_{\max}^v, L_{\max}^v]$, $P_n^t(t) \in [0, P_{\max}]$, and $\gamma_{mk}^{n}(t) \in [0, 1]$. Also, we can observe that the action space $A_n$ of UAV $n$ is a continuous set. Moreover, with the number of UEs and ECs increasing, the size of the action space increases exponentially.

4) Reward Function $R_n$: To solve the formulated task offloading optimization problem (27), the $N$ agents should cooperatively minimize the total system cost while satisfying certain constraints, such as the overlapping and collision constraints. Then, the reward function $R_n(t)$ of UAV $n$ is defined as the negative of the system cost $U_n(t)$ if all constraints are satisfied. Otherwise, if certain constraints are not satisfied, there will be the corresponding penalties in the reward function $R_n(t)$. Moreover, to guarantee that UAVs provide computing service to all UEs, the coverage constraint of UAVs should be satisfied. If a certain UE is beyond the UAVs' coverage, there will be a penalty in the reward function. Thus, based on the above considerations, the reward function of UAV $n$ is given by

$$R_n(t) = \begin{cases} -U_n(t), & \text{if satisfying constraints,} \\ -\eta_1 - \eta_2 - \eta_3 \sum_{n=1}^{N} [M - M_n(t)], & \text{otherwise,} \end{cases} \tag{30}$$

where $\eta_1$, $\eta_2$, and $\eta_3$ denote the penalties related with the overlapping constraint (9), the collision constraint (10), and the coverage constraint, respectively. If the horizontal distance of any two UAVs does not meet the overlapping constraint (9), each of the two UAVs will experience a penalty $\eta_1$. Moreover, if the distance between any two UAVs does not satisfy the collision constraint (10), there will be a penalty $\eta_2$ in the reward functions of the two UAVs. Finally, when any UEs are not covered by UAVs, all UAVs will incur the penalty $\eta_3 \sum_{n=1}^{N} [M - M_n(t)]$.

B. Multi-Agent DRL Algorithm

To solve the above multi-agent MDP, considering the high-dimensional continuous action space of the task offloading optimization problem, the multi-agent TD3 (MATD3) approach is proposed, shown in Fig. 3. Each UAV adopts a TD3 algorithm [32], which comprises one actor network with weights $\mu_n$ and two critic networks with weights $\theta_{n1}$ and $\theta_{n2}$. With the two critic networks, each UAV can deal with the overestimation problem of the Q-values in the one-critic framework. In addition, to improve the learning stability, the target actor network with weights $\mu_n'$ and target critic networks with weights $\{\theta_{ni}'\}_{i=1,2}$ are adopted.

Different from other multi-agent RL algorithms, where each agent tries to maximize its own reward function $R_n(t)$, a cooperative multi-agent RL architecture is adopted to achieve the maximum expected discounted reward with the sum reward of all UAVs, which is defined as

$$R(t) = \sum_{n=1}^{N} R_n(t). \tag{31}$$
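The penalty rule in (30) can be sketched as below. The boolean constraint flags and the coverage gap $\sum_{n}[M - M_n(t)]$ are assumed to be computed elsewhere from the UAV positions; the helper name and default penalty values are hypothetical:

```python
def reward(U_n, overlap_ok, collision_ok, coverage_gap,
           eta1=1.0, eta2=1.0, eta3=1.0):
    """Per-UAV reward, Eq. (30): negative system cost when all constraints
    hold; otherwise the (negated) sum of the violated-constraint penalties.
    coverage_gap is sum_n [M - M_n(t)], the aggregate coverage shortfall."""
    if overlap_ok and collision_ok and coverage_gap == 0:
        return -U_n                     # all constraints satisfied
    penalty = 0.0
    if not overlap_ok:
        penalty += eta1                 # overlapping constraint (9)
    if not collision_ok:
        penalty += eta2                 # collision constraint (10)
    penalty += eta3 * coverage_gap      # coverage penalty
    return -penalty
```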
$Q_{\theta_{n1}}(s_j, \pi_n^{\mu}(s_j))$ and $Q_{\theta_{n2}}(s_j, \pi_n^{\mu}(s_j))$ by minimizing the loss function $L(\theta_{ni})$, which is defined as

$$L(\theta_{ni}) = \frac{1}{M_b} \sum_{j=1}^{M_b} \left[ y_j - Q_{\theta_{ni}}(s_j, a_j) \right]^2, \quad i = 1, 2. \tag{35}$$

Next, according to (32) and (35), each UAV can update the weights of the three evaluation networks using the following equations:

$$\mu_n \leftarrow \mu_n - \lambda \nabla_{\mu_n} J(\mu_n),$$
$$\theta_{ni} \leftarrow \theta_{ni} - \lambda \nabla_{\theta_{ni}} L(\theta_{ni}), \quad i = 1, 2, \tag{36}$$

where $\lambda$ denotes the learning rate. To reduce errors resulting from temporal difference learning, each UAV updates the weights of the evaluation actor network at a lower frequency than that of the evaluation critic networks. Here, each UAV chooses to update the evaluation actor network every $d$ time-steps.

Fig. 4. Locations of 30 UEs, 2 ECs and 2 UAVs in multi-UAV assisted MEC system.
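The critic loss (35) and the delayed updates (36) can be sketched as follows. The gradients are assumed to come from an autodiff step elsewhere, and the dict layout of the networks is hypothetical:

```python
import numpy as np

def critic_loss(y, q_pred):
    """Eq. (35): mean-squared TD error over a minibatch of size M_b."""
    y, q_pred = np.asarray(y), np.asarray(q_pred)
    return float(np.mean((y - q_pred) ** 2))

def td3_step(step, grads, params, lr=1e-3, d=2):
    """Eq. (36) with delayed policy updates: both critics descend their
    loss gradients every step, the actor only every d steps."""
    for k in ('critic1', 'critic2'):
        params[k] = params[k] - lr * grads[k]
    if step % d == 0:                   # delayed actor update
        params['actor'] = params['actor'] - lr * grads['actor']
    return params
```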
Thus, in order to stabilize the training process, by copying the weights of the corresponding evaluation networks, each UAV updates the weights of the three target networks every $d$ time-steps through

$$\mu_n' = \tau \mu_n + (1 - \tau)\mu_n',$$
$$\theta_{ni}' = \tau \theta_{ni} + (1 - \tau)\theta_{ni}', \quad i = 1, 2, \tag{37}$$

where $\tau$ denotes the updating rate.

TABLE I. Network Environment Parameters
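The soft target update in (37) is a Polyak average of the evaluation weights into the target weights. A minimal sketch, treating each network as a dict of weights (hypothetical layout):

```python
def soft_update(target, online, tau=0.005):
    """Eq. (37): blend a fraction tau of the evaluation (online) weights
    into the target weights; applied every d time-steps."""
    return {k: tau * online[k] + (1.0 - tau) * target[k] for k in target}
```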
Finally, we discuss the complexity of our proposed MATD3 algorithm. As for the communication complexity, in the centralized training procedure, the ground cloud server needs to frequently communicate with UAVs to obtain the state about the 3D coordinate positions of UAVs. Since the total dimension of UAVs' positions is $3N$, the communication complexity is $O(N)$. In the decentralized execution process, each UAV obtains its action locally, leading to no communication between UAVs. Hence, the overall communication complexity of our proposed MATD3 algorithm is $O(N)$.

Moreover, in the centralized training process, each UAV estimates the Q-function values with critic networks, where the sizes of the inputs and outputs are $3N + N(4 + MK)$ and 1, respectively. In addition, each UAV determines its action based on its actor networks with the input size $3N$ and the output size $N(4 + MK)$. In the decentralized execution procedure, each UAV obtains its action from its actor networks with the input size 3 and the output size $4 + MK$. According to [36], given a fully-connected neural network with fixed numbers of hidden layers and neurons, the computational complexity of the back-propagation algorithm is proportional to the product of the input size and the output size. For the critic network, the centralized training back-propagation complexity is $O(NMK)$, while for the actor network it is $O(N^2 + NMK)$. Therefore, the overall complexity is $O(N^2 + NMK)$.

IV. PERFORMANCE EVALUATION

In this section, numerical experiments are conducted to evaluate the performance of our proposed MATD3. Here, a multi-UAV assisted MEC system is considered with 2 fixed ECs in an area of 400 × 400 m². The 30 UEs are randomly distributed within two hotspots, as illustrated in Fig. 4. The two UAVs are randomly located to offer their computing offloading help to the ground UEs. The size of input data $D_m$ is generated randomly within [2, 10], and the number of CPU cycles $C_m$ is uniform-randomly chosen from [100, 200]. The main simulation parameter settings are summarized in Table I. The proposed MATD3 framework has two-hidden-layer neural networks with 400 and 300 neurons. Table II presents the main hyperparameters of the model.

A. Training Efficiency of MATD3 Scheme

In this section, the training performance of our proposed MATD3 optimization method is analyzed. The optimal location and computing task allocation of UAVs are also presented in this multi-UAV assisted MEC system. The training curves of our proposed MATD3 optimization method are shown in Fig. 5.
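The network input/output sizes quoted in the complexity discussion above can be tabulated with a short helper (name hypothetical):

```python
def matd3_layer_sizes(N, M, K):
    """Input/output sizes from the complexity analysis: the training-time
    critic and actor of each UAV, plus the execution-time actor."""
    critic = (3 * N + N * (4 + M * K), 1)   # centralized critic: state + all actions -> Q
    actor = (3 * N, N * (4 + M * K))        # training-time actor
    exec_actor = (3, 4 + M * K)             # decentralized execution actor (one UAV)
    return {"critic": critic, "actor": actor, "exec_actor": exec_actor}

# Example with the simulated setting N = 2 UAVs, M = 30 UEs, K = 2 ECs.
sizes = matd3_layer_sizes(2, 30, 2)
```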
TABLE II. Hyperparameters of MATD3 Model

Fig. 7. Optimal task splitting ratios of ECs $\gamma_{mk}^{n}$ (m = 1, 2 and n = 1, 2) for UEs.
The training steps are very large at the beginning of learning. As the number of episodes increases, the learning steps converge to fewer than 10 within 30 episodes, which makes the convergence speed tend to increase. Moreover, as the number of episodes increases, the two UAVs cover the area of served UEs more rapidly. Then, the value of the penalty in the reward function tends to zero, leading to the convergence of the training reward.

Fig. 5. Training curves of MATD3.

Figures 6 and 7 present the corresponding optimal location and computing task allocation of UAVs, respectively. From Fig. 6, we can observe that each UAV is located almost at the center of one hotspot, which makes the UAVs provide computing offloading efficiently. Moreover, the dodgerblue shade represents the coverage of UAVs. The higher the UAV's location is, the larger its coverage becomes. Considering the collision avoidance constraints of UAVs and the channel conditions, our proposed method can obtain the optimal location of UAVs to provide offloading opportunities for UEs. Furthermore, according to the optimal task splitting ratio allocation strategy, certain UEs are served by ECs only, while certain UEs obtain computing offloading services from both ECs and UAVs.

Fig. 6. Optimal location of the UAVs.

Then, Fig. 7 presents the optimal task splitting ratio allocation strategy. Since the two UAVs cover the two hotspots respectively, UAV1 (UAV2) does not offer the computing offloading services for the UEs of the hotspot covered by UAV2 (UAV1). In this case, the first 10 UEs are served by UAV1, while the last 20 UEs are served by UAV2. Furthermore, we observe that for UEs (5 and 6) with a large size of input tasks, over 40% of the tasks are first processed at UAV1 (i.e., $\gamma_{m0}^{1}$). After that, the remaining tasks will be offloaded to ECs for subsequent execution, while 75% of the last 20 UEs are served by both UAV2 and ECs.

Next, Fig. 8 indicates the effect of the per-device bandwidth on the optimal task partition ratios. The per-device bandwidth B1 of EC 1 changes from 0.1 to 3 MHz while the other per-device bandwidth B2 remains 0.5 MHz, and vice versa. With the bandwidth assigned to ECs increasing, more bandwidth will be assigned to UEs when computing tasks are offloaded from the UAVs to the ECs, leading to higher downlink data rates. Then, we can achieve less transmission delay and energy consumption. Moreover, when B1 = B2 = 0.5, the same total system cost is achieved in both cases, that is, the two lines intersect at the same point. Specifically, EC 1, with the greater weight on total system cost when B1 = B2 = 0.5, will have a greater impact on reducing total system cost with more assigned bandwidth. When Bk > 0.5 with the other bandwidth fixed at 0.5 MHz, more bandwidth will be assigned to EC k, and the case of EC 1 will achieve a lower total system cost compared with that of EC 2. However, when Bk < 0.5, EC k will receive less bandwidth. In the case of EC 2, EC 1 has a greater impact on reducing total system cost with B1 = 0.5 > B2.

Figure 9 plots total system cost with the various computation capacities of UAVs and different per-device bandwidths B1. The computation capacity of UAVs Fu increases from 3 to 10 GHz. The bandwidth B1 of EC 1 increases from 0.5 to 2 MHz with B2 = 0.5. With the growing bandwidth of EC 1, a higher downlink data rate will be obtained, resulting in less transmission delay, energy consumption, and total system cost. Moreover, with the computation capacity of UAVs Fu increasing, more computation resource is allocated to UEs, leading to less computation delay and total system cost.
Fig. 8. Total system cost with different per-device bandwidths Bk.

Fig. 10. Total system cost as a function of the UAVs' numbers.
Fig. 12. Total system cost with different optimization methods and uplink channel bandwidths Bu .
Fig. 13. Total system cost with different optimization methods and arrival rate of tasks λm .
Fig. 14. Total system cost with different optimization methods and maximum transmission power of UAVs Pmax .
Furthermore, since the random approach selects a random action to achieve the maximum immediate reward, a large total system cost is experienced with both numbers of UAVs, especially in the mobile-UE scenarios. With the fixed power allocation and fixed height of UAVs, the MATD3-FP and MATD3-FH methods always obtain a larger total system cost compared with our proposed MATD3 approach with both numbers of UAVs. In the case of the MADDPG method, as the number of UAVs increases, it becomes more difficult to obtain the optimal action, leading to worse performance in the case of three UAVs. In the MATD3-EC method, without UAVs participating in task processing, a larger total system cost is always achieved compared with our proposed method. Our MATD3 method can always achieve the smallest total system cost among the six approaches in both the fixed-UE and mobile-UE scenarios.

Figure 13 plots total system cost as a function of the arrival rate of tasks λm with different optimization methods. With the arrival rate of tasks λm increasing, more total energy needs to be consumed by the UAVs, resulting in a higher total system cost in all optimization approaches. In addition, with more UAVs participating in task offloading, a smaller total system cost can be achieved, as shown in Fig. 14(b). Moreover, with the relatively high fixed transmission power, the largest total system cost is obtained with the MATD3-FH method in the case of N = 2. The random scheme always obtains a large total system cost with a high arrival rate of tasks. Without UAVs participating in task processing, it is challenging for the MATD3-EC method to deal with so many tasks, especially in the case of N = 3. Compared with the other four learning approaches, our proposed MATD3 approach can achieve the smallest total system cost with both numbers of UAVs.

Figure 14 shows total system cost as a function of the maximum transmission power of UAVs Pmax with different optimization methods. The MATD3-FP approach is considered with the
fixed power scheme (Pnt = Pmax ). With the maximum [2] X. Kang, Y.-C. Liang, and J. Yang, “Riding on the primary: A new
transmission power of UAVs Pmax increasing, we may need to spectrum sharing paradigm for wireless-powered IoT devices,” IEEE
Trans. Wireless Commun., vol. 17, no. 9, pp. 6335–6347, Sep. 2018.
use the higher transmission power of UAVs Pnt . Considering [3] C. Park and J. Lee, “Mobile edge computing-enabled heterogeneous net-
that the transmission energy consumption is an increasing works,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1038–1051,
function of Pnt , the higher system cost can be obtained as Feb. 2021.
[4] Q. Chen, H. Zhu, L. Yang, X. Chen, S. Pollin, and E. Vinogradov, “Edge
Pmax increases in all cases. It can be also observed that with computing assisted autonomous flight for UAV: Synergies between
more UAVs offering task offloading services, the scenario of vision and communications,” IEEE Commun. Mag., vol. 59, no. 1,
N = 3 can achieve the smaller total system cost than that of pp. 28–33, Jan. 2021.
[5] P. A. Apostolopoulos, G. Fragkos, E. E. Tsiropoulou, and
N = 2. S. Papavassiliou, “Data offloading in UAV-assisted multi-access edge
Moreover, since the MATD3-FP approach always allo- computing systems under resource uncertainty,” IEEE Trans. Mobile
cates the fixed transmission power of UAVs with Pnt = Comput., early access, Mar. 31, 2021, doi: 10.1109/TMC.2021.3069911.
[6] G. Yang, Y.-C. Liang, R. Zhang, and Y. Pei, “Modulation in the air:
Pmax , it may achieve the maximum downlink transmission Backscatter communication over ambient OFDM carrier,” IEEE Trans.
energy consumption among the six approaches, especially Commun., vol. 66, no. 3, pp. 1219–1233, Mar. 2018.
in the large maximum transmission power of UAVs Pmax . [7] X. Xu, H. Zhao, H. Yao, and S. Wang, “A blockchain-enabled energy-
efficient data collection system for UAV-assisted IoT,” IEEE Internet
As for the random method, the relatively higher total system Things J., vol. 8, no. 4, pp. 2431–2443, Feb. 2021.
cost is achieved compared with other four learning schemes [8] N. Zhao, Z. Liu, and Y. Cheng, “Multi-agent deep reinforcement learning
(MATD3-FH, MADDPG, MATD3-EC, and MATD3). With for trajectory design and power allocation in multi-UAV networks,” IEEE
Access, vol. 8, pp. 139670–139679, 2020.
the fixed height of UAVs, the MATD3-FH method may need [9] G. Yang, R. Dai, and Y. C. Liang, “Energy-efficient UAV backscatter
the more transmission power of UAVs to guarantee the enough communication with joint trajectory design and resource optimization,”
downlink transmission data rate, which results in the larger IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 926–941, Feb. 2021.
[10] M. Li, N. Cheng, J. Gao, Y. Wang, L. Zhao, and X. Shen, “Energy-
transmission energy consumption. In the MATD3-EC method, efficient UAV-assisted mobile edge computing: Resource allocation and
since all UAVs only offload all tasks to ECs for process- trajectory optimization,” IEEE Trans. Veh. Technol., vol. 69, no. 3,
ing directly, the downlink transmission energy consumption pp. 3424–3438, Mar. 2020.
[11] Y. Wang, Z.-Y. Ru, K. Wang, and P.-Q. Huang, “Joint deployment and
accounts for a large proportion in the total system cost. Then, task scheduling optimization for large-scale mobile users in multi-UAV-
as Pmax increases, it may achieve the larger total system enabled mobile edge computing,” IEEE Trans. Cybern., vol. 50, no. 9,
cost, especially in the case of N = 3. Clearly, MADDPG pp. 3984–3997, Sep. 2020.
[12] Y. Xu, T. Zhang, D. Yang, Y. Liu, and M. Tao, “Joint resource and
experiences the worse performance with the larger number of trajectory optimization for security in UAV-assisted MEC systems,”
UAVs compared with other methods. Our proposed approach IEEE Trans. Commun., vol. 69, no. 1, pp. 573–588, Jan. 2021.
greatly outperforms the above four schemes with the smallest [13] Z. Yu, Y. Gong, S. Gong, and Y. Guo, “Joint task offloading and
resource allocation in UAV-enabled mobile edge computing,” IEEE
total system cost with both numbers of UAVs. Especially when Internet Things J., vol. 7, no. 4, pp. 3147–3159, Apr. 2020.
N = 2, our proposed approach always obtains the optimal [14] Y. Liu, S. Xie, and Y. Zhang, “Cooperative offloading and resource
transmission power of UAVs regardless of the maximum management for UAV-enabled mobile edge computing in power IoT
system,” IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 12229–12239,
transmission power Pmax . Oct. 2020.
[15] J. Ji, K. Zhu, C. Yi, and D. Niyato, “Energy consumption minimization
in UAV-assisted mobile-edge computing systems: Joint resource allo-
V. C ONCLUSION cation and trajectory design,” IEEE Internet Things J., vol. 8, no. 10,
pp. 8570–8584, May 2021.
This paper investigated a UAV-assisted MEC system with [16] J. Zhang et al., “Stochastic computation offloading and trajectory
multiple UAVs and multiple ECs offloading computation tasks scheduling for UAV-assisted mobile edge computing,” IEEE Internet
of UEs collaboratively. An optimization problem was for- Things J., vol. 6, no. 2, pp. 3688–3699, Apr. 2019.
[17] C. Sun, W. Ni, and X. Wang, “Joint computation offloading and
mulated to obtain the minimum sum of execution delays trajectory planning for UAV-assisted edge computing,” IEEE Trans.
and energy consumptions by jointly designing the trajecto- Wireless Commun., vol. 20, no. 8, pp. 5343–5358, Aug. 2021, doi:
ries, computation task allocation, and communication resource 10.1109/TWC.2021.3067163.
[18] C. Zhan, H. Hu, Z. Liu, Z. Wang, and S. Mao, “Multi-UAV-enabled
management. A cooperative MADRL framework was devel- mobile-edge computing for time-constrained IoT applications,” IEEE
oped to tackle the non-convexity of the task offloading opti- Internet Things J., vol. 8, no. 20, pp. 15553–15567, Oct. 2021, doi:
mization issue. Considering the high-dimensional continuous 10.1109/JIOT.2021.3073208.
[19] R. S. Sutton, and A. G. Barto, Reinforcement learning: An introduction.
action space, MATD3 algorithm was presented to obtain the MIT Press Cambridge, 1998.
optimal policy efficiently. Numerical evaluations were given [20] N. C. Luong et al., “Applications of deep reinforcement learn-
to indicate that the proposed collaborative UAV-EC offloading ing in communications and networking: A survey,” IEEE Com-
mun. Surveys Tuts., vol. 21, no. 4, pp. 3133–3174, May 2019, doi:
method can adapt to the mobility of UEs, the change of com- 10.1109/COMST.2019.2916583.
munication and computation resources, and the dynamicity [21] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, “Deep
of computation tasks. The proposed scheme can significantly reinforcement learning for user association and resource allocation
in heterogeneous cellular networks,” IEEE Trans. Wireless Commun.,
reduce the total system cost compared with other optimization vol. 18, no. 11, pp. 5141–5152, Nov. 2019.
approaches. [22] H. Peng and X. Shen, “Multi-agent reinforcement learning based
resource management in MEC- and UAV-assisted vehicular networks,”
IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 131–141, Jan. 2021.
R EFERENCES [23] A. Asheralieva and D. Niyato, “Hierarchical game-theoretic and
reinforcement learning framework for computational offloading in
[1] G. Yang, Q. Zhang, and Y.-C. Liang, “Cooperative ambient backscatter UAV-enabled mobile edge computing networks with multiple service
communications for green Internet-of-Things,” IEEE Internet Things J., providers,” IEEE Internet Things J., vol. 6, no. 5, pp. 8753–8769,
vol. 5, no. 2, pp. 1116–1130, Apr. 2018. Oct. 2019.
Authorized licensed use limited to: KTH Royal Institute of Technology. Downloaded on August 30,2024 at 11:16:54 UTC from IEEE Xplore. Restrictions apply.
6960 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 9, SEPTEMBER 2022
[24] S. Zhu, L. Gui, D. Zhao, N. Cheng, Q. Zhang, and X. Lang, "Learning-based computation offloading approaches in UAVs-assisted edge computing," IEEE Trans. Veh. Technol., vol. 70, no. 1, pp. 928–944, Jan. 2021.
[25] Q. Liu, L. Shi, L. Sun, J. Li, M. Ding, and F. Shu, "Path planning for UAV-mounted mobile edge computing with deep reinforcement learning," IEEE Trans. Veh. Technol., vol. 69, no. 5, pp. 5723–5728, May 2020.
[26] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and A. Nallanathan, "Deep reinforcement learning based dynamic trajectory control for UAV-assisted mobile edge computing," IEEE Trans. Mobile Comput., early access, Feb. 16, 2021, doi: 10.1109/TMC.2021.3059691.
[27] T. Ren et al., "Enabling efficient scheduling in large-scale UAV-assisted mobile edge computing via hierarchical reinforcement learning," IEEE Internet Things J., early access, Apr. 7, 2021, doi: 10.1109/JIOT.2021.3071531.
[28] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and L. Hanzo, "Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing," IEEE Trans. Cognit. Commun. Netw., vol. 7, no. 1, pp. 73–84, Mar. 2021.
[29] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, "3-D placement of an unmanned aerial vehicle base station (UAV-BS) for energy-efficient maximal coverage," IEEE Wireless Commun. Lett., vol. 6, no. 4, pp. 434–437, Aug. 2017.
[30] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, "Mobile-edge computing: Partial computation offloading using dynamic voltage scaling," IEEE Trans. Commun., vol. 64, no. 10, pp. 4268–4282, Oct. 2016.
[31] F. Ding, X. Zhang, and L. Xu, "The innovation algorithms for multivariable state-space models," Int. J. Adapt. Control Signal Process., vol. 33, no. 11, pp. 1601–1608, Oct. 2019.
[32] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, arXiv:1802.09477.
[33] T. Yuan, W. D. R. Neto, C. E. Rothenberg, K. Obraczka, C. Barakat, and T. Turletti, "Dynamic controller assignment in software defined Internet of vehicles through multi-agent deep reinforcement learning," IEEE Trans. Netw. Service Manage., vol. 18, no. 1, pp. 585–596, Mar. 2021.
[34] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proc. 31st Int. Conf. Mach. Learn., vol. 32, Jun. 2014, pp. 387–395.
[35] F. Ding, L. Xu, D. Meng, X.-B. Jin, A. Alsaedi, and T. Hayat, "Gradient estimation algorithms for the parameter identification of bilinear systems using the auxiliary model," J. Comput. Appl. Math., vol. 369, May 2020, Art. no. 112575.
[36] M. Sipper, "A serial complexity measure of neural networks," in Proc. IEEE Int. Conf. Neural Netw., San Francisco, CA, USA, Mar. 1993, pp. 962–966.

Nan Zhao (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from Wuhan University, Wuhan, China, in 2005, 2007, and 2013, respectively. She is currently a Professor with the Hubei University of Technology, Wuhan, and also works as a Post-Doctoral Research Fellow at the University of Electronic Science and Technology of China. Her current research involves machine learning in wireless communications.

Zhiyang Ye received the bachelor's degree from Nanchang Hangkong University in 2019. He is currently pursuing the master's degree in electrical engineering with the Hubei University of Technology. His main research focuses on machine learning in wireless communications.

Yiyang Pei (Senior Member, IEEE) received the B.Eng. and Ph.D. degrees in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2007 and 2012, respectively. From 2012 to 2016, she was a Research Scientist with the Institute for Infocomm Research, Singapore. She is currently an Associate Professor with the Singapore Institute of Technology, Singapore. Her current research interests include reconfigurable intelligent surface, dynamic spectrum access, and application of machine learning to wireless communications and networks. She was a recipient of the IEEE Communications Society Stephen O. Rice Prize Paper Award in 2021. She is an Editor of IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING.

Ying-Chang Liang (Fellow, IEEE) was a Professor with The University of Sydney, Australia, a Principal Scientist and a Technical Advisor with the Institute for Infocomm Research, Singapore, and a Visiting Scholar with Stanford University, USA. He is currently a Professor with the University of Electronic Science and Technology of China, China, where he leads the Center for Intelligent Networking and Communications (CINC). His research interests include wireless networking and communications, cognitive radio, symbiotic communications, dynamic spectrum access, the Internet of Things, artificial intelligence, and machine learning techniques.
Dr. Liang is a Foreign Member of Academia Europaea. He was a Distinguished Lecturer of the IEEE Communications Society and the IEEE Vehicular Technology Society. He received the Prestigious Engineering Achievement Award from the Institution of Engineers, Singapore, in 2007, the Outstanding Contribution Appreciation Award from the IEEE Standards Association in 2011, and the Recognition Award from the IEEE Communications Society Technical Committee on Cognitive Networks in 2018. He was a recipient of numerous paper awards, including the IEEE Communications Society Stephen O. Rice Prize Paper Award in 2021, the IEEE Jack Neubauer Memorial Award in 2014, and the IEEE Communications Society APB Outstanding Paper Award in 2012. He was the Chair of the IEEE Communications Society Technical Committee on Cognitive Networks and served as the TPC Chair and the Executive Co-Chair for IEEE GLOBECOM'17. He was a Guest/an Associate Editor of IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, IEEE Signal Processing Magazine, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, and IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS. He was the Associate Editor-in-Chief of Random Matrices: Theory and Applications (World Scientific). He is the Founding Editor-in-Chief of IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS: Cognitive Radio Series, and the Key Founder and the Editor-in-Chief of IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING. He is also serving as the Associate Editor-in-Chief for China Communications. He has been recognized by Thomson Reuters (now Clarivate Analytics) as a Highly Cited Researcher since 2014.

Dusit Niyato (Fellow, IEEE) received the B.Eng. degree from the King Mongkut's Institute of Technology Ladkrabang in 1999 and the Ph.D. degree in electrical and computer engineering from the University of Manitoba, Canada, in 2008. He is currently a Full Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests are in the areas of green communications, the Internet of Things, and sensor networks.