
IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 9, SEPTEMBER 2022, 6949

Multi-Agent Deep Reinforcement Learning for Task Offloading in UAV-Assisted Mobile Edge Computing

Nan Zhao, Member, IEEE, Zhiyang Ye, Yiyang Pei, Senior Member, IEEE, Ying-Chang Liang, Fellow, IEEE, and Dusit Niyato, Fellow, IEEE

Abstract— Mobile edge computing can effectively reduce service latency and improve service quality by offloading computation-intensive tasks to the edges of wireless networks. Due to the characteristics of flexible deployment, wide coverage, and reliable wireless communication, unmanned aerial vehicles (UAVs) have been employed as assisted edge clouds (ECs) for large-scale, sparsely-distributed user equipment. Considering the limited computation and energy capacities of UAVs, a collaborative mobile edge computing system with multiple UAVs and multiple ECs is investigated in this paper. The task offloading issue is addressed to minimize the sum of execution delays and energy consumptions by jointly designing the trajectories, computation task allocation, and communication resource management of UAVs. Moreover, to solve the above non-convex optimization problem, a Markov decision process is formulated for the multi-UAV assisted mobile edge computing system. To obtain the joint strategy of trajectory design, task allocation, and power management, a cooperative multi-agent deep reinforcement learning framework is investigated. Considering the high-dimensional continuous action space, the twin delayed deep deterministic policy gradient algorithm is exploited. The evaluation results demonstrate that our multi-UAV multi-EC task offloading method can achieve better performance compared with other optimization approaches.

Index Terms— Mobile edge computing, UAV networks, task offloading, cooperative offloading, deep reinforcement learning.

I. INTRODUCTION

WITH the development of mobile applications (e.g., automatic navigation, infrastructure monitoring, online games), more and more mobile application tasks become computation-intensive and delay-sensitive, especially in the Internet-of-Things [1], [2]. However, these tasks may impose a great challenge on user equipment (UE), which has limited computation and battery capabilities. To address these challenges, multi-access edge computing (MEC) [3] is considered to be an extension of cloud computing for data computation and communication in mobile networks. Instead of transmitting computation requests to central computing stations, MEC places servers with computation and storage resources at the mobile network edges (i.e., cellular base stations or WiFi access points). This makes it more convenient for servers to offer computing services that deal with the intensive computation tasks of UEs, leading to lower service latency and better service quality.
Manuscript received 3 June 2021; revised 14 December 2021; accepted 18 February 2022. Date of publication 2 March 2022; date of current version 12 September 2022. This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB1801105; in part by the National Natural Science Foundation of China under Grant U1801261 and Grant 61801101; in part by the Key Areas of Research and Development Program of Guangdong Province, China, under Grant 2018B010114001; in part by the Science and Technology Development Fund, Macau SAR, under Grant 0009/2020/A1; in part by the Key Research and Development Plan of Hubei Province under Grant 2021BGD013; in part by the Program of Introducing Talents of Discipline to Universities under Grant B20064; and in part by the National Research Foundation, Singapore, under its AI Singapore Program, under Grant AISG2-RP-2020-019. The associate editor coordinating the review of this article and approving it for publication was K. Tourki. (Corresponding author: Yiyang Pei.)

Nan Zhao and Zhiyang Ye are with the Hubei Collaborative Innovation Center for High-Efficiency Utilization of Solar Energy, Hubei University of Technology, Wuhan 430068, China (e-mail: [email protected]; [email protected]).

Yiyang Pei is with the Singapore Institute of Technology, Singapore 138683 (e-mail: [email protected]).

Ying-Chang Liang is with the Center for Intelligent Networking and Communications (CINC), University of Electronic Science and Technology of China (UESTC), Chengdu 610056, China, and also with the Peng Cheng Laboratory, Shenzhen, Guangdong 518066, China (e-mail: [email protected]).

Dusit Niyato is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TWC.2022.3153316.

Digital Object Identifier 10.1109/TWC.2022.3153316

Nevertheless, it remains challenging for UEs to obtain reliable computation services. On one hand, many UEs execute computation-intensive applications in remote or mountainous areas, where communication infrastructures are sparsely distributed, with poor communication conditions and uncertain MEC environments [4]. On the other hand, massive numbers of users may require computation-intensive services simultaneously. With limited storage and computation resources, it will be difficult for MEC servers to offer their computation services, especially in hotspot areas [5]. Fortunately, due to the advantages of flexible deployment and large coverage, unmanned aerial vehicles (UAVs) have been applied to assist MEC systems in executing computation-intensive tasks [6], [7]. By establishing LoS links with ground UEs, the UAVs can act as "flying MEC servers" to offer considerable offloading services with low network overhead and execution latency.

Although prior works on UAV-assisted networks mainly focus on communication aspects [8], [9], there is some research on UAV-assisted MEC systems, covering trajectory design [10]–[12], resource management [13]–[15], and computation offloading [16]–[18]. However, most existing works considered the scenario of a single UAV for computation offloading.

1536-1276 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: KTH Royal Institute of Technology. Downloaded on August 30,2024 at 11:16:54 UTC from IEEE Xplore. Restrictions apply.
Due to the limited computation and energy capacities, a single UAV may provide quite limited task offloading performance. It will be more suitable to investigate the scenario with multiple UAVs and multiple edge clouds (ECs) working collaboratively. Moreover, almost all the existing studies focused on static UAV-assisted MEC systems with fixed UEs. Practically, UEs always move around during computing, which makes it difficult to obtain the optimal strategy. Furthermore, since UAVs need to fly to certain areas to offer offloading help from different take-off points, different trajectories of UAVs may cause various channel qualities, leading to different communication delays and energy consumptions. The allocated amounts of computation tasks of the UAVs may also influence the computation delay and energy under the limited on-board resources. Thus, it is necessary to jointly consider the issues of trajectories, computation task allocation, and communication resource management to obtain the minimum execution delays and energy consumptions. Unfortunately, with the non-convex nature and non-stationary environment, it may be difficult to obtain the globally optimal policy without exact and complete information about the environment.

Recently, some research has tried to deal with the joint optimization issue in UAV-assisted MEC systems via reinforcement learning (RL) [19]–[22]. By exploring the dynamic MEC environments, RL can make intelligent decisions under uncertainty. In [23], a hierarchical game-theoretic and RL framework was proposed for computational offloading with multiple service providers. Zhu et al. studied a learning-based computation offloading mechanism to minimize the average mission response time [24]. In [25], the authors presented a deep RL (DRL) approach to plan the flying path for UAV-mounted MEC systems. In [26], a DRL approach was investigated to minimize energy consumption by optimizing the dynamic trajectory control strategy. In [27], Ren et al. proposed an efficient scheduling strategy via hierarchical RL for large-scale UAV-assisted MEC. In [28], a multi-agent DRL method was studied for trajectory planning in multi-UAV assisted MEC systems. However, if the number of UAVs, UEs, or ECs is large, the state (i.e., UAVs' positions) and action (i.e., UAVs' movements, task allocation, and resource management) spaces may grow exponentially, leading to poor convergence efficiency.

To deal with the above challenges, this paper investigates collaborative UAV-assisted MEC systems, where multiple UAVs and multiple ECs offload the computation tasks of UEs. The UEs' task offloading optimization problem is formulated to obtain the minimum execution delays and energy consumptions. A cooperative multi-agent DRL (MADRL) approach is proposed to obtain the trajectories, computation task allocation, and communication resource management at the UAVs. The major contributions of our work are the following:

• We investigate a collaborative task offloading strategy in multi-UAV multi-EC MEC systems, where UAVs and ECs offload the computation tasks of UEs collaboratively. A cooperative MADRL method for this scenario has never been investigated. The task offloading optimization problem is formulated to obtain the minimum total system cost by jointly designing the trajectories, computation task allocation, and communication resource management of UAVs.

• We formulate the highly complex non-convex optimization problem as an MDP, which is then solved by a novel cooperative MADRL framework with each UAV acting as an agent. Considering the high-dimensional continuous action space, the TD3 algorithm is designed to find the efficient UAVs' movements, task offloading allocation, and communication resource management based on dynamic MEC environments.

• We conduct numerical simulations and demonstrate that the proposed collaborative UAV-EC offloading scheme outperforms other optimization approaches, especially in terms of adaptability to UEs' mobility, robustness to changes in communication and computation resources, and flexibility to the dynamicity of computation tasks.

The rest of this paper is organized as follows. Section II provides the system model and problem formulation. Section III proposes the MADRL framework to address the task offloading issues. We present simulation results in Section IV and conclusions in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Fig. 1. Multi-UAV assisted MEC system with M UEs, N UAVs, and K ECs.

Fig. 1 presents the multi-UAV assisted MEC system with M UEs, N UAVs, and a set of K ECs. Each UE m needs to periodically handle computation-intensive tasks $W_m = (D_m, C_m, \lambda_m)$, where $D_m$ is the size of the task data, $C_m$ is the number of required CPU cycles, and $\lambda_m$ is the arrival rate of the tasks. Considering their limited computation capacities, the UEs cannot perform local computing. Then, UAVs are deployed to offer MEC services to the ground UEs. Practically, the UAVs are planned carefully to avoid overlapping trajectories, in order to conserve energy and avoid collisions. Therefore, we assume that each UAV is deployed to offer MEC services for ground UEs within one corresponding sub-area and that there are no overlaps between sub-areas. Moreover, it is assumed that all UAVs are connected to a single ground cloud server via wireless backhaul links.

In the multi-UAV assisted MEC system, limited by factors such as size, weight, and power, the UAVs can provide only limited computation and communication resources. Unlike UAVs, the ECs always consist of MEC servers with more computation and communication resources. Therefore, this paper considers four main components of the task offloading
process: 1) ground-to-air (G2A) transmission from UEs to UAVs; 2) computation at the UAVs; 3) air-to-ground (A2G) transmission from UAVs to ECs; and 4) computation at the ECs.

Fig. 2. Multi-UAV assisted MEC system with M UEs, N UAVs, and K ECs.

A. UAVs Movement

As shown in Fig. 2, the 3D coordinate of UAV n is denoted as $\omega_n(t) = [x_n(t), y_n(t), z_n(t)]^T$, where $x_n(t)$, $y_n(t)$, and $z_n(t)$ are the X, Y, Z coordinates of UAV n at time t, respectively. Denote $v_n(t) = [x_n(t), y_n(t)]^T$ as the 2D coordinate of UAV n. Assume that UAV n flies the distance $l_n(t)$ with direction angle $\vartheta_n(t) \in [0, 2\pi)$ in horizontal flight. Then, we have

$x_n(t+1) = x_n(t) + l_n(t)\cos(\vartheta_n(t))$,  (1)

$y_n(t+1) = y_n(t) + l_n(t)\sin(\vartheta_n(t))$.  (2)

Additionally, according to [26], [29], assume that UAV n has a maximum elevation angle $\varphi_n$. Then, at time t, the maximum horizontal coverage radius $C^n_{max}(t)$ of UAV n can be obtained by

$C^n_{max}(t) = z_n(t)\tan(\varphi_n)$.  (3)

Due to its limited horizontal-flight and vertical-flight speeds, a UAV always has limited flight distances, which can be given by

$Z_{min} \le z_n(t) \le Z_{max}$,  (4)

$l_n(t) = \|v_n(t+1) - v_n(t)\| \le L^h_{max}$,  (5)

$\Delta z_n(t) = |z_n(t+1) - z_n(t)| \le L^v_{max}$,  (6)

where $Z_{min}$ and $Z_{max}$ denote the minimum and maximum heights, respectively; $\Delta z_n(t)$ denotes the vertical travel distance; $L^h_{max}$ and $L^v_{max}$ are the maximum horizontal and vertical travel distances of the UAVs, respectively.

Moreover, in order to guarantee that the UAVs move within the served rectangle-shaped area, the following move constraints must be satisfied:

$0 \le x_n(t) \le X_{max}$,  (7)

$0 \le y_n(t) \le Y_{max}$,  (8)

where $X_{max}$ and $Y_{max}$ are the side lengths of the rectangle-shaped area.

To ensure that the coverage areas of any two UAVs do not overlap with each other, the following overlapping constraint must be satisfied:

$\|v_n(t) - v_j(t)\| \ge C^n_{max}(t) + C^j_{max}(t), \quad \forall n, j, \ n \ne j$.  (9)

Similarly, to avoid collision between any two UAVs, the distance between UAVs should be no less than a minimum distance $D_{min}$. Then, we have the following collision constraint:

$\|\omega_n(t) - \omega_j(t)\| \ge D_{min}, \quad \forall n, j, \ n \ne j$.  (10)

Note that if UEs are located within the coverage of a certain UAV, those UEs will be served by that UAV. Let UAV n serve $M_n(t)$ UEs at time t. We denote $\rho_{nm}(t)$ as a binary service-association indicator: $\rho_{nm}(t) = 1$ when UE m is served by UAV n, and $\rho_{nm}(t) = 0$ otherwise. Assume that each UE can only be served by at most one UAV at any time. That is,

$\sum_{n=1}^{N} \rho_{nm}(t) \le 1, \quad \forall m, \forall t$.

B. G2A Transmission From UEs to UAVs

Here, we denote $\omega_m(t) = [x_m(t), y_m(t), 0]^T$ as the location of UE m, where $x_m(t)$ and $y_m(t)$ are the X and Y coordinates, respectively. The distance between UAV n and UE m can be given by

$d_{mn}(t) = \|\omega_n(t) - \omega_m(t)\|$.  (11)

Similar to [24], [26], [27], assume that the ground UEs can communicate with their serving UAV via orthogonal frequency-division multiple access. Then, the interference between different UEs in the coverage of each UAV can be ignored. Due to the high altitude of UAVs, the LoS channel is much more dominant than other channel impairments such as shadowing or small-scale fading. The Doppler shift caused by the high mobility of UAVs can be assumed to be perfectly compensated at the UEs [15]. Then, the G2A channel gain between UE m and UAV n can be modeled by the free-space path loss model, which is given by

$h_{mn}(t) = \frac{g_0}{[d_{mn}(t)]^2}$,  (12)

where $g_0$ denotes the power gain at the reference distance of 1 meter.

During the task offloading process, the uplink bandwidth $B_u$ is assumed to be allocated to each UE equally. Then, the G2A data rate between UE m and UAV n is

$R_{mn}(t) = \frac{B_u}{M_n(t)} \log_2\left(1 + \frac{h_{mn}(t) P_m}{\sigma^2_u}\right)$,  (13)

where $P_m$ is the transmit power of UE m and $\sigma^2_u$ is the additive white Gaussian noise power at each UAV.

Considering that all tasks are offloaded to UAVs through the G2A channel, the G2A transmission delay between UE m and UAV n can be defined as the task data size $D_m$ divided by the corresponding transmission data rate $R_{mn}(t)$, that is,

$T^{G2A}_{mn}(t) = \frac{D_m}{R_{mn}(t)}$.  (14)
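To make the G2A link model above concrete, here is a minimal Python sketch of (11)-(14). All numeric values in the usage example are hypothetical placeholders, not the paper's simulation settings.

```python
import math

def g2a_delay(uav_pos, ue_pos, D_m, B_u, M_n, P_m, sigma2_u, g0=1e-3):
    """G2A transmission delay for UE m served by UAV n, following (11)-(14).

    uav_pos, ue_pos : 3D coordinates in meters (the UE altitude is 0)
    D_m      : task data size in bits
    B_u, M_n : uplink bandwidth in Hz, shared equally by the M_n served UEs
    P_m      : UE transmit power in W; sigma2_u : noise power in W
    g0       : channel power gain at the 1 m reference distance
    """
    d_mn = math.dist(uav_pos, ue_pos)                           # (11) UAV-UE distance
    h_mn = g0 / d_mn ** 2                                       # (12) free-space LoS gain
    rate = (B_u / M_n) * math.log2(1 + h_mn * P_m / sigma2_u)   # (13) G2A data rate
    return D_m / rate                                           # (14) delay = size / rate

# Hypothetical example: UAV hovering at 100 m, 5 served UEs, 1 Mbit task.
delay = g2a_delay((0.0, 0.0, 100.0), (60.0, 80.0, 0.0),
                  D_m=1e6, B_u=1e6, M_n=5, P_m=0.1, sigma2_u=1e-13)
```

As (12)-(14) suggest, moving a UE closer to its serving UAV raises the channel gain and rate, and therefore lowers the delay.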
Similarly, the G2A transmission energy consumption between UE m and UAV n can be defined as

$E^{G2A}_{mn}(t) = P^r_n T^{G2A}_{mn}(t) = \frac{D_m P^r_n}{R_{mn}(t)}$,  (15)

where $P^r_n$ is the receiving power of UAV n.

C. Computation at the UAVs

After receiving the entire input data from the UEs, each UAV decides how many tasks are computed locally. We define $\gamma^n_{mk}(t) \in [0, 1]$ and $\gamma^n_{m0}(t) \in [0, 1]$ as the proportions of the tasks of UE m executed at EC k and at UAV n, respectively. The computation delay of UAV n handling the task of UE m is given by

$T^{UAV}_{mn}(t) = \frac{\gamma^n_{m0}(t) D_m C_m}{f_{mn}(t)}$,  (16)

where $f_{mn}(t)$ denotes the computation resource allocated from UAV n to UE m. For simplicity, the UAV's computation resource $F_u$ is allocated to each served UE equally, that is, $f_{mn}(t) = F_u / M_n(t)$. If $\gamma^n_{m0}(t) = 0$, all tasks of UE m are processed at the ECs, while if $\gamma^n_{m0}(t) = 1$, all tasks of UE m are computed at UAV n.

Then, considering the computation time $T^{UAV}_{mn}(t)$ and the power consumption [30], the energy consumption of UAV n handling the task of UE m can be obtained as

$E^{UAV}_{mn}(t) = \kappa [f_{mn}(t)]^3 T^{UAV}_{mn}(t)$,  (17)

where $\kappa \ge 0$ is the effective switched capacitance.

D. A2G Transmission From UAVs to ECs

Here, $\omega_k = [x_k, y_k, 0]^T$ is the fixed location of EC k, where $x_k$ and $y_k$ are the coordinates of EC k. Then, the distance between UAV n and EC k can be given by

$d_{kn}(t) = \|\omega_n(t) - \omega_k\|$.  (18)

Considering that a certain UAV offloads some tasks to ECs for further computing, the A2G channel gain between UAV n and EC k can be defined as

$h_{nk}(t) = \frac{g_0}{[d_{kn}(t)]^2}$.  (19)

Then, the transmission data rate between UAV n and EC k is given by

$R_{nk}(t) = B_k \log_2\left(1 + \frac{h_{nk}(t) P^t_n(t)}{\sigma^2_e}\right)$,  (20)

where $B_k$ is the bandwidth pre-assigned to EC k, $0 \le P^t_n(t) \le P_{max}$ denotes the transmit power of UAV n at time t, $P_{max}$ is the maximum transmission power of each UAV, and $\sigma^2_e$ is the additive white Gaussian noise power at each EC.

Considering the task data size offloaded to the ECs and the transmission data rate $R_{nk}(t)$, the A2G transmission delay between UE m and EC k through UAV n can be defined as

$T^{A2G}_{mnk}(t) = \frac{\gamma^n_{mk}(t) D_m}{R_{nk}(t)}$.  (21)

Similarly, the A2G transmission energy consumption between UE m and EC k through UAV n can be obtained as

$E^{A2G}_{mnk}(t) = P^t_n T^{A2G}_{mnk}(t) = \frac{\gamma^n_{mk}(t) D_m P^t_n}{R_{nk}(t)}$.  (22)

E. Computation at the ECs

The ECs begin to handle the computation tasks when they obtain the task data from the UAVs. Considering the task proportion $\gamma^n_{mk}(t)$, the computation delay at EC k can be given by

$T^{EC}_{mnk}(t) = \frac{\gamma^n_{mk}(t) D_m C_m}{f_{mk}(t)}$,  (23)

where $f_{mk}(t)$ denotes the computation resource allocated to UE m. Here, the total computation resource $F^e_k$ of EC k is allocated to each UE equally, that is, $f_{mk}(t) = F^e_k / M$.

F. Problem Formulation

When the computation tasks of all UEs are completed, the energy consumption of UAV n can be obtained as

$E_n(t) = \sum_{m=1}^{M} \rho_{nm}(t) \lambda_m \left[ E^{G2A}_{mn}(t) + E^{UAV}_{mn}(t) + E^{A2G}_{mnk}(t) \right]$,  (24)

where $\lambda_m$ denotes the arrival rate of the tasks.

Moreover, considering that the communication and computation modules are often separated at the UAVs, the computation at the UAVs can be processed simultaneously with the task transmission to the ECs. Then, the execution delay of UAV n is given by

$T_n(t) = \sum_{m=1}^{M} \rho_{nm}(t) \left[ T^{G2A}_{mn}(t) + \max_k \left\{ T^{UAV}_{mn}(t), T^{A2G}_{mnk}(t) + T^{EC}_{mnk}(t) \right\} \right]$.  (25)

Then, similar to [13], [23], we denote the weighted sum of the energy consumption $E_n(t)$ and the execution delay $T_n(t)$ as the system cost of UAV n, that is,

$U_n(t) = w_1 E_n(t) + w_2 T_n(t)$,  (26)

where $w_1$ and $w_2$ are weights indicating the relative significance of energy consumption and execution delay, respectively; $w_1 \ge w_2$ indicates energy-saving scenarios, while $w_1 < w_2$ is for delay-sensitive cases.

Thus, by jointly optimizing the UAVs' positions $\omega_n(t)$, task partition ratios ($\gamma^n_{m0}(t)$ and $\gamma^n_{mk}(t)$), and transmit powers ($P^t_n(t)$), the task offloading optimization problem can be designed to minimize the total system cost, which is formulated as

$\min_{\omega_n(t),\, \gamma^n_{m0}(t),\, \gamma^n_{mk}(t),\, P^t_n(t)} \sum_{n=1}^{N} U_n(t)$,  (27a)

s.t. $0 \le \gamma^n_{mk}(t) \le 1$,  (27b)

$0 \le \gamma^n_{m0}(t) \le 1$,  (27c)
$\gamma^n_{m0}(t) + \sum_k \gamma^n_{mk}(t) = 1, \quad \forall n$,  (27d)

$0 \le P^t_n(t) \le P_{max}$,  (27e)

$(4)$–$(10)$,  (27f)

where (27b), (27c), and (27d) denote the offloading task constraints of the UEs, (27e) is the constraint on the transmit power of the UAVs, and (4)–(10) describe the movement constraints of the UAVs.

Generally, it is challenging to solve the non-convex optimization problem (27). Certain unknown variables (i.e., UEs' locations and channel conditions) may influence the energy consumption and execution delay, especially in the dynamic network induced by the UAVs' mobility. Moreover, given the large solution space of the decision, it will be intractable to obtain the optimal strategy by traditional optimization schemes. To address these challenges, an RL method is investigated in the next section to learn a near-optimal policy with little environment information.

III. MADRL FOR TASK OFFLOADING OPTIMIZATION PROBLEM

Here, we first re-model the above problem as a multi-agent extension of the MDP, which is then solved by an MADRL method.

A. MDP Formulation

In UAV-assisted MEC systems, the UAVs determine their positions, transmit powers, and task partition ratios to obtain the minimum total system cost. Considering that the UAVs' actions (i.e., UAVs' movements) may influence the environmental state, the total system cost is determined by the current state of the system environment and the joint actions of all UAVs. Moreover, the former state and previous actions jointly trigger the system environment into a new stochastic state [31]. In this case, the task offloading optimization issue (27) can be formulated as a multi-agent Markov decision process (MDP) $\langle N, S, \{A_n\}_{n \in N}, P, \{R_n\}_{n \in N}, \delta \rangle$, where N is the agent set, S is the state set of all agents, $A_n$ is the action space of agent n, P represents the state transition probability, $R_n$ is the reward function of agent n, and $\delta \in [0, 1]$ denotes the discount factor.

1) Agent Set N: Each UAV acts as an agent to learn its scheme of position, transmission power, and task partition ratios and to obtain the minimum total system cost. Thus, $N = \{1, \ldots, N\}$.

2) State Space S: According to the task offloading optimization problem, the state s(t) is composed of the 3D coordinate positions of the UAVs, that is,

$s(t) = \{\omega_1(t), \omega_2(t), \ldots, \omega_N(t)\}$.  (28)

3) Action Space $A_n$: Since each UAV is required to determine its movements (horizontal flight distance $l_n(t)$, horizontal direction angle $\vartheta_n(t)$, and vertical flight distance $\Delta z_n(t)$), transmission power $P^t_n(t)$, and task partition ratios $\gamma^n_{mk}(t)$, the action $a_n(t)$ of UAV n can be given by

$a_n(t) = \{l_n(t), \vartheta_n(t), \Delta z_n(t), P^t_n(t), \gamma^n_{mk}(t), \forall k\}$.  (29)

According to the constraints of the minimization problem (27), we have the value ranges of each element in $a_n(t)$, that is, $l_n(t) \in [0, L^h_{max}]$, $\vartheta_n(t) \in [0, 2\pi)$, $\Delta z_n(t) \in [-L^v_{max}, L^v_{max}]$, $P^t_n(t) \in [0, P_{max}]$, and $\gamma^n_{mk}(t) \in [0, 1]$. Also, we can observe that the action space $A_n$ of UAV n is a continuous set. Moreover, as the number of UEs and ECs increases, the size of the action space increases exponentially.

4) Reward Function $R_n$: To solve the formulated task offloading optimization problem (27), the N agents should cooperatively minimize the total system cost while satisfying certain constraints, such as the overlapping and collision constraints. The reward function $R_n(t)$ of UAV n is defined as the negative of the system cost $U_n(t)$ if all constraints are satisfied. Otherwise, if certain constraints are not satisfied, corresponding penalties appear in the reward function $R_n(t)$. Moreover, to guarantee that the UAVs provide computing service to all UEs, the coverage constraint of the UAVs should be satisfied: if a certain UE is beyond the UAVs' coverage, there will be a penalty in the reward function. Thus, based on the above considerations, the reward function of UAV n is given by

$R_n(t) = \begin{cases} -U_n(t), & \text{if all constraints are satisfied,} \\ -\eta_1 - \eta_2 - \eta_3 \sum_{n=1}^{N} [M - M_n(t)], & \text{otherwise,} \end{cases}$  (30)

where $\eta_1$, $\eta_2$, and $\eta_3$ denote the penalties related to the overlapping constraint (9), the collision constraint (10), and the coverage constraint, respectively. If the horizontal distance of any two UAVs does not meet the overlapping constraint (9), each of the two UAVs will experience a penalty $\eta_1$. Moreover, if the distance between any two UAVs does not satisfy the collision constraint (10), there will be a penalty $\eta_2$ in the reward functions of the two UAVs. Finally, when any UEs are not covered by UAVs, all UAVs will incur the penalty $\eta_3 \sum_{n=1}^{N} [M - M_n(t)]$.

B. Multi-Agent DRL Algorithm

To solve the above multi-agent MDP, considering the high-dimensional continuous action space of the task offloading optimization problem, the multi-agent TD3 (MATD3) approach is proposed, as shown in Fig. 3. Each UAV adopts the TD3 algorithm [32], which comprises one actor network with weights $\mu_n$ and two critic networks with weights $\theta_{n1}$ and $\theta_{n2}$. With the two critic networks, each UAV can deal with the overestimation problem of the Q-values present in the one-critic framework. In addition, to improve the learning stability, a target actor network with weights $\mu'_n$ and target critic networks with weights $\{\theta'_{ni}\}_{i=1,2}$ are adopted.

Different from other multi-agent RL algorithms, where each agent tries to maximize its own reward function $R_n(t)$, a cooperative multi-agent RL architecture is adopted to maximize the expected discounted return of the sum reward of all UAVs, which is defined as

$R(t) = \sum_{n=1}^{N} R_n(t)$.  (31)
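To illustrate how the system cost (26) and the reward (30)-(31) fit together, the following Python sketch evaluates the per-UE delay and energy terms (15)-(23), the overlapped execution delay of (25), and the penalty-based reward. The penalty weights and all numeric inputs are hypothetical, not taken from the paper, and the per-task energy omits the arrival-rate factor $\lambda_m$ of (24).

```python
def ue_cost(D_m, C_m, gamma_uav, gamma_ec, T_g2a, R_nk,
            f_uav, f_ec, P_r, P_t, kappa=1e-27, w1=0.5, w2=0.5):
    """Weighted delay/energy cost of one UE, following (14)-(26).

    gamma_uav, gamma_ec : task split ratios, gamma_uav + gamma_ec = 1 per (27d)
    T_g2a : G2A delay from (14); R_nk : A2G rate from (20) in bit/s
    f_uav, f_ec : CPU cycles/s allotted to this UE at the UAV and at the EC
    kappa, w1, w2 : switched capacitance and cost weights (hypothetical values)
    """
    assert abs(gamma_uav + gamma_ec - 1.0) < 1e-9      # constraint (27d)
    T_uav = gamma_uav * D_m * C_m / f_uav              # (16) UAV computing delay
    E_uav = kappa * f_uav ** 3 * T_uav                 # (17) UAV computing energy
    T_a2g = gamma_ec * D_m / R_nk                      # (21) A2G forwarding delay
    E_a2g = P_t * T_a2g                                # (22) A2G forwarding energy
    T_ec = gamma_ec * D_m * C_m / f_ec                 # (23) EC computing delay
    E_g2a = P_r * T_g2a                                # (15) G2A receiving energy
    # (25): UAV computing overlaps with forwarding to the EC plus EC computing
    delay = T_g2a + max(T_uav, T_a2g + T_ec)
    # per-task energy terms of (24), without the arrival-rate factor lambda_m
    energy = E_g2a + E_uav + E_a2g
    return w1 * energy + w2 * delay                    # (26)


def uav_reward(U_n, overlap_ok, collision_ok, uncovered,
               eta1=10.0, eta2=10.0, eta3=1.0):
    """Per-UAV reward following (30); eta1..eta3 are hypothetical penalties."""
    if overlap_ok and collision_ok and uncovered == 0:
        return -U_n
    penalty = (0.0 if overlap_ok else eta1) + (0.0 if collision_ok else eta2)
    return -(penalty + eta3 * uncovered)


def team_reward(per_uav_rewards):
    """Shared cooperative reward of (31): the sum over all UAV agents."""
    return sum(per_uav_rewards)
```

A larger `w2` makes the agents delay-sensitive, matching the discussion after (26); the `uncovered` count corresponds to $\sum_n [M - M_n(t)]$ in (30).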
Fig. 3. The MATD3 framework in the multi-UAV assisted MEC system.

Moreover, considering the non-stationarity of the network environment, to guarantee convergence, a strategy based on centralized training and decentralized execution is adopted [33]. Specifically, in the centralized training stage, the evaluation critic networks and target critic networks are designed to obtain a global view and are deployed in the ground cloud server. The evaluation critic networks are aware of the states $s_n(t)$ and actions $a_n(t)$ of the other agents via communication. Then, all UAVs utilize the global state s(t) and joint actions $a(t) = \{a_1(t), a_2(t), \ldots, a_N(t)\}$ so that the policies of the other UAVs can be estimated and the Q-function $Q_{\theta_{ni}}(s(t), a(t))$ can be obtained for all UAVs. Also, based on the estimated policies of the other UAVs, each UAV can adjust its local actor policy $\pi^\mu_n : S \to A_n$ to achieve the globally optimal policy $\pi^\mu = \{\pi^\mu_1, \pi^\mu_2, \ldots, \pi^\mu_N\}$. The network environment is then considered to be stationary for each UAV during the centralized offline training stage. During the decentralized execution stage, the critic networks of the UAVs are no longer required, and the weights of the actor networks are fixed. Each UAV executes its policy using the trained evaluation actor network $\pi^\mu_n(s(t))$ with the learned weights $\mu_n$, based only on its local state information $s_n(t)$. Considering that the UAVs do not communicate with each other in this stage, this greatly reduces the communication overhead and enables scalability to multi-UAV assisted MEC systems.

The MATD3 approach for the task offloading optimization problem is summarized in Algorithm 1. We first initialize the weights of the six neural networks and the replay buffer B in all UAVs. In each episode, each UAV selects an action based on its evaluation actor network $\pi^\mu_n(s(t))$ with random noise $\xi$. According to the action taken, all UAVs execute the three-dimensional movements (horizontal flight distance $l_n(t)$, horizontal direction angle $\vartheta_n(t)$, and vertical flight distance $\Delta z_n(t)$), transmission powers $P^t_n(t)$, and task partition ratios $\gamma^n_{mk}(t)$. When moving out of the range of the served area, a UAV flies with a random horizontal angle. Moreover, if moving beyond the vertical height limits, the UAVs keep flying at the boundary height ($Z_{min}$ or $Z_{max}$). Also, when covering certain hotspots, the corresponding UAVs keep their 3D positions and only change their transmission powers and task partition ratios. By executing the above actions, all UAVs receive the next state s'(t), joint action a(t), and immediate reward R(t).

Algorithm 1 MATD3 Approach for the Task Offloading Problem
• Initialize each UAV's actor networks with weights $\mu_n$ and $\mu'_n$, respectively.
• Initialize each UAV's critic networks with weights $\{\theta_{ni}\}_{i=1,2}$ and $\{\theta'_{ni}\}_{i=1,2}$, respectively.
• Initialize each UAV's replay buffer B.
• for each episode do
•  Initialize the state s(t) and set t = 1.
•  while $t < T_p$ do
•   Each UAV selects action $a_n(t) = \pi^\mu_n(s_n(t)) + \xi$.
•   All UAVs set their movements, transmission powers, and task partition ratios according to the joint action a(t).
•   All UAVs obtain the reward R(t), the next state s(t + 1), and the joint action a(t) via communication.
•   Store (s(t), s'(t), a(t), R(t)) in B for all n ∈ N.
•   s(t) ← s'(t).
•   for n = 1, . . . , N do
•    Sample a random mini-batch of $(s_j, s'_j, a_j, r_j)$ for all UAVs from B.
•    Update the weights $\{\theta_{ni}\}_{i=1,2}$ of the evaluation critic networks by minimizing the loss function $L(\theta_{ni})$ in (35).
•    if t mod d then
•     Update the weights $\mu_n$ of the evaluation actor network with (32).
•     Update the weights of the three target networks via (37).
•    end if
•   end for
•  end while
• end for

To stabilize the training process and improve sample efficiency, each UAV stores the current experience (s(t), s'(t), a(t), R(t)) in the replay buffer B with size $M_r$ [34]. For each UAV, a random mini-batch $\{s_j, s'_j, a_j, r_j\}$ of size $M_b$ is sampled from B. Then, by feeding $s_j$ into the evaluation actor network to generate the policy $\pi^\mu_n(s_j)$, each UAV can update the weights of its evaluation actor network using the policy gradient strategy [35], that is,

$\nabla_{\mu_n} J(\mu_n) = \frac{1}{M_b} \sum_{j=1}^{M_b} \nabla_{\mu_n} \pi^\mu_n(s_j) \nabla_{a_n} Q_{\theta_{n1}}(s_j, a_{j1}, \ldots, a_n, \ldots, a_{jN}) \big|_{a_n = \pi^\mu_n(s_{jn})}$.  (32)

Moreover, to prevent over-fitting to narrow peaks of the Q-values, random noise $\tilde{\epsilon}$ is added to the target actor network output, which achieves a smoother state-action value estimation. The modified target action $\tilde{a}_j$ is given by

$\tilde{a}_j = \pi^{\mu'}_n(s'_j) + \tilde{\epsilon}$,  (33)

where $\tilde{\epsilon} \sim \text{clip}(\mathcal{N}(0, \hat{\sigma}^2), -1, 1)$ is clipped Gaussian noise with mean 0 and standard deviation $\hat{\sigma}$. Then, the target value $y_j$ can be obtained as

$y_j = r_j + \delta \min_{i=1,2} Q_{\theta'_{ni}}(s'_j, \tilde{a}_j)$.  (34)

Then, based on the policy $\pi^\mu_n(s_j)$, the two evaluation critic networks will concurrently obtain the two Q-values
ZHAO et al.: MULTI-AGENT DEEP REINFORCEMENT LEARNING 6955

Q^{θ_{n1}}(s^j, π_n^μ(s^j)) and Q^{θ_{n2}}(s^j, π_n^μ(s^j)) by minimizing the loss function L(θ_{ni}), which is defined as

    L(θ_{ni}) = (1/M_b) Σ_{j=1}^{M_b} [y^j − Q^{θ_{ni}}(s^j, a^j)]²,  i = 1, 2.   (35)

Next, according to (32) and (35), each UAV can update the weights of the three evaluation networks using the following equations:

    μ_n ← μ_n − λ ∇_{μ_n} J(μ_n),
    θ_{ni} ← θ_{ni} − λ ∇_{θ_{ni}} L(θ_{ni}),  i = 1, 2,   (36)

where λ denotes the learning rate. To reduce errors resulting from temporal-difference learning, each UAV updates the weights of the evaluation actor network at a lower frequency than that of the evaluation critic networks. Here, each UAV chooses to update the evaluation actor network every d time-steps.

Thus, in order to stabilize the training process, each UAV also updates the weights of the three target networks every d time-steps by copying the weights of the corresponding evaluation networks through

    μ′_n = τ μ_n + (1 − τ) μ′_n,
    θ′_{ni} = τ θ_{ni} + (1 − τ) θ′_{ni},  i = 1, 2,   (37)

where τ denotes the updating rate.

Fig. 4. Locations of 30 UEs, 2 ECs and 2 UAVs in the multi-UAV assisted MEC system.

TABLE I
NETWORK ENVIRONMENT PARAMETERS
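Putting the per-agent update machinery of (33)-(37) together, one training step can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the linear actor/critic stand-ins, dimensions, and hyperparameter values (σ̂ = 0.2, τ = 0.005, d = 2, δ = 0.95) are illustrative assumptions, and the gradient steps of (36) are indicated only by the losses they would minimize.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear stand-ins for UAV n's networks (the paper uses
# two-hidden-layer MLPs; shapes here are toy assumptions).
S, A = 3, 2                                    # toy state/action dimensions
W_mu = rng.normal(size=(S, A))                 # evaluation actor weights mu_n
W_mu_t = W_mu.copy()                           # target actor weights mu'_n
W_q = [rng.normal(size=S + A) for _ in range(2)]   # critics theta_n1, theta_n2
W_q_t = [w.copy() for w in W_q]                # target critics theta'_n1, theta'_n2

def actor(W, s):
    # pi(s): bounded toy policy via tanh
    return np.tanh(s @ W)

def critic(w, s, a):
    # Q(s, a): linear function of the concatenated state-action pair
    return np.concatenate([s, a], axis=1) @ w

def update_step(batch, delta=0.95, sigma_hat=0.2, tau=0.005, step=0, d=2):
    s, a, r, s2 = batch
    # Eq. (33): smoothed target action with clipped Gaussian noise
    noise = np.clip(sigma_hat * rng.standard_normal(a.shape), -1.0, 1.0)
    a_tilde = actor(W_mu_t, s2) + noise
    # Eq. (34): clipped double-Q target y = r + delta * min_i Q'_i(s', a~)
    y = r + delta * np.minimum(critic(W_q_t[0], s2, a_tilde),
                               critic(W_q_t[1], s2, a_tilde))
    # Eq. (35): critic losses; gradient steps on these implement Eq. (36)
    losses = [np.mean((y - critic(w, s, a)) ** 2) for w in W_q]
    # Delayed updates: every d steps the actor would take its gradient step,
    # and Eq. (37) Polyak-averages the target networks toward the online ones
    if step % d == 0:
        for i in range(2):
            W_q_t[i][:] = tau * W_q[i] + (1.0 - tau) * W_q_t[i]
        W_mu_t[:] = tau * W_mu + (1.0 - tau) * W_mu_t
    return y, losses

batch = (rng.normal(size=(4, S)), rng.normal(size=(4, A)),
         rng.normal(size=4), rng.normal(size=(4, S)))
y, losses = update_step(batch)
```

Note how the min over the twin target critics in (34) counteracts Q-value overestimation, while the clipped noise of (33) keeps the target action close to the target policy.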
Finally, we discuss the complexity of the proposed MATD3 algorithm. As for the communication complexity, in the centralized training procedure the ground cloud server needs to frequently communicate with the UAVs to obtain the state containing the 3D coordinate positions of the UAVs. Since the total dimension of the UAVs' positions is 3N, the communication complexity is O(N). In the decentralized execution process, each UAV obtains its action locally, leading to no communication between UAVs. Hence, the overall communication complexity of the proposed MATD3 algorithm is O(N).

Moreover, in the centralized training process, each UAV estimates the Q-function values with its critic networks, whose input and output sizes are 3N + N(4 + MK) and 1, respectively. In addition, each UAV determines its action based on its actor networks with input size 3N and output size N(4 + MK). In the decentralized execution procedure, each UAV obtains its action from its actor networks with input size 3 and output size 4 + MK. According to [36], for a fully-connected neural network with fixed numbers of hidden layers and neurons, the computational complexity of the back-propagation algorithm is proportional to the product of the input size and the output size. For the critic network, the centralized-training back-propagation complexity is O(NMK), while for the actor network the decentralized execution procedure is O(N² + NMK). Therefore, the overall complexity is O(N² + NMK).

IV. PERFORMANCE EVALUATION

In this section, numerical experiments are conducted to evaluate the performance of the proposed MATD3. Here, a multi-UAV assisted MEC system is considered with 2 fixed ECs in an area of 400 × 400 m². The 30 UEs are randomly distributed within two hotspots, as illustrated in Fig. 4. The two UAVs are randomly located to offer computing offloading help to the ground UEs. The size of the input data D_m is generated randomly within [2, 10], and the number of CPU cycles C_m is chosen uniformly at random from [100, 200]. The main simulation parameter settings are summarized in Table I. The proposed MATD3 framework has two-hidden-layer neural networks with 400 and 300 neurons. Table II presents the main hyperparameters of the model.

A. Training Efficiency of MATD3 Scheme

In this section, the training performance of the proposed MATD3 optimization method is analyzed. The optimal locations and computing task allocation of the UAVs are also presented for this multi-UAV assisted MEC system. The training curves of the proposed MATD3 optimization method are shown in


Fig. 5. The training steps are very large at the beginning of learning. As the number of episodes increases, the learning steps converge to fewer than 10 within 30 episodes, so the convergence speed tends to increase. Moreover, as the number of episodes increases, the two UAVs cover the area of the served UEs more rapidly. Then, the value of the penalty in the reward function tends to zero, leading to the convergence of the training reward.

Fig. 5. Training curves of MATD3.

TABLE II
HYPERPARAMETERS OF MATD3 MODEL

Figures 6 and 7 present the corresponding optimal locations and computing task allocations of the UAVs, respectively. From Fig. 6, we can observe that each UAV is located almost at the center of one hotspot, which enables the UAVs to provide computing offloading efficiently. Moreover, the dodgerblue shade represents the coverage of the UAVs: the higher a UAV's location is, the larger its coverage becomes. Considering the collision-avoidance constraints of the UAVs and the channel conditions, our proposed method obtains the optimal locations of the UAVs to provide offloading opportunities for the UEs. Furthermore, according to the optimal task splitting ratio allocation strategy, certain UEs are served by ECs only, while other UEs obtain computing offloading services from both the ECs and a UAV.

Fig. 6. Optimal location of the UAVs.

Then, Fig. 7 presents the optimal task splitting ratio allocation strategy. Since the two UAVs cover the two hotspots respectively, UAV1 (UAV2) does not offer computing offloading services to the UEs of the hotspot covered by UAV2 (UAV1). In this case, the first 10 UEs are served by UAV1, while the last 20 UEs are served by UAV2. Furthermore, we observe that for the UEs (5 and 6) with large input task sizes, over 40% of the tasks are first processed at UAV1 (i.e., γ_{m0}^1). After that, the remaining tasks are offloaded to the ECs for subsequent execution. Meanwhile, 75% of the last 20 UEs are served by both UAV2 and the ECs.

Fig. 7. Optimal task splitting ratios of ECs γ_{mk}^n (m = 1, 2 and n = 1, 2) for UEs.

Next, Fig. 8 indicates the effect of the per-device bandwidth on the optimal task partition ratios. The per-device bandwidth B1 of EC 1 changes from 0.1 to 3 MHz while the other per-device bandwidth B2 remains 0.5 MHz, and vice versa. As the bandwidth assigned to the ECs increases, more bandwidth is assigned to the UEs when computing tasks are offloaded from the UAVs to the ECs, leading to higher downlink data rates and thus lower transmission delay and energy consumption. Moreover, when B1 = B2 = 0.5, the same total system cost is achieved in both cases, that is, the two lines intersect at the same point. Specifically, EC 1, which carries the greater weight on the total system cost when B1 = B2 = 0.5, has a greater impact on reducing the total system cost with more assigned bandwidth. When Bk > 0.5 with the other bandwidth fixed at 0.5 MHz, more bandwidth is assigned to EC k, and the case of EC 1 achieves a lower total system cost than that of EC 2. However, when Bk < 0.5, EC k receives less bandwidth; in the case of EC 2 (B1 = 0.5 > B2), EC 1 has a greater impact on reducing the total system cost.

Figure 9 plots the total system cost with various computation capacities of the UAVs and different per-device bandwidths B1. The computation capacity of the UAVs Fu increases from 3 to 10 GHz, and the bandwidth B1 of EC 1 increases from 0.5 to 2 MHz with B2 = 0.5. With the growing bandwidth of EC 1, a higher downlink data rate is obtained, resulting in less transmission delay, energy consumption, and total system cost. Moreover, as the computation capacity of the UAVs Fu increases, more computation resource is allocated to the UEs, leading to less computation delay and total system cost.
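As an aside on the splitting-ratio notation used above, the following toy bookkeeping illustrates how the ratios partition one UE's task between its serving UAV and the ECs. The numeric values are purely illustrative, and the assumption that the ratios for one UE sum to one is implied by the figure discussion but not stated explicitly in this excerpt.

```python
# Toy illustration of the task-partition notation: for UE m served by UAV n,
# gamma[0] (the paper's gamma_{m0}^n) is the fraction processed at the UAV,
# and gamma[k] for k >= 1 the fraction forwarded to EC k. Values illustrative;
# the sum-to-one constraint is an assumption for this sketch.
D_m = 8.0                              # input-data size of UE m (illustrative)
gamma = {0: 0.4, 1: 0.35, 2: 0.25}     # UAV share, EC 1 share, EC 2 share
assert abs(sum(gamma.values()) - 1.0) < 1e-9

bits_at_uav = gamma[0] * D_m                          # processed on board the UAV
bits_at_ec = {k: gamma[k] * D_m for k in (1, 2)}      # offloaded to each EC
```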

Fig. 8. Total system cost with different per-device bandwidths Bk.

Fig. 9. Total system cost with different computation capacities of UAVs.

Fig. 10. Total system cost as a function of the number of UAVs.

Fig. 11. Energy consumption and execution delay vs. weight w2.

To further analyze the scalability of our proposed MATD3 method, we evaluate the performance with different numbers of UAVs and UEs, as shown in Fig. 10. The UEs are distributed uniformly over N hotspots, with the number of UEs M increasing from 30 to 80. From Fig. 10, we can observe that as the number of UEs M increases, more computation tasks are required to be processed, which results in a higher total system cost. Moreover, as the number of UAVs N increases, more UAVs participate in computation offloading. For the same numbers of UEs and tasks, the greater the number of participating UAVs is, the smaller the total system cost that is achieved. However, if the numbers of UEs and tasks are small, it may not be worthwhile to deploy many more UAVs. For example, when M = 30, the performance of two UAVs is almost the same as that of three UAVs. Furthermore, as we increase the number of UEs M to 80 with N = 3, the MATD3 method can still handle the multi-UAV optimization problem. This confirms the high scalability of the MATD3 strategy with respect to the number of UAVs and the sizes of the state and action spaces.

Fig. 11 depicts the relationship between the energy consumption and the execution delay of the task offloading problem under the weight parameter w2. The weight w2 increases from 0.2 to 1.8 with w1 = 1. As can be seen, a small w2 puts more weight on the energy consumption. As the weight w2 increases, the execution delay is emphasized more and more tasks are offloaded to the UAVs, which results in less delay and more energy consumption. However, when w2 is large enough, the execution delay does not decrease any further, since the computing capacity that the UAVs can provide is limited and more tasks lead to higher processing delay.

B. Optimization Performance With Various Approaches

In this section, we evaluate the performance of various optimization approaches in both fixed-UE and mobile-UE scenarios. In the mobile-UE scenarios, the UEs walk randomly following a normal-distribution movement in each episode.

The MATD3 approach is compared with the following five other optimization methods. Degraded versions of the MATD3 approach with a fixed power scheme (P_n^t = 3 W) and a fixed height of the UAVs (z_n = 80 m) are considered, denoted as MATD3-FP and MATD3-FH, respectively. The multi-agent DDPG (MADDPG) approach is also considered. In the MATD3-EC method, the UAVs offload all tasks to the ECs for processing directly. In the random scheme, each UAV randomly selects each element of the action space within the constraints, that is, the horizontal flying distance l_n(t) ∈ [0, L_max^h], the flying angle ϑ_n(t) ∈ [0, 2π), the vertical flying distance Δz_n(t) ∈ [−L_max^v, L_max^v], the transmission power P_n^t(t) ∈ [0, Pmax], and the task splitting ratio γ_{mk}^n(t) ∈ [0, 1].

Figure 12 presents the total system cost as a function of the uplink channel bandwidth Bu with different optimization methods. As the uplink channel bandwidth Bu increases, a higher uplink data rate from the UEs is achieved, which leads to less G2A transmission delay and energy consumption. Then, the total system cost decreases for all optimization methods. Moreover, compared with the case of N = 2, more UAVs participate in the computation tasks when the number of UAVs is N = 3, resulting in a smaller total system cost for all optimization methods.
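The random baseline described above simply draws each action element uniformly within its constraint set. A minimal sketch is given below; the numeric bounds (L_max^h = 30 m, L_max^v = 10 m, Pmax = 5 W) are placeholders standing in for the values of Table I, which is not reproduced in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_action(M, K, L_h_max=30.0, L_v_max=10.0, P_max=5.0):
    """One random-baseline action for a single UAV: every element is drawn
    uniformly within its constraint set (bounds here are illustrative)."""
    return {
        "l": rng.uniform(0.0, L_h_max),               # horizontal flying distance
        "theta": rng.uniform(0.0, 2 * np.pi),         # flying angle in [0, 2*pi)
        "dz": rng.uniform(-L_v_max, L_v_max),         # vertical flying distance
        "P_t": rng.uniform(0.0, P_max),               # transmission power
        "gamma": rng.uniform(0.0, 1.0, size=(M, K)),  # task splitting ratios
    }

a = random_action(M=30, K=2)   # 30 UEs, 2 ECs, as in the simulation setup
```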

Fig. 12. Total system cost with different optimization methods and uplink channel bandwidths Bu.

Fig. 13. Total system cost with different optimization methods and arrival rates of tasks λm.

Fig. 14. Total system cost with different optimization methods and maximum transmission power of UAVs Pmax.

Furthermore, since the random approach selects a random action rather than the action achieving the maximum immediate reward, a large total system cost is experienced with both numbers of UAVs, especially in the mobile-UE scenarios. With the fixed power allocation and fixed height of the UAVs, the MATD3-FP and MATD3-FH methods always obtain a larger total system cost compared with our proposed MATD3 approach for both numbers of UAVs. For the MADDPG method, as the number of UAVs increases, it becomes more difficult to obtain the optimal action, leading to worse performance in the case of three UAVs. The MATD3-EC method, without UAVs participating in task processing, always achieves a larger total system cost compared with our proposed method. Our MATD3 method always achieves the smallest total system cost among the six approaches in both the fixed-UE and mobile-UE scenarios.

Figure 13 plots the total system cost as a function of the arrival rate of tasks λm with different optimization methods. As the arrival rate of tasks λm increases, more total energy needs to be consumed by the UAVs, resulting in a higher total system cost for all optimization approaches. In addition, with more UAVs participating in task offloading, a smaller total system cost can be achieved in Fig. 14(b). Moreover, with the relatively high fixed transmission power, the largest total system cost is obtained with the MATD3-FH method in the case of N = 2. The random scheme always obtains a large total system cost at high task arrival rates. Without UAVs participating in task processing, it is challenging for the MATD3-EC method to deal with so many tasks, especially in the case of N = 3. Compared with the other four learning approaches, our proposed MATD3 approach achieves the smallest total system cost with both numbers of UAVs.

Figure 14 shows the total system cost as a function of the maximum transmission power of the UAVs Pmax with different optimization methods. Here, the MATD3-FP approach is considered with the

fixed power scheme (P_n^t = Pmax). As the maximum transmission power of the UAVs Pmax increases, a higher transmission power P_n^t may be needed. Considering that the transmission energy consumption is an increasing function of P_n^t, a higher system cost is obtained as Pmax increases in all cases. It can also be observed that, with more UAVs offering task offloading services, the scenario of N = 3 achieves a smaller total system cost than that of N = 2.

Moreover, since the MATD3-FP approach always allocates the fixed transmission power P_n^t = Pmax to the UAVs, it may achieve the maximum downlink transmission energy consumption among the six approaches, especially at large maximum transmission powers Pmax. As for the random method, a relatively high total system cost is obtained compared with the other four learning schemes (MATD3-FH, MADDPG, MATD3-EC, and MATD3). With the fixed height of the UAVs, the MATD3-FH method may need more transmission power at the UAVs to guarantee a sufficient downlink transmission data rate, which results in larger transmission energy consumption. In the MATD3-EC method, since all UAVs offload all tasks directly to the ECs for processing, the downlink transmission energy consumption accounts for a large proportion of the total system cost; then, as Pmax increases, it may yield a larger total system cost, especially in the case of N = 3. Clearly, MADDPG experiences worse performance with the larger number of UAVs compared with the other methods. Our proposed approach greatly outperforms the above four schemes, achieving the smallest total system cost for both numbers of UAVs. Especially when N = 2, our proposed approach always obtains the optimal transmission power of the UAVs regardless of the maximum transmission power Pmax.

V. CONCLUSION

This paper investigated a UAV-assisted MEC system in which multiple UAVs and multiple ECs offload the computation tasks of UEs collaboratively. An optimization problem was formulated to minimize the sum of execution delays and energy consumptions by jointly designing the trajectories, computation task allocation, and communication resource management. A cooperative MADRL framework was developed to tackle the non-convexity of the task offloading optimization problem. Considering the high-dimensional continuous action space, the MATD3 algorithm was presented to obtain the optimal policy efficiently. Numerical evaluations indicated that the proposed collaborative UAV-EC offloading method can adapt to the mobility of the UEs, changes in communication and computation resources, and the dynamics of the computation tasks. The proposed scheme significantly reduces the total system cost compared with other optimization approaches.

REFERENCES

[1] G. Yang, Q. Zhang, and Y.-C. Liang, "Cooperative ambient backscatter communications for green Internet-of-Things," IEEE Internet Things J., vol. 5, no. 2, pp. 1116–1130, Apr. 2018.
[2] X. Kang, Y.-C. Liang, and J. Yang, "Riding on the primary: A new spectrum sharing paradigm for wireless-powered IoT devices," IEEE Trans. Wireless Commun., vol. 17, no. 9, pp. 6335–6347, Sep. 2018.
[3] C. Park and J. Lee, "Mobile edge computing-enabled heterogeneous networks," IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1038–1051, Feb. 2021.
[4] Q. Chen, H. Zhu, L. Yang, X. Chen, S. Pollin, and E. Vinogradov, "Edge computing assisted autonomous flight for UAV: Synergies between vision and communications," IEEE Commun. Mag., vol. 59, no. 1, pp. 28–33, Jan. 2021.
[5] P. A. Apostolopoulos, G. Fragkos, E. E. Tsiropoulou, and S. Papavassiliou, "Data offloading in UAV-assisted multi-access edge computing systems under resource uncertainty," IEEE Trans. Mobile Comput., early access, Mar. 31, 2021, doi: 10.1109/TMC.2021.3069911.
[6] G. Yang, Y.-C. Liang, R. Zhang, and Y. Pei, "Modulation in the air: Backscatter communication over ambient OFDM carrier," IEEE Trans. Commun., vol. 66, no. 3, pp. 1219–1233, Mar. 2018.
[7] X. Xu, H. Zhao, H. Yao, and S. Wang, "A blockchain-enabled energy-efficient data collection system for UAV-assisted IoT," IEEE Internet Things J., vol. 8, no. 4, pp. 2431–2443, Feb. 2021.
[8] N. Zhao, Z. Liu, and Y. Cheng, "Multi-agent deep reinforcement learning for trajectory design and power allocation in multi-UAV networks," IEEE Access, vol. 8, pp. 139670–139679, 2020.
[9] G. Yang, R. Dai, and Y.-C. Liang, "Energy-efficient UAV backscatter communication with joint trajectory design and resource optimization," IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 926–941, Feb. 2021.
[10] M. Li, N. Cheng, J. Gao, Y. Wang, L. Zhao, and X. Shen, "Energy-efficient UAV-assisted mobile edge computing: Resource allocation and trajectory optimization," IEEE Trans. Veh. Technol., vol. 69, no. 3, pp. 3424–3438, Mar. 2020.
[11] Y. Wang, Z.-Y. Ru, K. Wang, and P.-Q. Huang, "Joint deployment and task scheduling optimization for large-scale mobile users in multi-UAV-enabled mobile edge computing," IEEE Trans. Cybern., vol. 50, no. 9, pp. 3984–3997, Sep. 2020.
[12] Y. Xu, T. Zhang, D. Yang, Y. Liu, and M. Tao, "Joint resource and trajectory optimization for security in UAV-assisted MEC systems," IEEE Trans. Commun., vol. 69, no. 1, pp. 573–588, Jan. 2021.
[13] Z. Yu, Y. Gong, S. Gong, and Y. Guo, "Joint task offloading and resource allocation in UAV-enabled mobile edge computing," IEEE Internet Things J., vol. 7, no. 4, pp. 3147–3159, Apr. 2020.
[14] Y. Liu, S. Xie, and Y. Zhang, "Cooperative offloading and resource management for UAV-enabled mobile edge computing in power IoT system," IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 12229–12239, Oct. 2020.
[15] J. Ji, K. Zhu, C. Yi, and D. Niyato, "Energy consumption minimization in UAV-assisted mobile-edge computing systems: Joint resource allocation and trajectory design," IEEE Internet Things J., vol. 8, no. 10, pp. 8570–8584, May 2021.
[16] J. Zhang et al., "Stochastic computation offloading and trajectory scheduling for UAV-assisted mobile edge computing," IEEE Internet Things J., vol. 6, no. 2, pp. 3688–3699, Apr. 2019.
[17] C. Sun, W. Ni, and X. Wang, "Joint computation offloading and trajectory planning for UAV-assisted edge computing," IEEE Trans. Wireless Commun., vol. 20, no. 8, pp. 5343–5358, Aug. 2021, doi: 10.1109/TWC.2021.3067163.
[18] C. Zhan, H. Hu, Z. Liu, Z. Wang, and S. Mao, "Multi-UAV-enabled mobile-edge computing for time-constrained IoT applications," IEEE Internet Things J., vol. 8, no. 20, pp. 15553–15567, Oct. 2021, doi: 10.1109/JIOT.2021.3073208.
[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[20] N. C. Luong et al., "Applications of deep reinforcement learning in communications and networking: A survey," IEEE Commun. Surveys Tuts., vol. 21, no. 4, pp. 3133–3174, May 2019, doi: 10.1109/COMST.2019.2916583.
[21] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, "Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks," IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5141–5152, Nov. 2019.
[22] H. Peng and X. Shen, "Multi-agent reinforcement learning based resource management in MEC- and UAV-assisted vehicular networks," IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 131–141, Jan. 2021.
[23] A. Asheralieva and D. Niyato, "Hierarchical game-theoretic and reinforcement learning framework for computational offloading in UAV-enabled mobile edge computing networks with multiple service providers," IEEE Internet Things J., vol. 6, no. 5, pp. 8753–8769, Oct. 2019.
[24] S. Zhu, L. Gui, D. Zhao, N. Cheng, Q. Zhang, and X. Lang, "Learning-based computation offloading approaches in UAVs-assisted edge computing," IEEE Trans. Veh. Technol., vol. 70, no. 1, pp. 928–944, Jan. 2021.
[25] Q. Liu, L. Shi, L. Sun, J. Li, M. Ding, and F. S. Shu, "Path planning for UAV-mounted mobile edge computing with deep reinforcement learning," IEEE Trans. Veh. Technol., vol. 69, no. 5, pp. 5723–5728, May 2020.
[26] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and A. Nallanathan, "Deep reinforcement learning based dynamic trajectory control for UAV-assisted mobile edge computing," IEEE Trans. Mobile Comput., early access, Feb. 16, 2021, doi: 10.1109/TMC.2021.3059691.
[27] T. Ren et al., "Enabling efficient scheduling in large-scale UAV-assisted mobile edge computing via hierarchical reinforcement learning," IEEE Internet Things J., early access, Apr. 7, 2021, doi: 10.1109/JIOT.2021.3071531.
[28] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and L. Hanzo, "Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing," IEEE Trans. Cognit. Commun. Netw., vol. 7, no. 1, pp. 73–84, Mar. 2021.
[29] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, "3-D placement of an unmanned aerial vehicle base station (UAV-BS) for energy-efficient maximal coverage," IEEE Wireless Commun. Lett., vol. 6, no. 4, pp. 434–437, Aug. 2017.
[30] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, "Mobile-edge computing: Partial computation offloading using dynamic voltage scaling," IEEE Trans. Commun., vol. 64, no. 10, pp. 4268–4282, Oct. 2016.
[31] F. Ding, X. Zhang, and L. Xu, "The innovation algorithms for multivariable state-space models," Int. J. Adapt. Control Signal Process., vol. 33, no. 11, pp. 1601–1608, Oct. 2019.
[32] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, arXiv:1802.09477.
[33] T. Yuan, W. D. R. Neto, C. E. Rothenberg, K. Obraczka, C. Barakat, and T. Turletti, "Dynamic controller assignment in software defined Internet of vehicles through multi-agent deep reinforcement learning," IEEE Trans. Netw. Service Manage., vol. 18, no. 1, pp. 585–596, Mar. 2021.
[34] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proc. 31st Int. Conf. Mach. Learn., vol. 32, Jun. 2014, pp. 387–395.
[35] F. Ding, L. Xu, D. Meng, X.-B. Jin, A. Alsaedi, and T. Hayat, "Gradient estimation algorithms for the parameter identification of bilinear systems using the auxiliary model," J. Comput. Appl. Math., vol. 369, May 2020, Art. no. 112575.
[36] M. Sipper, "A serial complexity measure of neural networks," in Proc. IEEE Int. Conf. Neural Netw., San Francisco, CA, USA, Mar. 1993, pp. 962–966.

Nan Zhao (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from Wuhan University, Wuhan, China, in 2005, 2007, and 2013, respectively. She is currently a Professor with the Hubei University of Technology, Wuhan, and also works as a Post-Doctoral Research Fellow at the University of Electronic Science and Technology of China. Her current research involves machine learning in wireless communications.

Zhiyang Ye received the bachelor's degree from Nanchang Hangkong University in 2019. He is currently pursuing the master's degree in electrical engineering with the Hubei University of Technology. His main research focuses on machine learning in wireless communications.

Yiyang Pei (Senior Member, IEEE) received the B.Eng. and Ph.D. degrees in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2007 and 2012, respectively. From 2012 to 2016, she was a Research Scientist with the Institute for Infocomm Research, Singapore. She is currently an Associate Professor with the Singapore Institute of Technology, Singapore. Her current research interests include reconfigurable intelligent surface, dynamic spectrum access, and application of machine learning to wireless communications and networks. She was a recipient of the IEEE Communications Society Stephen O. Rice Prize Paper Award in 2021. She is an Editor of IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING.

Ying-Chang Liang (Fellow, IEEE) was a Professor with The University of Sydney, Australia, a Principal Scientist and a Technical Advisor with the Institute for Infocomm Research, Singapore, and a Visiting Scholar with Stanford University, USA. He is currently a Professor with the University of Electronic Science and Technology of China, China, where he leads the Center for Intelligent Networking and Communications (CINC). His research interests include wireless networking and communications, cognitive radio, symbiotic communications, dynamic spectrum access, the Internet of Things, artificial intelligence, and machine learning techniques.
Dr. Liang is a Foreign Member of Academia Europaea. He was a Distinguished Lecturer of the IEEE Communications Society and the IEEE Vehicular Technology Society. He received the Prestigious Engineering Achievement Award from the Institution of Engineers, Singapore, in 2007, the Outstanding Contribution Appreciation Award from the IEEE Standards Association in 2011, and the Recognition Award from the IEEE Communications Society Technical Committee on Cognitive Networks in 2018. He was a recipient of numerous paper awards, including the IEEE Communications Society Stephen O. Rice Prize Paper Award in 2021, the IEEE Jack Neubauer Memorial Award in 2014, and the IEEE Communications Society APB Outstanding Paper Award in 2012. He was the Chair of the IEEE Communications Society Technical Committee on Cognitive Networks and served as the TPC Chair and the Executive Co-Chair for IEEE GLOBECOM'17. He was a Guest/an Associate Editor of IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, IEEE Signal Processing Magazine, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, and IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS. He was the Associate Editor-in-Chief of Random Matrices: Theory and Applications (World Scientific). He is the Founding Editor-in-Chief of IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS: Cognitive Radio Series, and the Key Founder and the Editor-in-Chief of IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING. He is also serving as the Associate Editor-in-Chief for China Communications. He has been recognized by Thomson Reuters (now Clarivate Analytics) as a Highly Cited Researcher since 2014.

Dusit Niyato (Fellow, IEEE) received the B.Eng. degree from the King Mongkut's Institute of Technology Ladkrabang in 1999 and the Ph.D. degree in electrical and computer engineering from the University of Manitoba, Canada, in 2008. He is currently a Full Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests are in the areas of green communications, the Internet of Things, and sensor networks.
