SPECIAL SECTION ON EDGE COMPUTING AND NETWORKING FOR UBIQUITOUS AI

Received June 13, 2020, accepted June 30, 2020, date of publication July 3, 2020, date of current version July 14, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3007002

Collaborative Edge Computing and Caching With Deep Reinforcement Learning Decision Agents

JIANJI REN, HAICHAO WANG, TINGTING HOU, SHUAI ZHENG, AND CHAOSHENG TANG
College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo 454000, China
Corresponding author: Jianji Ren ([email protected])
The associate editor coordinating the review of this manuscript and approving it for publication was Xu Chen.

ABSTRACT Large amounts of data will be generated by the rapid development of Internet of Things (IoT) technologies and 5th generation mobile networks (5G), and the processing and analysis requirements of this big data will challenge existing networks and processing platforms. As the most promising technology in 5G networks, edge computing will greatly ease the pressure of network transmission and data processing at the edge. In this paper, we consider the coordination of computing and cache resources among multi-level edge computing nodes (ENs); users in this system can offload computing tasks to ENs to improve quality of service (QoS). We aim to maximize the long-term profit at the edge while satisfying the low-latency computing requirements of users, and to jointly optimize the edge-side offloading strategy and resource allocation. However, it is challenging to obtain an optimal strategy in such a dynamic and complex system. To solve the complex resource allocation problem at the edge and give the edge a degree of adaptation and cooperation, we use double deep Q-learning (DDQN) agents to make decisions, which can maximize long-term gains while still making quick decisions. The simulation results prove the effectiveness of DDQN in maximizing revenue when allocating resources at the edge.

INDEX TERMS Collaborative computing, edge computing, optimization strategy.

I. INTRODUCTION

As mobile communication and IoT technologies advance, smart cities, health care systems, etc. are deeply integrated with IoT technologies, and the large amount of data generated poses challenges to data analysis and processing. Although the cloud computing [1] platform provides an efficient computing platform for big data processing, its high bandwidth consumption and high latency are unacceptable for scenarios with low latency requirements such as industrial control and real-time analysis.

In recent years, edge computing [2] has attracted the attention of researchers as a new computing platform. Although edge computing does not have a uniform definition, in essence it deploys computing resources at the edge of the Internet, thereby reducing service delays, mitigating traffic pressure on the backhaul link, and meeting the computational requirements of low-latency applications.

Edge computing, cloud computing, distributed computing, parallel computing, etc. provide the necessary technical means to achieve accurate and fast integrated computing analysis. Cloud computing is based on platform virtualization, distributed storage, and parallel computing, through which flexible computing resources are allocated. Edge computing can be used as an extension of cloud computing [2], [3]; it provides ubiquitous, low-latency, and reliable computing.

However, edge computing does not have the powerful computing capability of cloud computing. When a single computing node receives many computing tasks, it is prone to high latency caused by long task queues. Therefore, edge computing still faces great challenges in deployment and application.

(1) Firstly, the uncertainty of the computing task: due to the uncertainty of factors such as the size of the computing task, the length of the computing time, and the delay limit of the task, the workload between edge computing nodes may vary greatly;


(2) Secondly, the workload scheduling of a single node: the dynamic scheduling of tasks and the allocation of computing resources between nodes when multiple nodes compute collaboratively. Nodes of the same level have the same computing power but differ in their number of tasks, so coordinating the workload balance between nodes while maintaining low latency is of great significance;

(3) Finally, the time interval: the most valuable aspect of edge computing is low-latency computing, so collaborative computing must itself meet low-latency requirements.

To solve the above problems, in this work we use deep reinforcement learning agents to determine the relevant nodes for collaborative computing. Specifically, we use double deep Q-learning [4], [5] to maximize the long-term profit of collaborative computing and to ensure load balancing between nodes.

The rest of the paper is organized as follows. The second part summarizes related work on collaborative computing in edge computing. The third part describes the dynamic system model of edge collaborative computing. The fourth part describes the collaborative computing strategy based on DDQN. We provide the results of simulation experiments in the fifth part. Finally, we summarize our work and discuss directions for future work.

II. RELATED WORK

In recent years, edge computing networks based on multiple access have received extensive attention from academia and industry. Edge computing reduces latency by providing a large number of computing resources for application services that require low latency and high computational power. Cloud computing has become very popular due to its powerful computing and flexible resource allocation strategies; however, because of the long distance between the end device and the cloud, cloud computing services may not provide assurance for low-latency applications in the edge network.

To solve these problems, Edge Computing (EC) [2], [3], [6] has been studied to deploy computing resources closer to the user device, which can effectively improve the Quality of Service (QoS) of applications that require large amounts of computation and low latency. Computing a task at the edge is complicated by factors such as computing, storage, caching, networking, and energy consumption, and it is difficult to devise an offloading strategy under low-latency constraints; therefore, researchers have used game theory to solve such problems. Zheng et al. [7] introduced random games to represent the mobile user's dynamic offloading decision-making process and proposed a multi-agent random learning algorithm to solve the multi-user computation offloading problem. For the problem of MEC multi-user computation offloading in a multi-channel wireless interference environment, Chen et al. [8] proposed a game-theoretic approach and proved its advantages in energy consumption and computing execution time.

In addition, heuristic algorithms or dynamic programming methods can also be used to solve computational offloading problems. Dinh et al. [9] proposed a joint optimization computational offloading framework that improves task allocation decisions and adjusts the CPU frequency of mobile devices. Mao et al. [10] proposed a dynamic computation offloading algorithm based on Lyapunov optimization, which jointly determines the CPU frequency and offloading strategy for the MECO problem with energy harvesting equipment.

For global model training, a sorted list network with multiple losses was proposed by Sheng et al. [11] to speed up training. This method can effectively mine training samples and avoid time-consuming initialization. An online orchestration framework for cross-edge service function chains was proposed by Zhou et al. [12]; the framework dynamically optimizes flow routing and resource allocation jointly to improve overall cost efficiency as much as possible. To sample and improve network decisions in flow-aware software-defined networks, Wang et al. [13] proposed a space-time cooperative sampling (STCS) framework, and the experimental results prove the effectiveness of its sampling.

Recently, researchers have begun to use machine learning or deep learning to optimize the computational offloading strategy for edge computing. Zhang et al. [14] proposed an intermittently connected cloudlet system based on the Markov decision process for the dynamic offloading problem of mobile users. In [15], the authors studied the dynamic service migration problem in the mobile edge cloud and proposed a sequential offloading decision framework based on the Markov decision process.

Li et al. [16] proposed an RL-based optimization framework to solve the resource allocation problem in wireless MEC. The framework optimizes the offloading decision and computing resource allocation by optimizing the total cost of delay and energy consumption of all UEs. Yang et al. [17] proposed a computing resource allocation strategy based on deep reinforcement learning for URLLC edge computing networks with multiple users.

Wang et al. [18] considered the decision-making ability of reinforcement learning and the security of federated learning, and proposed a framework combining the two to optimize communication, caching, and computation on the edge side. Ren et al. [19] considered the dynamic workload and complex radio environment of the IoT, guided the decisions of IoT devices through multiple Deep Reinforcement Learning (DRL) agents, trained the DRL agents in a distributed manner through federated learning, and distributed the agents over multiple edge nodes.

For intelligent IoT applications, a framework based on the cloud-edge architecture was proposed by Liu et al. [20], which applies federated learning to make smart applications available and to address heterogeneity in the IoT environment. Zhou et al. [21] made a comprehensive review of recent studies on edge intelligence (EI). They first reviewed and analyzed the motivation and background of artificial intelligence running at the edge of the network, and then summarized several key technologies at the edge, deep learning frameworks, models, etc.


Edge intelligence builds intelligent edges by integrating DL into the edge computing framework to achieve dynamic and adaptive edge maintenance and management. Wang et al. [22] introduced and discussed the application scenarios of edge intelligence, the methods and technologies used, and the challenges for future work.

The ''Edge Artificial Intelligence'' framework is designed [18] to intelligently use the collaboration between the device and the edge nodes to exchange data and model parameters, thereby improving model training and inference. To deal with complex dynamic control problems, Wang et al. [23] proposed a FADE framework to accelerate training. Shen et al. [24] deployed deep reinforcement learning (DRL) agents on IoT devices to make offloading decisions and used federated learning (FL) to conduct distributed training of the DRL agents. Wu et al. [25] proposed a hierarchical edge artificial intelligence learning framework, HierTrain, which effectively deploys DNN training tasks on a hierarchical MECC structure.

When we consider communication, computing resource allocation, delay constraints, etc., the complexity of the edge computing system becomes very high, and it is challenging to obtain an optimal strategy in such a dynamic and complex system. Deep reinforcement learning is an improvement of reinforcement learning: a deep Q network is used to approximate the Q-value function [4] while avoiding excessively high estimates. It can be used to implement automatic resource allocation in wireless networks.

Therefore, we propose that edge nodes use deep reinforcement learning agents to determine the allocation of computing resources and maximize long-term benefits. Specifically, because of the complexity of the resource allocation problem at the edge, we use DDQN as the decision agent, which gives the edge a degree of adaptation and cooperation and the ability to maximize long-term gains while making quick decisions.

III. SYSTEM MODEL

This paper analyzes the nodes with computing ability in the edge computing environment, as shown in Figure 1. Overall, the system is divided into four levels.

FIGURE 1. Edge computing supported IoT system.

The first is the device layer where the user devices are located, including the various networked devices of the user, such as mobile phones, IoT devices, VR devices, and PCs, which establish connections with the Internet through a wireless network access point or 5G.

Secondly, there are the base stations, cellular networks, wireless network access points, etc. to which the user devices are connected. This equipment is located at the edge of the Internet, connecting users to the Internet, and is closest to the user devices. Placing edge computing nodes here greatly reduces delay and improves the user experience. This paper assumes that the base station to which a user is connected hosts an edge computing node, and this node is marked as a level 1 computing node.

Then there are the level 2 compute nodes. A level 2 compute node is located between the level 1 compute nodes and the cloud computing platform of the core network. It acts as a collaborator for the level 1 compute nodes and can coordinate caching, computation, etc. There are several level 1 compute nodes in an area, and the level 2 compute nodes are close to the user, but not as close as the level 1 nodes.

Finally, the cloud computing platform is located in the core network. The cloud computing platform stores the running environment of the user applications and the latest data, and can release the file image to a computing node near the user when needed.

A. COMMUNICATION MODEL

The system includes β level 1 computing nodes and α level 2 computing nodes, where a level 1 computing node β belongs to {1, 2, . . . , β} and a level 2 computing node α belongs to {1, 2, . . . , α}. There are γ user devices in total, and a user device γ belongs to {1, 2, . . . , γ}. The γ devices are divided into β groups, and the user devices in each group are connected to node β. For quantitative analysis, the time horizon is discretized into time epochs indexed by δ with equal duration (in seconds).

We describe the network model using a single base station β and the user devices γ connected to it. When device γ establishes a connection with base station β, the base station allocates W Hz of spectrum to the device; the channel between the device and the base station experiences time-varying Rayleigh fading and follows a flat fading model.

We denote ζ_δ^β as the channel gain during epoch δ between the device and an EN β, which is assumed static within an epoch and independently drawn from a finite state space ζ. ρ_δ^η is the transmit power, with maximum limit ρ_max^η, and Ψ is the power of interference plus noise. The transmission rate θ of the user device is calculated as follows:

$$\theta = W \log_2\!\bigl(1 + \zeta_\delta^{\beta}\,\rho_\delta^{\eta}/\Psi\bigr) \tag{1}$$
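As a concrete illustration of Eq. (1), the following minimal Python sketch (Python being the language used for the simulations in Section V) computes the uplink rate of one device. The numeric values for bandwidth, channel gain, transmit power, and interference-plus-noise power are illustrative assumptions, not parameters reported in the paper.

```python
import math

def transmission_rate(bandwidth_hz, channel_gain, tx_power, interference_noise):
    """Shannon-type uplink rate of Eq. (1): theta = W * log2(1 + zeta * rho / Psi)."""
    return bandwidth_hz * math.log2(1.0 + channel_gain * tx_power / interference_noise)

# Illustrative values only (not taken from the paper's experiments).
W = 1e6        # W: allocated spectrum, 1 MHz
zeta = 0.8     # zeta_delta^beta: channel gain in the current epoch
rho = 0.5      # rho_delta^eta: transmit power (W), below rho_max
Psi = 1e-3     # Psi: interference-plus-noise power (W)

theta = transmission_rate(W, zeta, rho, Psi)   # bits per second
print(f"uplink rate: {theta / 1e6:.2f} Mbit/s")
```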


B. COMPUTING MODEL

We assume that every computing node in the system can be virtualized and that the application running environment image of each user device can be found in the cloud. Each level 1 compute node has an IP address and a table of the remaining computing resources of the other level 1 nodes connected to it and of the level 2 computing node α of the upper level. Within a certain period of time δ, the node connects to the surrounding nodes and one level 2 node α. The level 1 node β broadcasts its own remaining computing resources χ_β = (C_β, S_β) and accepts the remaining-resource broadcasts χ_β′ from other nodes, updating its local resource table based on the broadcast content.

We treat γ_β as a set of service requesters, where each requester γ belongs to γ_β, connects to the nearest base station β according to its signal strength, and then sends a request to the compute node β where the base station is located, where R_β^γ = (D_s, T_l, C_r, S_r) is the computing request sent from user γ to node β. D_s includes the task data and the globally unique identifier (GUID) of the image file, T_l is the computational delay limit at node β, C_r is the required computing resource size, and S_r is the required storage resource size.

After node β receives the task request R_β^γ, it first checks its remaining computing resources χ_β = (C_β, S_β). If the remaining resources satisfy the user's requirement, C_r < C_β and S_r < S_β, the service is started. During the service start phase, node β searches its local cache for the image resource required by the user's task. If the image resource is in the local cache, the image is loaded from the local cache and the computing service is started. If there is no cached image locally, the node downloads the image file from the cloud platform to node β and then starts the computing service. When the calculation ends, node β returns the computing result to the user device, completing the computing task of this period; it then waits for the user's task of the next period, or the user ends the computing command and pays for the task R_β^γ.

If the remaining computing resources χ_β at node β cannot satisfy the user's requirement, i.e., C_r > C_β or S_r > S_β, the service cannot be started locally, and node β forwards the user's request to a node in its resource table that meets the requirement. If that node x accepts the computing task, the user's computing service is completed by node x, with base station β acting as a relay.

If no node in the resource table meets the user's requirements, the length of the queue determines whether the task is placed in the task queue or offloaded to the cloud. If the queue at node β is not full, φ_β < φ_β^max, the computing task is placed in the local task queue, which is a first-in-first-out (FIFO) queue; if the task queue at node β is full, φ_β = φ_β^max, the computing task is offloaded to the cloud. In summary, the offloading policy ω_β^γ for the user computing task R_β^γ is:

$$\omega_\beta^{\gamma}=\begin{cases}0, & \text{if } C_r<C_\beta,\ S_r<S_\beta \text{ and } \phi_\beta=0 \quad (\text{node } \beta);\\ 1, & \text{if } C_r<C_x,\ S_r<S_x \text{ and } \phi_x=0 \quad (\text{node } x);\\ 2, & \text{if } \phi_\beta<\phi_\beta^{\max} \quad (\text{queue } \phi_\beta);\\ 3, & \text{offload task } R_\beta^{\gamma} \text{ to the cloud.}\end{cases} \tag{2}$$

C. PAYMENT STRATEGY

After the user's computing request R_β^γ is completed by node β and the computing result is returned to the user device, the user device pays node β according to the completion delay λ_δ^β. If the computation is completed within the time limit T_l, the user pays according to the actual delay; if the edge node exceeds the time limit, the user does not pay; if the task fails, the node incurs a penalty η:

$$z_\beta^{\gamma}=\begin{cases}\pi\,\lambda_\delta^{\beta}, & \text{if } \lambda_\delta^{\beta}<T_l,\ \pi \text{ is the price of the edge};\\ 0, & \text{if } \lambda_\delta^{\beta}>T_l;\\ \eta, & \text{if task } R_\beta^{\gamma} \text{ failed, with } \eta<0.\end{cases} \tag{3}$$

The delay λ_δ^β in this paper is defined as the time interval between when the user equipment initiates the computing request and when the device receives the node's computing result. If the computing task is placed at the base station β to which the user device is connected, the computing delay λ_δ^β can be expressed as:

$$\lambda_\delta^{\beta}=\sigma_\gamma+\sigma_{r'}+h_\beta+\sigma_\beta \tag{4}$$

where σ_γ is the time spent transmitting the task R_β^γ and σ_{r'} is the time it takes to transmit the result back. h_β is the time taken by computing node β to switch to the computing task at the base station, which is a small fixed value.

$$\sigma_\gamma=D_s/\theta \tag{5}$$

σ_β is the time required for the computing node to complete the computing task. It is usually related to the CPU frequency f_cpu, the size of the CPU cache c_cpu, and the data size D_s of the task:

$$\sigma_\beta=D_s/(f_{cpu}\, c_{cpu}) \tag{6}$$

If the computing task is placed at a neighboring base station β′, the delay λ_δ^{β′} is:

$$\lambda_\delta^{\beta'}=\sigma_\gamma+\sigma_{r'}+h_{\beta'}+\sigma_{\beta'}+2\, d_\beta^{\beta'}/c \tag{7}$$

where d_β^{β′} is the fixed distance between base station β and the neighbor node β′, and c is the speed of light. Finally, our objective function is:

$$\max \sum_{\beta\in\{\beta,\alpha\}}\sum_{\delta} z_\beta^{\gamma} \qquad \text{s.t. } \forall\gamma\in\gamma,\ \forall\beta\in\{\beta,\alpha\},\quad C_r\ge 0,\ S_r\ge 0 \tag{8}$$
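To make the computing and payment models of Section III concrete, the sketch below strings together the offloading policy (2), the delay models (4)-(7), and the payment rule (3) for a single request. It is a minimal illustration under assumed units; the Node dataclass, the helper names, and all numeric defaults are ours, not part of the paper.

```python
from dataclasses import dataclass

C_LIGHT = 3e8  # speed of light (m/s), used in Eq. (7)

@dataclass
class Node:
    cpu: float       # remaining computing resource C
    storage: float   # remaining storage resource S
    f_cpu: float     # CPU frequency (Hz)
    c_cpu: float     # CPU cache factor used in Eq. (6)
    queue: int       # current queue length phi
    queue_max: int   # phi_max

def offload_decision(req_c, req_s, local, neighbor):
    """Offloading policy of Eq. (2): 0 = local node, 1 = neighbor node x, 2 = local queue, 3 = cloud."""
    if req_c < local.cpu and req_s < local.storage and local.queue == 0:
        return 0
    if req_c < neighbor.cpu and req_s < neighbor.storage and neighbor.queue == 0:
        return 1
    if local.queue < local.queue_max:
        return 2
    return 3

def local_delay(data_size, rate, node, sigma_result=0.0, h_switch=0.001):
    """Eq. (4)-(6): uplink time + result-return time + switching time + computing time."""
    sigma_gamma = data_size / rate                       # Eq. (5)
    sigma_beta = data_size / (node.f_cpu * node.c_cpu)   # Eq. (6)
    return sigma_gamma + sigma_result + h_switch + sigma_beta

def neighbor_delay(data_size, rate, node, distance_m, sigma_result=0.0, h_switch=0.001):
    """Eq. (7): the same terms evaluated at node beta' plus the round-trip propagation 2*d/c."""
    return local_delay(data_size, rate, node, sigma_result, h_switch) + 2.0 * distance_m / C_LIGHT

def payment(delay, deadline, price=1.0, penalty=-30.0, failed=False):
    """Payment rule of Eq. (3): pay by delay if on time, nothing if late, a penalty eta < 0 if failed."""
    if failed:
        return penalty
    return price * delay if delay < deadline else 0.0
```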


IV. COLLABORATIVE COMPUTING STRATEGY BASED ON DOUBLE DEEP Q-LEARNING

To aid understanding of the DDQN agent, we briefly introduce DDQN in this section, starting with reinforcement learning. Reinforcement learning is an important branch of machine learning; agents in reinforcement learning can learn the actions that maximize the return through interaction with the environment. Unlike supervised learning, reinforcement learning does not learn from provided samples. Instead, the agent acts and learns from its own experience in an uncertain environment.

Algorithm 1 Collaborative Computing Framework
1: Initialize:
2: Each edge node loads the DDQN model as its agent;
3: Compute node β broadcasts its remaining computing power χ_β = (C_β, S_β) to other nodes;
4: Node β receives the remaining-computing-power broadcasts and adds them to the resource table C;
5: If there is a computing task R_β^γ = (D_s, T_l, C_r, S_r):
6: The agent takes the node β status s and the task R_β^γ as input, s = C;
7: Generate a decision policy according to action a;
8: Switch(a):
9: case 0: Immediately allocate computing resources and perform the computing task;
10: case 1: Put the task into the local task queue φ and wait to allocate computing resources;
11: case 2: Send the task to a node in the computing resource table;
12: case 3: Send the task to the cloud computing platform;
13: Return: Send the results to the user device;
14: The user pays the node according to the computing delay and the amount of task resources allocated.

Reinforcement learning has two salient features: trial and error, and delayed rewards. Trial and error means weighing the trade-off between exploration and exploitation. Agents will try effective actions that generated rewards in past experience, but in order to obtain higher returns there is also a certain probability of exploring new actions. Agents must take a variety of actions and gradually learn the best of them. Another feature of reinforcement learning is that agents should take a global view, considering not only immediate rewards but also long-term cumulative rewards, which are specified by the reward function.

Model-free reinforcement learning has been successfully combined with deep neural networks as value-function approximators [4]. It can directly use the raw state representation as the neural network input to learn policies for difficult tasks [5]. Q-learning is a model-free reinforcement learning algorithm. The most important component of the Q-learning algorithm is a method for correctly and effectively estimating the Q value. Q-functions can be implemented simply by look-up tables or by function approximators, sometimes nonlinear approximators such as neural networks or even deeper networks. Combining Q-learning with deep neural networks gives the so-called deep Q-network (DQN). The formula for Q-learning is:

$$Q^{\pi}(s,a)=\mathbb{E}\bigl[R_1+\gamma R_2+\cdots \mid S_0=s,\ A_0=a,\ \pi\bigr] \tag{9}$$

The parameter update formula is:

$$\theta_{t+1}=\theta_t+\alpha\bigl(Y_t^{Q}-Q(S_t,A_t;\theta_t)\bigr)\nabla_{\theta_t}Q(S_t,A_t;\theta_t) \tag{10}$$

in which Y_t^Q is defined as:

$$Y_t^{Q}=R_{t+1}+\gamma\,\max_a Q(S_{t+1},a;\theta_t) \tag{11}$$

The target of deep Q-learning is:

$$Y_t^{DQN}=R_{t+1}+\gamma\,\max_a Q(S_{t+1},a;\theta_t') \tag{12}$$

FIGURE 2. Decision Agent Based on DDQN.

The improved DQN is double deep Q-learning; the DDQN-based agent is shown in Figure 2. In conventional DQN, selecting an action and evaluating the selected action both use the same maximum Q value, which results in an overly optimistic estimate of the Q value. In order to alleviate the overestimation problem, the target value in DDQN is designed and updated as

$$Y_t^{Q}=R_{t+1}+\gamma\,Q\bigl(S_{t+1},\ \arg\max_a Q(S_{t+1},a;\theta_t);\ \theta_t\bigr) \tag{13}$$

The target used in the DDQN error function is rewritten as:

$$Y_t^{DoubleQ}=R_{t+1}+\gamma\,Q\bigl(S_{t+1},\ \arg\max_a Q(S_{t+1},a;\theta_t);\ \theta_t'\bigr) \tag{14}$$

Here, the action selection is separated from the generation of the target Q value. This simple technique significantly reduces the overestimation and makes the training process run faster and more reliably.
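A minimal NumPy sketch of the targets in (12) and (14) is given below; q_online_next and q_target_next stand for the outputs of the main and target networks on the next state, and the batch values are invented for illustration only.

```python
import numpy as np

def dqn_target(reward, q_target_next, gamma=0.99):
    """Eq. (12): bootstrap with the maximum Q-value of the target network."""
    return reward + gamma * np.max(q_target_next, axis=1)

def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99):
    """Eq. (14): select the action with the online network, evaluate it with the target network."""
    best_actions = np.argmax(q_online_next, axis=1)
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]
    return reward + gamma * evaluated

# Illustrative batch of 3 transitions with 4 possible actions.
rewards = np.array([1.0, -30.0, 2.5])
q_online_next = np.random.randn(3, 4)   # Q(s', a; theta) from the main network
q_target_next = np.random.randn(3, 4)   # Q(s', a; theta^-) from the target network
print(dqn_target(rewards, q_target_next))
print(double_dqn_target(rewards, q_online_next, q_target_next))
```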


Algorithm 2 Collaborative Edge Computing Strategy Based on Double Deep Q-Learning
1: Initialization:
2: Initialize the replay memory R and the memory capacity M;
3: Main deep-Q network with random weights θ;
4: Target deep-Q network with weights θ⁻ = θ;
5: For epoch i in I:
6: Input the system state s into the main Q-network;
7: Compute the Q-value Q(s, a; θ);
8: Input the next system state s′ into the main Q-network;
9: Compute the Q-value Q(s′, a; θ);
10: Input the next system state s′ into the target Q-network;
11: Compute the Q-value Q(s′, a; θ⁻);
12: Compute the target Q-value
13: Y = p(s, a) + γ Q(s′, argmax_a Q(s′, a; θ); θ⁻);
14: Output: action a;
15: Record the changed status s″ and the reward z after action a to memory R;
16: End For
17: Save model.
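The fragment below sketches one training update in the spirit of Algorithm 2, using Keras (listed later among the experiment dependencies). The network architecture and the state/action dimensions are illustrative assumptions; the replay capacity of 200, the learning rate of 0.001, and the 100-step target-network refresh follow the values reported in Section V.

```python
import random
from collections import deque

import numpy as np
from tensorflow import keras

STATE_DIM, N_ACTIONS = 8, 4          # assumed sizes of the resource-table state and the action set
GAMMA, BATCH = 0.99, 32              # illustrative discount factor and mini-batch size

def build_q_network():
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(N_ACTIONS, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model

main_net = build_q_network()                   # theta
target_net = build_q_network()                 # theta^-
target_net.set_weights(main_net.get_weights())
replay = deque(maxlen=200)                     # replay memory R with capacity M = 200
# After every interaction: replay.append((state, action, reward, next_state))

def train_step():
    """One double-DQN update over a sampled mini-batch (steps 6-13 of Algorithm 2)."""
    if len(replay) < BATCH:
        return
    batch = random.sample(list(replay), BATCH)
    s, a, r, s2 = (np.array(x) for x in zip(*batch))
    q_online_next = main_net.predict(s2, verbose=0)     # Q(s', a; theta)
    q_target_next = target_net.predict(s2, verbose=0)   # Q(s', a; theta^-)
    best = np.argmax(q_online_next, axis=1)             # action selection by the main network
    y = main_net.predict(s, verbose=0)
    y[np.arange(BATCH), a] = r + GAMMA * q_target_next[np.arange(BATCH), best]
    main_net.train_on_batch(s, y)
    # Every 100 training steps the target network is refreshed:
    # target_net.set_weights(main_net.get_weights())
```

The epsilon-greedy exploration and the environment interaction itself are omitted here; this only illustrates how the separated action selection and evaluation of Eq. (14) appear inside the training step.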

FIGURE 5. The number of failed tasks changes during agent training.

FIGURE 3. Rewards obtained by the agents during training.

V. SIMULATION RESULTS

A. EXPERIMENT SETUP

This paper uses a simulation experiment method: user devices and edge computing nodes are instantiated for simulation through Python programming. The operating system used in the experiment was CentOS 7, the processor was an Intel E5-2650 V4, the storage was a 480 GB SSD plus a 4 TB enterprise hard disk, and the memory was 32 GB. The code interpreter is Python 3.6, and the code runtime dependencies include TensorFlow, Keras, NumPy, SciPy, Matplotlib, CUDA, etc.

The experimental data include the computational tasks that users offload to the edge node in time period i, which are randomly generated by calling the Bernoulli and Poisson functions in the SciPy library. The experiment assumes that the user device has the ability to connect to the network and can offload computing tasks and receive computing results.

Experiments in this paper compared double deep Q-learning (DDQN), deep Q-learning (DQN), dueling deep Q-learning, and natural Q-learning. The learning rate is set to 0.001, the replay memory size is 200, and the total number of training steps is 12000; the neural network update cycle was set to every 100 steps. The penalty for task failure is −30. After a task is successfully completed, the agent receives a reward whose size is closely related to the time the task took to complete, the time the task took to transmit, and the total time spent offloading the computation. In the experiment, the wireless channel was set to 6 different levels.

B. RESULT ANALYSIS

The reward obtained by the agents in the experiment is shown in Figure 3. In the period when training started, the neural network weights had just been initialized and the agent could not give good offloading decisions. At this stage decisions were generated randomly, so the computation of the received offloaded tasks often failed and was punished, and the total reward was negative.

However, as training proceeds and the neural networks are updated, the agent learns to make correct offloading decisions and obtains more rewards than punishments, so the total reward continues to increase with training. It is also evident that agents based on DDQN obtain more rewards during training.

In order to show more intuitively the difference in rewards obtained by the different neural network agents, the zoomed-in total reward curves are shown in Figure 4.

FIGURE 4. Detail of rewards obtained by agents (6000-12000 steps).


Figure 5 shows the change in the number of task failures during the training process. At the beginning of training, task computations failed frequently and the total number of task failures increased rapidly. As training progressed, the rate of increase of task failures decreased and eventually reached a stable level. Observation shows that the DDQN-based agent has fewer task failures during the training process.

In order to show more intuitively the difference in the number of failed tasks of the different neural network agents, the zoomed-in curves are shown in Figure 6; the number of failures of the DDQN-based agent is always lower than that of the other reinforcement learning agents.

FIGURE 6. Detailed changes in the number of failed tasks (6000-12000 steps).
FIGURE 7. Agent training loss.

Figure 7 shows the change of the loss during training. The loss function of DDQN is defined as the square of the difference between the estimated Q value and the target value; the loss is recorded every 6 training steps. The training loss of DDQN is lower than that of the other reinforcement learning methods at the beginning. As the weights of the neural network are updated, the loss becomes smaller and smaller, while the training losses of the several other reinforcement learning agents still fluctuate.

FIGURE 8. Distribution map of transmission time when tasks are offloaded.

The user data transmission times recorded in the experiment are shown in Figure 8. On the whole, the propagation time of user data is roughly normally distributed, but it is irregular. In the experiment, some users have more data but a poor network channel, so the long transmission time leads to high delay, while other users have less data and a better network, so the transmission time is shorter.

FIGURE 9. Value changes in agents.

Figure 9 shows the values calculated by the value function in the agents. At the beginning of training the value cannot be estimated well, but as training progresses the value estimate continues to approach the true level and eventually stabilizes.

VI. CONCLUSION AND FUTURE WORK

In this paper, we considered the bandwidth, computing, and cache resources of the ENs and, benefiting from the powerful learning and decision-making ability of deep learning, maximized the profit of the edge while satisfying the low-latency computing requirements of users at the edge.


In addition, we also considered the horizontal and vertical coordination of caching and computing at the edge, which gives a degree of adaptability and can fully coordinate the computing resources at the edge to maximize their value. However, the DDQN on the EN has a long training period and its effect is initially unstable; it needs to train for a while before it makes good decisions. In addition, when multiple ENs in the same group perform collaborative computing, we have not studied the prioritization of computing resources or a computing-resource bidding strategy. Future work will focus on competitive bidding and allocation priorities. The security of users at the edge is also a focus of future research.

REFERENCES
[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, ''A view of cloud computing,'' Int. J. Networked Distrib. Comput., vol. 53, no. 4, pp. 50–58, 2013.
[2] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, ''Edge computing: Vision and challenges,'' IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016.
[3] M. Satyanarayanan, ''The emergence of edge computing,'' Computer, vol. 50, no. 1, pp. 30–39, Jan. 2017.
[4] H. H. Van, A. Guez, and D. Silver, ''Deep reinforcement learning with double Q-learning,'' in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 1–13.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, ''Human-level control through deep reinforcement learning,'' Nature, vol. 518, no. 7540, p. 529, 2015.
[6] Y. C. Hu, M. Patel, D. Sabella, N. Sprecher, and V. Young, ''Mobile edge computing: A key technology towards 5G,'' ETSI White Paper, vol. 11, no. 11, pp. 1–16, Sep. 2015.
[7] J. Zheng, Y. Cai, Y. Wu, and X. S. Shen, ''Stochastic computation offloading game for mobile cloud computing,'' in Proc. IEEE/CIC Int. Conf. Commun. China (ICCC), Jul. 2016, pp. 1–6.
[8] X. Chen, L. Jiao, W. Li, and X. Fu, ''Efficient multi-user computation offloading for mobile-edge cloud computing,'' IEEE/ACM Trans. Netw., vol. 24, no. 5, pp. 2795–2808, Oct. 2016.
[9] T. Quang Dinh, J. Tang, Q. Duy La, and T. Q. S. Quek, ''Offloading in mobile edge computing: Task allocation and computational frequency scaling,'' IEEE Trans. Commun., vol. 65, no. 8, pp. 3571–3584, Aug. 2017.
[10] Y. Mao, J. Zhang, and K. B. Letaief, ''Dynamic computation offloading for mobile-edge computing with energy harvesting devices,'' IEEE J. Sel. Areas Commun., vol. 34, no. 12, pp. 3590–3605, Dec. 2016.
[11] H. Sheng, Y. Zheng, W. Ke, D. Yu, X. Cheng, W. Lyu, and Z. Xiong, ''Mining hard samples globally and efficiently for person re-identification,'' IEEE Internet Things J., early access, Mar. 13, 2020, doi: 10.1109/JIOT.2020.2980549.
[12] Z. Zhou, Q. Wu, and X. Chen, ''Online orchestration of cross-edge service function chaining for cost-efficient edge computing,'' IEEE J. Sel. Areas Commun., vol. 37, no. 8, pp. 1866–1880, Aug. 2019, doi: 10.1109/JSAC.2019.2927070.
[13] X. Wang, X. Li, S. Pack, Z. Han, and V. C. M. Leung, ''STCS: Spatial-temporal collaborative sampling in flow-aware software defined networks,'' IEEE J. Sel. Areas Commun., vol. 38, no. 6, pp. 999–1013, Jun. 2020, doi: 10.1109/JSAC.2020.2986688.
[14] Y. Zhang, D. Niyato, and P. Wang, ''Offloading in mobile cloudlet systems with intermittent connectivity,'' IEEE Trans. Mobile Comput., vol. 14, no. 12, pp. 2516–2529, Dec. 2015.
[15] S. Wang, R. Urgaonkar, M. Zafer, T. He, K. Chan, and K. K. Leung, ''Dynamic service migration in mobile edge-clouds,'' in Proc. IFIP Netw. Conf. (IFIP Netw.), May 2015, pp. 1–9.
[16] J. Li, H. Gao, T. Lv, and Y. Lu, ''Deep reinforcement learning based computation offloading and resource allocation for MEC,'' in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Apr. 2018, pp. 1–6.
[17] T. Yang, Y. Hu, M. C. Gursoy, A. Schmeink, and R. Mathar, ''Deep reinforcement learning based resource allocation in low latency edge computing networks,'' in Proc. 15th Int. Symp. Wireless Commun. Syst. (ISWCS), Aug. 2018, pp. 1–5.
[18] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, ''In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning,'' IEEE Netw., vol. 33, no. 5, pp. 156–165, Sep. 2019.
[19] J. Ren, H. Wang, T. Hou, S. Zheng, and C. Tang, ''Federated learning-based computation offloading optimization in edge computing-supported Internet of Things,'' IEEE Access, vol. 7, pp. 69194–69201, 2019.
[20] D. Liu, X. Chen, Z. Zhou, and Q. Ling, ''HierTrain: Fast hierarchical edge AI learning with hybrid parallelism in mobile-edge-cloud computing,'' IEEE Open J. Commun. Soc., vol. 1, pp. 634–645, 2020, doi: 10.1109/OJCOMS.2020.2994737.
[21] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, ''Edge intelligence: Paving the last mile of artificial intelligence with edge computing,'' Proc. IEEE, vol. 107, no. 8, pp. 1738–1762, Aug. 2019, doi: 10.1109/JPROC.2019.2918951.
[22] X. Wang, Y. Han, V. C. M. Leung, D. Niyato, X. Yan, and X. Chen, ''Convergence of edge computing and deep learning: A comprehensive survey,'' IEEE Commun. Surveys Tuts., vol. 22, no. 2, pp. 869–904, 2nd Quart., 2020, doi: 10.1109/COMST.2020.2970550.
[23] X. Wang, C. Wang, X. Li, V. C. M. Leung, and T. Taleb, ''Federated deep reinforcement learning for Internet of Things with decentralized cooperative edge caching,'' IEEE Internet Things J., early access, Apr. 9, 2020, doi: 10.1109/JIOT.2020.2986803.
[24] S. Shen, Y. Han, X. Wang, and Y. Wang, ''Computation offloading with multiple agents in edge-computing-supported IoT,'' ACM Trans. Sensor Netw., vol. 16, no. 1, pp. 1–27, Feb. 2020, doi: 10.1145/3372025.
[25] Q. Wu, K. He, and X. Chen, ''Personalized federated learning for intelligent IoT applications: A cloud-edge based framework,'' IEEE Open J. Comput. Soc., vol. 1, pp. 35–44, 2020, doi: 10.1109/OJCS.2020.2993259.

JIANJI REN received the B.S. degree from the Department of Mathematics, Jinan University, in 2005, and the M.S. and Ph.D. degrees from the School of Computer Science and Engineering, Dong-A University, in 2007 and 2010, respectively. He is currently an Associate Professor with the College of Computer Science and Technology, Henan Polytechnic University. His current research interests include mobile content-centric networks and collaborative caching in edge computing.

HAICHAO WANG received the B.S. degree in natural geography and resource environment from Henan Polytechnic University, Jiaozuo, Henan, China, in 2018, where he is currently pursuing the master's degree in software engineering with the College of Computer Science and Technology (Software College). His research interests include edge computing, edge caching, big data analysis, deep learning, and the Internet of Things technology.

TINGTING HOU is currently pursuing the B.S. degree with the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan, China. Her current major interests include edge computing, edge caching, deep learning, big data analysis, and the Internet of Things technology.


SHUAI ZHENG is currently pursuing the B.S. degree with the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan, China. His current major interests include edge computing, edge caching, deep learning, big data analysis, and data mining.

CHAOSHENG TANG received the Ph.D. degree in management science and engineering from Yanshan University, Qinhuangdao, Hebei, China, in 2015. His major interests include machine learning, complexity theory, multimedia applications, and online social networks.