Resource_Allocation_for_Edge_Computing_in_IoT_Networks_via_Reinforcement_Learning
Resource_Allocation_for_Edge_Computing_in_IoT_Networks_via_Reinforcement_Learning
Abstract—In this paper, we consider resource allocation for edge devices, i.e., a edge server, is finite. Therefore, it cannot support
computing in internet of things (IoT) networks. Specifically, each the massive computation tasks from all the end devices in
end device is considered as an agent, which makes its decisions on its coverage area. Furthermore, offloading computation tasks
whether offloading the computation tasks to the edge devices or
not. To minimize the long-term weighted sum cost which includes of those end devices requires abundant spectrum resources or
the power consumption and the task execution latency, we consider it might bring about the congestion of wireless channels [5].
the channel conditions between the end devices and the gateway, Therefore, resource allocation, such as computation capacity,
the computation task queue as well as the remaining computation power and spectrum resource allocation, is quite important
resource of the end devices as the network states. The problem for such types of resource-constrained networks. Dynamic
of making a series of decisions at the end devices is modelled
as a Markov decision process and solved by the reinforcement computation tasks offloading scheme, i.e., the task is executed
learning approach. Therefore, we propose a near optimal task at a local end device or edge server, is an effective method.
offloading algorithm based on ϵ-greedy Q-learning. Simulations It has been mostly discussed in the context of mobile edge
validate the feasibility of our proposed algorithm, which achieves computing (MEC), in which the mobile user makes a binary
a better trade-off between the power consumption and the task decision to either offload the computation tasks to the edge
execution latency compared to these of edge computing and local
computing modes. device or not [6].
Some research work has proposed optimal computation task
I. I NTRODUCTION offloading schemes by minimizing the energy consumption or
The phenomenon of the increasing number of end devices, task execution latency in the network. Most of them have
such as sensors and actuators etc., has caused an exponential adopted the conventional optimization methods to solve the
growth of requirements for data processing, storage and com- formulated optimization problem, like Lyapunov optimization
munications. A cloud platform has been proposed to connect and convex optimization techniques [7]. However, these opti-
a large number of internet of things (IoT) devices, and a mization techniques can construct an approximately optimal
massive amount of data generated by those devices can be solution only. Note that designing the computation task of-
offloaded to a cloud server for further processing [1]. The floading scheme can be modeled as a Markov decision pro-
cloud server generally has an infinite ability of computation cess (MDP). Reinforcement learning has been adopted as an
and storage, however, it is physically and/or logically far from effective method to solve this optimization problem without
its clients, implying that offloading big data to the cloud server requiring the priori knowledge of environment statistics [8].
is inefficient due to intensive bandwidth requirements. More- But the explosion of the state and action space makes the
over, it cannot satisfy the ultra-low latency requirements for conventional reinforcement learning algorithm inefficient and
time-sensitive applications and provide location-aware services. even infeasible. Deep reinforcement learning approaches, such
Edge computing has been proposed to address this problem by as deep Q-network (DQN) has been proposed to explore the
moving data processing to the edge computing devices, such optimal policy by solving the aforementioned optimization
as devices with computing capacity (e.g., desktop PCs, tablets problem [9, 10].
and smart phones), data centers (e.g., IoT gateway) and devices The increase of computation capacity at edge devices con-
with virtualization capacity, which are closer to end devices, tributes to a new research area, called edge learning, which
and then a distributed data processing network is implemented crosses and revolutionizes two disciplines: wireless commu-
[1, 2]. The edge is not located on the IoT devices but as close nication and machine learning [11, 12]. Edge learning can be
as one hop to them, or even more than one hop away from accomplished by leveraging MEC platform. Deep reinforce-
them. ment learning (DRL) is an effective method to design the
Compared with the cloud server, edge devices can support computation task offloading policy in wireless powered MEC
latency-critical services and a variety of IoT applications. The networks by considering the time-varying channel qualities,
end devices are in general resource-constrained, for instance, harvested energy units and task arrivals [9]. [10] has designed
the battery capacity and local CPU computation capacity are an offloading policy for mobile user to minimize its monetary
limited [3]. Offloading computation tasks to relatively resource- cost and energy consumption by implementing the DQN-based
rich edge devices can meet the quality of service (QoS) offloading algorithm. In [13], an In-Edge artificial intelligence
requirements of applications as well as augment the capabilities has been evaluated and could achieve near-optimal performance
of end devices for running resource-demanding applications by investigating the scenarios of edge caching and computation
[4]. However, in practice, the computation capacity of edge offloading in MEC systems. Moreover, DRL could also achieve
978-1-5386-8088-9/19/$31.00
Authorized licensed use©2019
limitedIEEE
to: REVA UNIVERSITY. Downloaded on November 28,2024 at 06:09:28 UTC from IEEE Xplore. Restrictions apply.
a good performance by developing a decentralized resource
allocation mechanism for vehicle-to-vehicle communications
[14]. Deep learning achieves excellent performance with large
amount of data generated by IoT applications [15].
In the previous works, the task execution latency and the
power consumption have rarely been considered together when
designing the optimal computation task offloading scheme. [16]
has optimized the task offloading scheduling by minimizing
the weighted sum of execution delay and end device energy Gateway with edge server
consumption with conventional optimization tools. Encouraged « «
by [16], we formulate a task offloading problem with its
Data size
objective function including not only the cost in [16], but T1 T3
TT −1 TT
also the power consumption of the edge device. Specifically, T2 T4 u3
« u1 u4 « uU
we are proposing to use reinforcement learning techniques to u2
Computation tasks
solve this problem. This approach has been put in use recently
to solve the task offloading problem while only considering : Offload to cloud
: Local computing device
either the execution delay or the energy consumption as the : Offload to edge server
negative reward function [9, 10]. Moreover, we will discuss : Edge computing device
: Computation tasks producing
the remaining computation resource of the end device since
it affects the decision making on task offloading when it is run Fig. 1. Computation tasks offloading model in IoT networks.
out. The major contribution of this paper is as follows:
1) We first consider resource allocation in IoT networks with
edge computing to design a task offloading scheme for picking from G possible values. We use a finite-state discrete
IoT devices. We formulate the weighted sum cost min- time Markov chain to model the channel gain state transition
imization problem with its objective function including over time epochs. Assuming each end device executes a lot of
the task execution latency and the power consumption of independent computation tasks, these tasks are in different sizes
both the edge device and the end device. and need to be processed with different CPU cycles. Then we
2) We solve this optimization problem with reinforcement denote the task queue at the end device as T = {T1 , ..., Tmax },
learning technique. And then we propose the near optimal where Tmax is the maximum number of tasks that can be stored
task offloading algorithm based on ϵ-greedy Q-learning. at the end device. The task arrival is assumed to be I = {0, 1},
3) Numerical results show that our proposed task offloading where I = 1 indicates there is one task generated with its task
algorithm achieves a better trade-off between the power size randomly picked from M = {m1 , ..., mM }, otherwise,
consumption and the task execution latency compared to there is no task arrived at current time epoch.
the other two baseline computing modes. From Fig. 1, computation tasks can be either executed at
the end device or offloaded to the gateway and executed at the
II. S YSTEM M ODEL edge server. In each IoT network, some end devices execute
As shown in Fig. 1, we consider an IoT network with many the computation tasks locally, while others offload their tasks
end devices (i.e., IoT devices) and a gateway (i.e., the edge to the gateway in the same time epoch. At the beginning of
device), where the gateway collects data from end devices in each time epoch k, each end device makes its own decisions on
its coverage area and processes them with its equipped edge computation task offloading O = {1}∪{0}∪{−1} and transmit
server. Each end device generates a variety of computation power level Pt = {P1k , ..., Pmax
k
} if the end device decides to
tasks continuously and has limited computation capacity and offload the computation tasks to the edge device. Note that
power, so offloading their tasks to the gateway may improve if the end device decides not to offload the computation task
the computation experience in terms of power consumption when Ok = 0, then the cost only contains the local computation
and task execution latency. We focus on a representative end power consumption and the local task execution latency, and
device making its own decisions on task offloading. We discrete the transmit power is defined as Ptk = 0 in this case. Ok = 1
the time horizon into epochs, with each epoch equalling to indicates that the end device decides to offload computation
duration η and indexed by an integer 0 < k ≤ K, K is the task to the gateway, with the transmit power Ptk ∈ Pt . In both
maximum number of time epochs in each time horizon. The end cases, the computation task is executed successfully, however, if
device operates over common license-free sub-gigahertz radio the computation task transmission suffers from outage between
frequency and the frequency bandwidth is denoted by Bw . We end device and the gateway, the computation task execution
denote end devices in the network as U = {u1 , ..., uU }. The fails and Ok = {−1}.
channel condition between the end device and the gateway is The task execution latency and power consumption are two
assumed to be time-varying. we assume the end device knows critical challenges in edge computing networks, both of them
some stochastic information about the channel condition in depend on the adopted task offloading scheme and transmit
time slot k, which is indicated by the channel gain states power allocation. In this paper, we consider them as the main
G = {g1k , ..., gG
k
} where the channel gain at each time epoch is cost of our considered IoT network. Therefore, we formulate
Authorized licensed use limited to: REVA UNIVERSITY. Downloaded on November 28,2024 at 06:09:28 UTC from IEEE Xplore. Restrictions apply.
an optimization problem to minimize the cost function, the III. P ROPOSED Q- LEARNING BASED R ESOURCE
weighted sum of the task execution latency and the power A LLOCATION FOR E DGE C OMPUTING
consumption. In this section, we formulate our task offloading problem to
A. Local Computing Mode minimize the weighted sum of power consumption and task
execution latency of both the end device and the edge device
We have Ok = 0 if the computation task is executed at local
by optimizing the task offloading decisions, the weight factor
end device. We assume the edge server allocates fixed and equal
and the transmit power of the end device. Since the formulated
CPU resource for each end device, and it is enough for the
problem is non-convex, the conventional algorithm is hard or
computation task to execute in each time frame. Considering
even impossible to solve it, we propose a near optimal task
that during any time epoch k, fd denotes the fixed CPU
offloading algorithm based on ϵ-greedy Q-learning.
frequency of any end device, which presents the number of
CPU cycles required for computing 1-bit of input data. The A. Task Offloading Problem Formulation
power consumption per CPU cycle is denoted by Pd . Then Computation tasks from the end device can be offloaded to
fd Pd indicates computing power consumption per bit at the end the gateway depending on the channel conditions, computation
device. The total power consumption of one computation task task queue and the remaining percentage of the end device’s
k
at the end device in any time epoch k, denoted by Pcd , is given CPU resource. We denote sk = (g k , T k , rdk , ) ∈ S = G × T ×
k k
by Pcd = fd Pd m . Moreover, let Dd denote the computation Rd as the network state of any end device in each time epoch k.
capacity of the end device, which is measured by the number By observing the network state sk at the beginning of each time
of CPU cycles per second. The remaining CPU resource of epoch k, the end device chooses an action ak = (Ok , Ptk ) ∈
the end device in each time epoch is denoted by the remaining A = O × Pt by following a stationary policy π. An agent, e.g.,
percentage of computation resource Rk = {rd1 , rd2 , ..., 1}. The each end device, decides whether to offload the computation
local computing latency Lkd is defined as Lkd = (fd mk )/Dd . task and chooses the transmit power level, and we define a
However, the power consumption and the task execution latency penalty function δ k as the cost when the task transmission fails.
are two contradictory challenges in the edge computing net- Therefore, the cost function is expressed as
work, we cannot reduce them simultaneously, so we are trying
to achieve a good trade-off between them. Then we define the C k = Cloc
k k
+ Cof k k k
f + δ = Pc + βL + δ
k
(4a)
cost function of the local computing mode as = k
Pcd + k
Pcs + Ptk + β(Lks + Lkt + Lkd ) k
+δ . (4b)
k k
Cloc = Pcd + βLkd , (1) In this paper, we propose to design an optimal task offloading
where β indicates the weight factor between power consump- scheme to minimize the long-term cost of the IoT network, that
tion and the task execution latency. is, both the immediate cost and the future cost are included. The
optimization problem is formulated as
B. Offloading Computing Mode
∑
K
We assume end devices adopt the time division multiple (P1) min Ck (5a)
access (TDMA) scheme to transmit their data to the gateway, β, Pt , O
k=1
that is, the interferences from other end devices are negligible s.t. C1 : 0 ≤ β ≤ 1; (5b)
when they are transmitting data over the same time epoch, k.
Let g k denote the channel gain from any end device to the C2 : 0 ≤ Pt ≤ Pmax ; k
(5c)
gateway, which is constant during the offloading time epoch. C3 : O = {0, 1, −1}.
k
(5d)
Ptk indicates the transmit power of the end device, then the
where C1 denotes the value range of weight factor β which
achievable transmission rate (bit/s) is denoted by
balances the power consumption and the task execution latency.
Ptk g k C2 is the transmit power of the end device when it decides
Rk = Bw log2 (1 + ), (2)
σ2 to offload the computation task to the gateway. C3 presents
where Bw and σ 2 indicate the bandwidth and the variance the task execution set. It is easily noticed that P1 is a mixed
of additive white Gaussian noise (AWGN), respectively. Then integer nonlinear programming (MINLP) problem as the integer
the power consumption of the end device caused by the data variable Ok , continuous variable Pt and the discrete variable
transmission is indicated as Ptk , and the transmission latency δ k need to be optimized. It is difficult or impossible to find the
is denoted by Lkt = T k /Rk . Similarly, let fs denote the optimal solution by using conventional optimization techniques.
computation frequency of the edge server, Ps denote the power The conventional algorithm has to decouple the optimization
consumption per CPU cycle at the edge server. Ds indicates the problem into many sub-optimization problems and solves them
computation capacity allocated to each end device. The compu- separately, which is inefficient and complicated, so we explore
tation power of the edge server is given by Pcs k
= fs Ps mk , and the reinforcement learning techniques to address this problem
the computation latency is calculated as Ls = (fs M k )/Ds .
k including multiple optimization variables.
Therefore, we can obtain the cost function of the offloading We consider optimizing the variables together in each time
computing mode, and it is presented as epoch, and denote the objective function as a negative reward
k k k k k
in (4). In addition, the state transition and cost are stochastic
Cof f = Pcs + Pt + β(Ls + Lt ). (3) and can be modelled as a Markov decision process, where
Authorized licensed use limited to: REVA UNIVERSITY. Downloaded on November 28,2024 at 06:09:28 UTC from IEEE Xplore. Restrictions apply.
the state transition probabilities and cost depend only on the where C k is the cost observed for the current state, α is the
environment and the obtained policy. The transition probability learning rate (0 < α ≤ 1). Q learning is an online action-
P = (sk+1 , C k |sk , ak ) is defined as the transition from state sk value function learning with an off-policy, in each time epoch,
to sk+1 with the cost C k when the action ak is taken according we calculate the Q-value in the next step with all the possible
to the policy. Therefore, the long-term expected cost is given actions that it can take, then choose the minimum Q-value and
by record the corresponding action.
∑K
Therefore, the computation task offloading optimization
V (s, π) = Eπ [ γ k C k ], (6) problem P1 is solved by using the Q-learning algorithm, and to
k=1
explore the unknown states instead of trusting the learn values
k
where s = (g , T k
γ ∈ [0, 1] is the discount factor and E
, rdk ), of Q(s, a) completely, the ϵ-greedy approach is used in the
indicates the statistical conditional expectation with transition Q-learning algorithm, where the agent picks a random action
probability P. with small probability ϵ, or with 1 − ϵ it chooses an action
that minimizes the Q(sk+1 , a) as shown in (9) in each time
B. Q-learning Approach epoch. Then a computation task offloading algorithm based on
Generally, the conventional solutions, like policy iteration ϵ-greedy Q-learning is proposed as shown in Algorithm 1.
and value iteration [17], can be used to solve the MDP
optimization problem with a known transition matrix. But it is Algorithm 1 Computation Task Offloading Algorithm based
hard for the agent to know the prior information of the transition on ϵ-greedy Q-Learning
matrix, which is determined by the environment. Therefore, Initialization
a model-free reinforcement learning approach is proposed to Initialize parameters: discount factor γ, learning rate α,
investigate this decision-making problem since the agent cannot exploration rate ϵ.
make predictions about what the next state and cost will be Initialize action-value function Q : S × A
before it takes each action. Initialize states: set g 1 randomly, set T 1 := T , rd1 := Rd .
In (P1 ), each end device is trying to design an optimal task Set k := 1,
offloading scheme according to some statistical information, Procedure
such as the possible channel conditions, the possible remaining 1: while k ≤ K and T > 0 and rd > 0 do
percentage of computation resource and the possible task queue, 2: g k is changed according to a random matrix.
observed from the environment. Particularly, we focus on 3: e ← random number from [0,1]
finding the optimal policy π ∗ that minimizes the cost V (s, π). 4: if e < ϵ then
For any given network state s, the optimal policy π ∗ can be 5: Choose action ak randomly.
obtained by 6: else
π ∗ = arg min V (s, π), ∀s ∈ S. (7) 7: Choose action ak according to arg min Q(sk , ak )
ak ∈A
π
8: end if
The computation task offloading optimization problem at 9: Set sk+1 = (Gk+1
g , T k+1 , rdk+1 ), where
each end device is a classic single-agent finite-horizon MDP T k+1
= T − a + Ik,
k k
with the discounted cost criterion. Then we adopt the clas- rd = rdk − ak (fd mk ).
k+1
sic model-free reinforcement learning approach, Q-learning 10: calculate the cost C k by (4)
algorithm, to explore the optimal task offloading policy by 11: update Q(sk , ak ) by (9)
minimizing the long-term expected accumulated discounted 12: Set k := k + 1
cost, C. We denote the Q-value, Q(s, a), as the expected 13: end while
accumulated discounted cost when taking an action ak ∈ A
following a policy π for a given state-action pair (s, a).
Thus, we define the action-value function Q(s, a) as C. Optimality and Approximation
Q(s, a) = Eπ [C k+1
+γQπ (s k+1
, a k+1 k k
)|s = s, a = a]. (8) The agent in the reinforcement learning algorithm aims to
solve sequential decision making problems by learning an
In our proposed algorithm, Q(s, a) indicates the value calcu- optimal policy. In practice, the requirement for Q-learning to
lated from cost function (4) for any given state s and action obtain the correct convergence is that all the state action pairs
a, it is stored in the Q-table which is built up to save all Q(s, a) continue to be updated. Moreover, if we explore the
the possible accumulative discounted cost. And the Q-value is policy infinitely, Q value Q(s, a) has been validated to converge
updated during the time epoch if the new Q-value is smaller with possibility 1 to Q∗ (s, a) , which is given by
than the current Q-value. The Q(s, a) is updated incrementally
based on the current cost function C k and the discounted Q- lim Pr (|Q∗ (s, a) − Q(s, a)n | ≥ ς) = 0, (10)
n→∞
value Q(sk+1 , a), ∀a ∈ A in the next time epoch.
This is achieved by the one-step Q-update equation where n is the index of the obtained sample, and Q∗ (s, a) is the
optimal Q value while Q(s, a)n is one of the obtained samples.
Q(sk , ak ) ← (1 − α) · Q(sk , ak ) + α(C k + γ · min Q(sk+1 , a)), Therefore, Q-learning can identify an optimal action selection
a
(9) policy based on infinite exploration time and a partly-random
Authorized licensed use limited to: REVA UNIVERSITY. Downloaded on November 28,2024 at 06:09:28 UTC from IEEE Xplore. Restrictions apply.
policy for a finite MDP model. In this paper, we approximate
the state and action space into finite states, and we use Monte- 0.24
Carlo simulation to explore the possible policy, so we can
obtain a near-optimal policy. 0.22
0.2
Authorized licensed use limited to: REVA UNIVERSITY. Downloaded on November 28,2024 at 06:09:28 UTC from IEEE Xplore. Restrictions apply.
two kinds of cost to meet different quality requirements of the
0.15
computation task.
V. C ONCLUSIONS
0.125
Task execution latency (s), L
2017.
[5] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “Mobile edge
1.5 computing: Survey and research outlook,” arXiv preprint arXiv, vol. 1701,
Jan. 2017.
[6] S. Bi and Y. J. Zhang, “Computation rate maximization for wireless
1 powered mobile-edge computing with binary computation offloading,”
IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4177–4190, Apr. 2018.
[7] X. He, H. Xing, Y. Chen, and A. Nallanathan, “Energy-efficient mobile-
edge computation offloading for applications with shared data,” arXiv
0.5 preprint arXiv:1809.00966, Sep. 2018.
[8] J. Xu and S. Ren, “Online learning for offloading and autoscaling in
renewable-powered mobile edge computing,” in Global Telecom. Conf.
(GLOBECOM 2016), Washington, DC USA, Feb. 2016, pp. 1–6.
0 [9] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, “Performance
1 3 5 7 9 11 13 15 optimization in mobile-edge computing via deep reinforcement learning,”
Time epochs, k arXiv preprint arXiv:1804.00514, Mar. 2018.
[10] C. Zhang, Z. Liu, B. Gu, and etc., “A deep reinforcement learning based
Fig. 5. Cumulative weight sum of cost C versus different time epochs k with approach for cost-and energy-aware multi-flow mobile data offloading,”
different weight factors β. IEICE Trans. on Commun., pp. 1625–1634, Jan. 2018.
[11] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Towards an
intelligent edge: Wireless communication meets machine learning,” arXiv
preprint arXiv:1809.00343, Sep. 2018.
be executed more quickly by offloading it to the gateway. Our [12] Y. Du and K. Huang, “Fast analog transmission for high-mobility wireless
proposed scheme achieves the neutral performance because it data acquisition in edge learning,” arXiv preprint arXiv:1807.11250, Jul.
2018.
makes decisions on the task execution based on the current [13] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-Edge
channel qualities and remaining computation resource of the AI: Intelligentizing mobile edge computing, caching and communication
end device. Specially, from Fig. 4, we notice that the local by federated learning,” arXiv preprint arXiv:1809.07857, Sep. 2018.
[14] H. Ye and G. Y. Li, “Deep reinforcement learning for resource allocation
computing curve is finished earlier than the other two curves, in V2V communications,” in IEEE Int. Conf. on Commun. (ICC 2018),
this is because the limited computation resource of the end Kansas City, MO, USA, May 2018, pp. 1–6.
device has run out before the end of the time frame. [15] H. Li, K. Ota, and M. Dong, “Learning iot in edge: deep learning for the
internet of things with edge computing,” IEEE Network, vol. 32, no. 1,
Fig. 5 illustrates the performance of the cumulative cost pp. 96–101, Jan. 2018.
with different weight factors β. It’s observed that the worst [16] Y. Mao, J. Zhang, and K. B. Letaief, “Joint task offloading scheduling and
case happens when β = 0.5 since the cumulative cost, C, is transmit power allocation for mobile-edge computing systems,” in IEEE
Wireless Commun. and Networking Conf. (WCNC 2017), San Francisco,
higher than others, which implies the task execution latency CA, Jan. 2017, pp. 1–6.
and the power consumption contribute different weights to the [17] M. L. Puterman, Markov decision processes: discrete stochastic dynamic
overall cost. It is important to make a trade-off between the programming. John Wiley & Sons, 2014.
Authorized licensed use limited to: REVA UNIVERSITY. Downloaded on November 28,2024 at 06:09:28 UTC from IEEE Xplore. Restrictions apply.