DeepEdge: A New QoE-Based Resource Allocation Framework Using Deep Reinforcement Learning for Future Heterogeneous Edge-IoT Applications

Ismail AlQerm and Jianli Pan

IEEE Transactions on Network and Service Management, Vol. 18, No. 4, December 2021
Abstract—Edge computing is emerging to empower the future of Internet of Things (IoT) applications. However, due to the heterogeneity of applications, it is a significant challenge for the edge cloud to effectively allocate multidimensional limited resources (CPU, memory, storage, bandwidth, etc.) under the constraints of applications' Quality of Service (QoS) requirements. In this paper, we address the resource allocation problem in Edge-IoT systems by developing a novel framework named DeepEdge that allocates resources to heterogeneous IoT applications with the goal of maximizing users' Quality of Experience (QoE). To achieve this goal, we develop a novel QoE model that aligns the heterogeneous requirements of IoT applications to the available edge resources. The alignment is achieved through selection of the QoS requirement range that can be satisfied by the available resources. In addition, we propose a novel two-stage deep reinforcement learning (DRL) scheme that effectively allocates edge resources to serve the IoT applications and maximize the users' QoE. Unlike typical DRL, our scheme exploits deep neural networks (DNN) to improve action exploration by using the DNN to map the Edge-IoT state to a joint action that consists of the resource allocation and the QoS class. The joint action not only maximizes users' QoE and satisfies heterogeneous applications' requirements but also aligns the QoS requirements to the available resources. In addition, we develop a Q-value approximation approach to tackle the large state space problem of Edge-IoT. Further evaluation shows that DeepEdge brings considerable improvements in terms of QoE, latency, and application tasks' success ratio in comparison to the existing resource allocation schemes.

Index Terms—Resource allocation, DeepEdge, Edge-IoT, deep reinforcement learning (DRL), quality of experience (QoE).

Manuscript received April 10, 2021; revised August 23, 2021; accepted October 19, 2021. Date of publication October 29, 2021; date of current version December 9, 2021. The work was supported by National Science Foundation (NSF) CNS core grant No. 1909520. The associate editor coordinating the review of this article and approving it for publication was H. Lutfiyya. (Corresponding author: Ismail AlQerm.) The authors are with the Department of Computer Science, University of Missouri–St. Louis, St. Louis, MO 63121 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNSM.2021.3123959

I. INTRODUCTION

Growing Internet of Things (IoT) applications such as Google Home and Amazon Echo raise the demand for cloud computing platforms for data processing. However, it is very difficult for the existing centralized cloud computing model to scale with the projected large number of IoT devices and ubiquitous applications, due to the large amount of generated data to be sent over the relatively long distance between IoT devices and clouds. Edge or fog computing [1], [2], [3] is considered a potential approach to fulfill these applications' demands by moving more computing, storage, and intelligence resources to the edge, which would benefit IoT applications that are delay-sensitive, bandwidth/data intensive, or that require closer intelligence. We envision a future "Edge-IoT" environment where various IoT applications could use edge computing to fulfill their resource demands and performance requirements. To enable such a vision, there are some significant challenges to overcome. On the one hand, from the demand side, a massive number of IoT devices can run heterogeneous applications with various Quality of Service (QoS) requirements and different priorities. On the other hand, from the supply side, the edge clouds are expected to dynamically allocate multidimensional resources (CPU, storage, and bandwidth) at geospatially distributed points and different levels of the network hierarchy. This severely complicates the required resource allocation and scheduling algorithms. Most of the current edge computing research either focuses on resource allocation without paying attention to the QoS requirements of heterogeneous applications, or optimizes specific operations such as mobile offloading, migration, placement, chaining and orchestration [4], [5], [6].

In this paper, we develop a new Edge-IoT framework named DeepEdge using deep reinforcement learning (DRL) that allocates resources to heterogeneous IoT applications with the goal of maximizing users' Quality of Experience (QoE). Unlike the existing resource allocation schemes in the Edge-IoT research, our proposed DeepEdge framework ensures IoT users' satisfaction with guaranteed heterogeneous applications' QoS and accounts for the dynamic resource availability at the edge in the resource allocation decisions. The paper has the following new contributions that align with DeepEdge goals.

• We develop a novel QoE model that maps the applications' QoS requirements to a cumulative QoE score that reflects the IoT users' satisfaction. The developed QoE model is noteworthy as it supports adjustment of the QoS requirements' acceptable ranges to match the available resources at the edge. In addition, it specifies a certain weight for each QoS performance metric to emphasize its impact on the overall application performance.

• We propose a novel two-stage DRL to fulfill the QoE model objectives by generating joint actions including QoS class selection, which aligns applications' QoS requirements to the available resources, in addition to the resource allocation action. The scheme exploits deep neural networks (DNN) to map the Edge-IoT state information to joint resource allocation actions.
• The proposed DRL tackles the dimensionality problem in the heterogeneous Edge-IoT environment, where the size of the state and action spaces is large. It formulates the Q-value in a compact representation in which it is approximated as a function of a smaller set of variables.
• The proposed DRL scheme tackles the tradeoff between exploration and exploitation encountered in DRL action generation by ranking the actions according to their Q-values, avoiding the equal probability of action selection used in ε-greedy based exploration solutions [48].
The rest of the paper is organized as follows. The related work, its shortcomings, and the motivation for the QoE- and DRL-based resource allocation are presented in Section II. Section III describes the DeepEdge system architecture, system model, and QoE optimization problem formulation. The two-stage DRL-based resource allocation scheme is illustrated in Section IV. Section V presents the performance evaluation, and the paper concludes in Section VI.

II. RELATED WORK AND MOTIVATION

In this section, the related work is discussed. In addition, we present the motivation for developing a QoE model that is backed by DRL for resource allocation.

A. Related Work

The potential benefits of edge computing in different network applications have been studied extensively in the recent literature. A large number of existing works have focused on edge computing either for allocation for specific applications or for optimizing some operations such as offloading, migration, and orchestration [4], [5], [6]. For offloading, many schemes have been proposed to make offloading decisions that optimize energy consumption and delay performance [7], [8], [9], [10], [11]. Some of the proposals targeted allocation of edge resources, for example, the utilization of distributive game-theoretical approaches for resource allocation in "cloud-edge" multi-level networks [12]. The authors in [9] proposed an optimization framework for energy-efficient resource allocation, by assuming that the network operator is aware of the complete information of all users' applications.

DRL has been employed for solving decision-making related problems in the context of edge computing, such as computation offloading [13], [14], [15], [16], management problems in vehicular networks [17], [18], [19], [20], [21], and edge resource allocation [22], [23]. For vehicular networks, DRL has been investigated to solve several problems including resource allocation [24] and computation offloading [25], [26], [27]. For instance, the work in [28] exploited DRL to solve the problem of edge resource management by leveraging hierarchical learning architectures. In [29], the authors proposed a knowledge-driven service offloading decision framework for vehicular networks in which the offloading decision was formulated for multiple tasks as a long-term planning problem solved by DRL. The authors in [30] proposed a resource allocation policy for the Edge-IoT system to improve the efficiency of resource utilization using deep Q-networks (DQN). The work in [31] proposed a DQN-based resource allocation scheme, which can allocate computing and network resources to reduce the average service time. In [32], a joint optimization solution solved by actor-critic DRL was proposed for allocation of resources in fog-enabled IoT systems. The work in [33] proposed a framework for edge offloading based on DRL with latency and power consumption minimization as optimization objectives. Task offloading in a single-user edge computing system was explored in [34], where DRL was exploited to optimize the trade-off between energy consumption and slowdown of tasks in the processing queue. An online computation offloading scheme based on DQN was studied in [35] under random task arrivals. The work in [36] investigated strategies for the allocation of computational resources using DRL in edge computing networks.

Given the related work, none of the existing schemes considered awareness of multiple heterogeneous applications' demands and aligning them with the available resources at the edge. Heterogeneous IoT applications may have different requirements and characteristics. These requirements might not be fulfilled with the available resources at the edge at a certain time instant, given that the edge has limited computing power compared with the cloud, which has virtually unlimited computing power but relatively high latency. The problem of satisfying users' QoE and applications' demands for multiple heterogeneous applications in a dynamic IoT environment, with the ability to adjust the QoS requirements of applications to fit the available resources, is not addressed. None of the proposed DRL schemes for resource allocation considered using a DNN to diversify action generation rather than to approximate the value functions of reinforcement learning. In addition, the related work neither proposed an effective approach to tackle the problem of the large state space in Edge-IoT nor effectively handled the tradeoff between exploration and exploitation in reinforcement learning. A series of typical Edge-IoT applications and their characteristics are summarized in Table I.

TABLE I
EDGE-IOT APPLICATIONS AND THEIR CHARACTERISTICS
achieve user satisfaction and efficient resources utilization in Edge-IoT with multiple heterogeneous applications.

TABLE II
SAMPLE QUALITY SCORE FOR HETEROGENEOUS APPLICATIONS WITH VARIOUS REQUIREMENTS
is the resource allocation for the application. The performance metric weight w is selected to show how sensitive the application is to the corresponding metric. For example, emergency response is more sensitive to latency than to PLR or PER. The QoS class α_ς is set using DRL to maximize Φ and align with the resource availability at the edge. For instance, if the current resource request for a certain application at a certain time instant cannot be fulfilled due to lack of resources at the edge, the application's α_ς will be altered within certain ranges that maintain the application service and fit the available resources. The metric classes indicated in Table II show examples of the metric ranges that correspond to a certain α_ς. The quality score Φ given in Table II shows how the selection of a different class α_ς affects the achieved QoE. All the presented values for metric ranges and Φ are for demonstration of the QoE model functionality and how Φ is influenced by the selected α_ς. Moreover, the values of the metric ranges are tied to the metric weight specified. For instance, the high latency weight in emergency response causes its latency ranges for the different classes to be lower than those of other applications.

Φ is mapped to the following metrics: latency (T), packet loss rate (R_L), and packet error rate (R_E). The latency T is calculated according to the link bandwidth, data size, and the propagation medium. The packet loss rate R_L is evaluated according to [44] as R_L = (MSS · η) / (goodput · RTT), where MSS is the maximum segment size, η is a constant that incorporates the loss model and the acknowledgment strategy, goodput is the ratio of the delivered packets over the delivery completion time, and RTT is the round trip time. The packet error rate R_E is found according to the estimation model in [45], which relies on the link characteristics found using statistics from two distinct types of probing messages. QoE combines user experience and expectation with the edge computing system and network performance. The performance of the edge system is typically evaluated by QoS metrics. Thus, it is necessary to have a qualitative relationship between QoS and QoE to be able to achieve a QoE control mechanism based on QoS with maximum efficacy [46], [47]. To achieve this, we use a generic formula to correlate the variation in QoE with the achieved QoS metrics including latency, loss, and error rates. The QoS metrics are represented by quality scores Φ_T, Φ_{R_L}, and Φ_{R_E} for latency, packet loss rate, and packet error rate, respectively. Each of these scores is obtained based on the application type and the selected metric class α_ς as indicated in Table II. For instance, if the resource allocation action was to select α_ς as 1 for the emergency response application, which corresponds to the best range for all the QoS metrics, the quality score will be 10. The cumulative quality score achieved for each application with a certain amount of resources allocated is calculated as follows,

Φ_ς = Σ_i Σ_j Σ_r x_{r,i,j} (w_1 Φ_T + w_2 Φ_{R_L} + w_3 Φ_{R_E})     (1)

where x_{r,i,j} is the resource allocation indicator with r as a resource type (CPU, memory, etc.), i is the index of the IoT device running the application, j is the index of the edge server providing the resources, and w is the weight of the performance metric. The cumulative quality score captures the impact of each of the QoS metrics on the overall performance. If the QoS metrics are below minimum thresholds, the cumulative quality score Φ_ς will be compromised. The proposed definition of QoE reflects all its impacting parameters including the cumulative normalized score Φ_ς, the metric class (α_ς), and the priority (β_ς). It maps the relationship between Φ_ς and QoE according to the applications' characteristics, since the applications' requirements vary from one type of application to the other. Therefore, QoE for multiple applications is modeled using an exponential mapping function of the quality score Φ_ς as follows,

QoE = Σ_ς ϑ (e^{Φ_ς − α_ς} + e^{−Φ_ς + α_ς}) / (e^{α_ς} + β_ς) + 1     (2)

where ϑ is a scaling constant selected for the mapping function. The definition in (2) is a non-linear exponential monotonic mapping function, which suits our model as the performance metrics considered cannot be scaled uniformly, i.e., an equal perceived performance difference does not correspond to an equal numerical difference in the Φ_ς score. The considered QoS metrics, including latency, PLR, and PER, have an exponential interdependency with user QoE in the proposed Edge-IoT system. For example, when the QoE value is high, any variation in these metrics will heavily impact the QoE. However, considerable variation in these QoS metrics will not exhibit significant impact if the QoE is low. Thus, an exponential mapping function is able to capture the impact of QoS metrics on QoE, specifically for sensitive applications such as emergency response. In addition, different experiments in the literature demonstrated that exponential mapping outperforms other mapping functions such as linear or logarithmic [49].

C. Problem Formulation

In this section, we formulate the resource allocation optimization problem with the goal of maximizing the QoE found in (2) with consideration of all applications. QoE is rewarded when applications' QoS requirements are satisfied, and user satisfaction is thereafter achieved through the high QoE. The resource allocation optimization problem for QoE maximization is formulated as:

max_{(x_{r,i,j}), α_ς}  QoE                         (3)
s.t.   x_{r,i,j} ≥ 0,   ∀ j, r                      (4)
       Σ_j Σ_r x_{r,i,j} ≤ C_j,   ∀ i               (5)
       T^ς ≤ T^max                                  (6)
       R_E^ς ≤ R_E^max                              (7)
       R_L^ς ≤ R_L^max.                             (8)

The edge server's capacity C_j is defined in constraint (5) to confirm that the allocated resources cannot exceed the server capacity. Equations (6), (7), and (8) are the constraints for the QoS metrics T, R_E, and R_L, respectively, to guarantee that they will not exceed the maximum thresholds. Note that ς is used to indicate the performance metric achieved for a certain application. The definition of QoE in (2) is derived as a function of the quality scores for each QoS metric Φ and
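To make the QoE model concrete, the following minimal Python sketch evaluates the cumulative quality score of (1) and the QoE of (2) for a single application. The allocation indicator, metric weights, per-metric quality scores, QoS class, and priority used below are made-up illustrative values in the spirit of Table II, not values taken from the paper.

# Illustrative sketch of the quality score (1) and QoE (2); all numeric values
# below are assumed example inputs, not the paper's parameters.
import numpy as np

def cumulative_quality_score(x, phi_T, phi_RL, phi_RE, w):
    """Eq. (1): Phi = sum_{i,j,r} x[r,i,j] * (w1*Phi_T + w2*Phi_RL + w3*Phi_RE)."""
    metric_score = w[0] * phi_T + w[1] * phi_RL + w[2] * phi_RE
    return x.sum() * metric_score

def qoe(phis, alphas, betas, theta=1.0):
    """Eq. (2): QoE = sum_s theta*(e^(Phi_s-a_s)+e^(-Phi_s+a_s))/(e^(a_s)+b_s) + 1."""
    phis, alphas, betas = map(np.asarray, (phis, alphas, betas))
    terms = theta * (np.exp(phis - alphas) + np.exp(-phis + alphas)) / (np.exp(alphas) + betas)
    return terms.sum() + 1.0

# One emergency-response user served by one edge server with one resource type.
x = np.ones((1, 1, 1))                   # allocation indicator x[r, i, j]
phi = cumulative_quality_score(x, phi_T=10, phi_RL=9, phi_RE=9, w=(0.5, 0.25, 0.25))
print(qoe([phi], alphas=[1], betas=[1]))  # QoE for a single application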
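As a complementary illustration, the sketch below checks a candidate allocation against the constraints (4)-(8) of the QoE maximization problem. The capacities, thresholds, and achieved metrics are made-up example values, and the capacity check is written per edge server, which is one natural reading of (5).

# Assumed-value feasibility check for constraints (4)-(8); not the paper's code.
import numpy as np

def feasible(x, C, T, T_max, RE, RE_max, RL, RL_max):
    """x[r, i, j]: allocation of resource r to device i on edge server j."""
    nonneg = np.all(x >= 0)                                    # constraint (4)
    capacity = np.all(x.sum(axis=(0, 1)) <= C)                 # per-server load vs. C_j, cf. (5)
    qos = (T <= T_max) and (RE <= RE_max) and (RL <= RL_max)   # constraints (6)-(8)
    return nonneg and capacity and qos

x = np.array([[[1.0, 0.0], [0.5, 0.5]]])       # 1 resource type, 2 devices, 2 servers
print(feasible(x, C=np.array([2.0, 2.0]),
               T=40, T_max=50, RE=0.01, RE_max=0.02, RL=0.02, RL_max=0.05))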
1) First Stage (Action Exploration and Evaluation): In this stage, the DNN receives the Edge-IoT state information S_t at time t, defined as the IoT applications' resource demand y_i, the QoS requirements, and the resources available at the edge: S_t = {y_i, R_L^max, R_E^max, T^max, C_j}. According to the current action policy, denoted as π_{θ_t}: {S_t} → X_t, a set of joint actions is generated by the DNN and denoted by a mapping f_{θ_t} as follows,

X_t = f_{θ_t}(S_t)     (9)

where X_t = {X_k^t, k = 1, 2, ..., K} and X_k^t = {x_{r,i,j}^t, α_ς^t} is the kth entry of X_t. Each entry in X_t is a joint action and is assumed to be continuous. The universal approximation theorem states that if the hidden layers have a large number of hidden neurons and a proper activation function is applied at the neurons, they are sufficient to approximate any continuous mapping f [51]. We exploit ReLU as the activation function [52] of the hidden layers, where the output b and input v of a neuron are related by b = max{v, 0}. In the output layer, we use the sigmoid activation function b = 1/(1 + e^{−v}). It is necessary to map the set of joint actions X_t to a discrete action set such that the actions can be evaluated by the reinforcement learning Q-value function. We employ the typical K-nearest-neighbors (KNN) algorithm [53] to do the mapping. After obtaining the candidate discrete joint actions from KNN, the performance of these actions is evaluated using reinforcement learning. The action evaluation is conducted based on the QoE optimization objective defined in (3).
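A rough sketch of this exploration step is given below: a stand-in two-layer network with ReLU and sigmoid activations plays the role of f_θ in (9), and a plain Euclidean K-nearest-neighbors search maps its continuous output onto candidate discrete joint actions. The toy state, random weights, discrete action grid, and K are assumptions for illustration, not the authors' implementation.

# Minimal sketch of DNN action generation plus KNN discretization (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def dnn_policy(state, theta):
    """Stand-in for f_theta in (9): one ReLU hidden layer, sigmoid output in [0, 1]."""
    h = np.maximum(theta["W1"] @ state + theta["b1"], 0.0)          # ReLU
    return 1.0 / (1.0 + np.exp(-(theta["W2"] @ h + theta["b2"])))   # sigmoid

def knn_discretize(continuous_action, discrete_actions, k=3):
    """Return the k discrete joint actions closest (Euclidean) to the DNN output."""
    d = np.linalg.norm(discrete_actions - continuous_action, axis=1)
    return discrete_actions[np.argsort(d)[:k]]

state = np.array([0.4, 0.7, 0.2, 0.9])                  # toy Edge-IoT state S_t
theta = {"W1": rng.normal(size=(8, 4)), "b1": np.zeros(8),
         "W2": rng.normal(size=(2, 8)), "b2": np.zeros(2)}
grid = np.array([[r / 4, c] for r in range(5) for c in (0.0, 0.5, 1.0)])  # toy (allocation, class) pairs
candidates = knn_discretize(dnn_policy(state, theta), grid, k=3)
print(candidates)   # candidate discrete joint actions passed to the Q-value evaluation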
We assume that the Edge-IoT environment evolves as a discrete-time Markov decision process (DTMDP). The maximization problem in (3) falls within the domain of a DTMDP. In order to find the optimal action policy, we define a DTMDP that associates an action with every Edge-IoT state, a state transition, and a reward function. The state transitions and actions occur at discrete time epochs. The DeepEdge controller monitors the Edge-IoT state S_t in the current epoch t and generates discrete joint actions X_t, which are found using the DNN. A reward function is generated for each joint action X_t at the end of the epoch. The reward function R_t is selected to be the QoE defined in (2). The formal expression for the DTMDP is given as (S, X, T, R), where T: S × X × S → [0, 1] is a state transition probability function. Ultimately, the objective of DRL integrated with the DTMDP is to find an optimal joint action X_t that maximizes the QoE in (2). The Q-value of the reinforcement learning, exploited to evaluate the joint action, is defined as the current expected reward plus a future discounted reward as follows,

Q^*(S_t, X_t) = E[ R(S_t, X_t) + ϕ max_{X′ ∈ X_{t′}} Q^*(S_{t′}, X_{t′}) ]     (10)

where ϕ ∈ (0, 1] is the discount factor. The optimal Q-value Q^*(S_t, X_t) is updated by the change in the Q-value according to the transition from state S_t to state S_{t′} under the action X_t at epoch t as follows,

Q^{t+1}(S_t, X_t) = (1 − μ^t) Q(S_t, X_t) + μ^t [ R(S_t, X_t) + ϕ max_{X′ ∈ X} Q(S_{t′}, X_{t′}) ]     (11)

where μ ∈ [0, 1] is the learning rate. Reinforcement learning is a stochastic approximation method that solves the Bellman optimality equation associated with the DTMDP. It does not require a state transition probability model, as it converges with probability one to a solution if Σ_{t=1}^{∞} μ^t is infinite, Σ_{t=1}^{∞} (μ^t)^2 is finite, and all state/action pairs are visited infinitely often [54].

One of the main shortcomings of using the Q-value for action evaluation in the dynamic Edge-IoT environment is the large state space. It is not feasible to use state/action tables and look up the corresponding Q-value in such an environment for action evaluation. Thus, it is necessary to approximate the Q-value. This approximation reduces the complexity of the system and enhances its convergence. Hence, we approximate the Q-value as a function of a smaller set of variables, in which the Q-value utilizes a countable state space S^* through the function Q: S^* × X. This function is referred to as a function approximator. The vector ρ = {ρ_p}_{p=1}^{P} is exploited to approximate the Q-value by minimizing the difference between Q^*(S_t, X_t) and Q(S_t, X_t, ρ) for all (S_t, X_t) ∈ S^* × X. Thus, the approximated Q-value is formalized as Q(S_t, X_t, ρ) = Σ_{p=1}^{P} ρ_p ψ_p(S_t, X_t) = ρ ψ^T(S_t, X_t), where T denotes the transpose operator, the vector ψ(S_t, X_t) = [ψ_p(S_t, X_t)]_{p=1}^{P} with scalar functions ψ_p(S_t, X_t) identified as the basis functions (BF) over S^* × X, and ρ_p (p = 1, ..., P) are the associated weights. We use the Stochastic Gradient Descent (SGD) method to update the weights. The Q-value update rule in (11) is redefined as follows,

ρ^{t+1} ψ^T(S_t, X_t) = [ (1 − μ^t) ρ^t ψ^T(S_t, X_t) + μ^t ( R(S_t, X_t) + ϕ max_{X′ ∈ X} ρ^t ψ^T(S_{t′}, X_{t′}) ) ] ψ(S_t, X_t)     (12)

where the gradient is a vector of partial derivatives with respect to the elements of ρ^t.

2) Second Stage (Action Exploitation and DNN Training): The action with the highest Q-value, X_t^*, must be exploited among the other actions of state S_t and added to the replay memory to train the DNN. The replay memory is populated with the state/action pairs that have the highest Q-value over a certain number of iterations. The action exploitation is accomplished through determination of the action policy (π_ς^t), which is defined as the probability of selecting action X_t at state S_t. It corresponds to the set of actions with the highest Q-value. The attainment of this policy is tied to resolving the exploration vs. exploitation tradeoff. Exploration aims to look for new joint actions, so that the learner does not only utilize the actions known to achieve a high Q-value. Exploitation is the process of using the good actions available. The most common method to balance exploration and exploitation is ε-greedy selection [48], where ε is the portion of the time that a learning agent takes a randomly selected action instead of taking the action that is most likely to maximize its reward given the actions available. However, ε-greedy selects equally among the available actions, i.e., the worst action is as likely to be chosen as the best one. In order to overcome this issue, we develop a new method in which the action selection probabilities are varied as
a graded function of the Q-value. The best joint action is given the highest selection probability while the others are ranked according to their Q-values. The Boltzmann distribution [50] is adopted to achieve this ranking. The action selection probability at epoch t is given as follows,

π_ς^*(S_t, X_t) = e^{Q(S_t, X_t)/τ} / Σ_{X′ ∈ X} e^{Q(S_t, X′)/τ}     (13)

where τ is a positive parameter; when τ takes a high value, the action probabilities are nearly equal, whereas a low value of τ indicates a big difference in the selection probabilities of actions with different Q-values. This action selection probability is updated after the Q-value approximation as follows,

π_ς^*(S_t, X_t) = e^{ρ^t ψ^T(S_t, X_t)/τ} / Σ_{X′ ∈ X} e^{ρ^t ψ^T(S_t, X′)/τ}.     (14)
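A compact sketch of the approximated Q-value and the graded action selection of (13)-(14) follows. The basis functions ψ, the candidate actions, the temperature τ, and the step sizes are illustrative assumptions, and the weight update is written in a standard semi-gradient form in the spirit of (11)-(12) rather than as the exact published rule.

# Illustrative sketch: linear Q-value approximation, a TD-style weight update,
# and Boltzmann (softmax) ranking of candidate joint actions.
import numpy as np

def psi(state, action):
    # Assumed basis functions over (state, action): raw features plus a bias term.
    return np.concatenate([state, action, [1.0]])

def q_approx(rho, state, action):
    # Q(S, X, rho) = rho^T psi(S, X)
    return rho @ psi(state, action)

def update_rho(rho, s, x, reward, s_next, candidates, mu=0.05, phi=0.9):
    # Semi-gradient step toward R + phi * max_X' Q(S', X'), cf. (11)-(12).
    target = reward + phi * max(q_approx(rho, s_next, x2) for x2 in candidates)
    return rho + mu * (target - q_approx(rho, s, x)) * psi(s, x)

def boltzmann_probs(rho, s, candidates, tau=1.0):
    # pi(X) = exp(Q(S,X)/tau) / sum_X' exp(Q(S,X')/tau), as in (13)-(14).
    q = np.array([q_approx(rho, s, x) for x in candidates]) / tau
    q -= q.max()                         # subtract max for numerical stability
    p = np.exp(q)
    return p / p.sum()

s, s_next = np.array([0.4, 0.7, 0.2]), np.array([0.5, 0.6, 0.1])
candidates = [np.array([a, c]) for a in (0.0, 0.5, 1.0) for c in (1.0, 2.0, 3.0)]
rho = np.zeros(len(s) + 2 + 1)
rho = update_rho(rho, s, candidates[4], reward=3.2, s_next=s_next, candidates=candidates)
print(boltzmann_probs(rho, s, candidates, tau=0.5))   # graded selection probabilities
# A large tau makes the probabilities nearly equal; a small tau lets the
# highest-Q joint action dominate, matching the ranking behavior described above.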
The selected state/action pairs are added to the memory at each epoch and utilized later to train the DNN. This improves the upcoming joint actions that will be generated by the DNN in future epochs. To achieve this, DeepEdge maintains an initially empty memory of limited capacity. At the t-th epoch, a new training sample (S_t, X_t^*) is added to the memory. When the memory is full, the newly generated data sample replaces the oldest one. The experience replay technique [42], [55] is utilized to train the DNN using the stored data samples. After a certain number of epochs, when there is enough data to train the DNN, we randomly select a group of training data samples {(S_υ, X_υ^*) | υ ∈ Υ_t} from the memory, where Υ is the set of selected time indices. The DNN parameter θ_t is updated using the Adam algorithm [56], which targets minimization of the average cross-entropy loss L(θ_t) defined as follows,

L(θ_t) = − (1/|Υ_t|) Σ_{υ ∈ Υ_t} [ (X_υ^*)^T log f_{θ_t}(S_υ) + (1 − X_υ^*)^T log(1 − f_{θ_t}(S_υ)) ]     (15)

where |Υ_t| is the size of Υ_t, T denotes the transpose operator, and the log function is the element-wise logarithm operation for a vector. We start the training step when the number of samples is larger than half of the memory size. Eventually, the DNN learns the best joint action for each state (S_t, X_t^*). Thus, it becomes smarter and continuously improves its produced joint actions.
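The training signal in (15) is an ordinary binary cross-entropy between the DNN output and the stored best action, as the short sketch below illustrates with a toy replay memory and stand-in predictions. The memory contents, batch size, and predicted values are assumptions; a real DNN forward pass would replace the random placeholder.

# Illustrative sketch of uniform replay sampling and the loss in (15).
import numpy as np

rng = np.random.default_rng(1)

def cross_entropy_loss(best_actions, predictions, eps=1e-12):
    """Eq. (15): L = -(1/|Y|) sum_v [x*^T log f(S) + (1-x*)^T log(1-f(S))]."""
    x = np.asarray(best_actions, dtype=float)
    f = np.clip(np.asarray(predictions, dtype=float), eps, 1.0 - eps)
    return -np.mean(np.sum(x * np.log(f) + (1.0 - x) * np.log(1.0 - f), axis=1))

# Replay memory of (state, best joint action) pairs; actions scaled to [0, 1].
memory = [(rng.random(4), rng.integers(0, 2, size=2).astype(float)) for _ in range(64)]
batch_idx = rng.choice(len(memory), size=16, replace=False)      # uniform sampling
batch_actions = np.array([memory[i][1] for i in batch_idx])
batch_preds = rng.random((16, 2))        # stand-in for f_theta(S_v); a real DNN goes here
print(cross_entropy_loss(batch_actions, batch_preds))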
The two-stage DRL procedure for resource allocation is presented in Algorithm 1. The algorithm acquires the Edge-IoT state information, which includes the QoS requirements, resource demand, and edge servers' resource capacity information. It starts by initializing the DNN with certain parameters θ. The DNN generates the joint actions. The output of the DNN is converted to a discrete format and then received by the approximated reinforcement learning to evaluate the actions generated by the DNN. The actions with the highest Q-value are exploited according to the probability in (14) and used to populate the dedicated memory of the DNN. After a certain number of epochs, a sample of state/action pairs is fetched from the memory and used for DNN training and updating θ using the Adam algorithm.

Algorithm 1 Two-Stage Deep Reinforcement Learning Algorithm to Solve Resource Allocation in DeepEdge
Require: Network state S_t, which includes the QoS requirements of the application, the resource demands, and the edge servers' resource capacity at each epoch t
Ensure: Joint action for resource allocation and QoS metric class X_t = {X_t^*, α_ς^t}
1: BEGIN
2: Initialize the DNN with random parameters θ_t and an empty replay memory
3: Set the iteration number m and the training interval Ω
4: for t = 1 to m do
5:   Generate a set of joint actions X_t = f_{θ_t}(S_t)
6:   Use KNN to convert the continuous set of actions into a discrete set
7:   Run the approximated reinforcement learning to evaluate the actions for resource allocation, which must satisfy X_t^* = max_X ρ ψ^T(S_t, X_t)
8:   Exploit actions according to (14)
9:   Update the memory by adding (S_t, X_t^*)
10:  if Ω = 1 then
11:    Uniformly select a group of data samples {(S_υ, X_υ)|υ ∈ Υ_t} from the memory
12:    Train the DNN with {(S_υ, X_υ)|υ ∈ Υ_t} and update θ_t using the Adam algorithm
13:  end if
14: end for
15: END

The complexity of the proposed two-stage DRL is found based on the number of edge servers J, the number of available resources of a certain type r, and the number of devices that demand the resources N. The implementation of the DRL algorithm considers different applications and scenarios. It associates action generation for the devices with the available resources and edge servers. The computation complexity of the action exploration stage of the DRL is O(J N^r) operations. The complexity of the exploitation and training stage is O(mΩ), according to the number of epochs m and the training interval Ω. The memory requirement to store the samples for DNN training is N^{(r·J)}. Exploration and exploitation are achieved with the merit of the approximated Q-value O(Q_{θ_t}(S_t, X_t, ρ)) instead of the typical Q-value in traditional Q-learning. The computation complexity of our proposed two-stage DRL is acceptable given the achieved performance and in comparison with traditional Q-learning, which has an exponential computational complexity of O(N^{J·r}). Traditional Q-learning may only achieve the maximum achievable QoE by searching all possible combinations of states/actions/rewards. Consequently, it requires a larger number of operations and its computation complexity escalates in an exponential pattern.

V. PERFORMANCE EVALUATION

We evaluate the performance of the proposed DeepEdge for resource allocation in Edge-IoT with respect to the average
TABLE III
SYSTEM PARAMETERS

TABLE V
EVALUATIONS OF DEEPEDGE RESOURCE ALLOCATION SCENARIOS
C. Evaluation of Various DeepEdge Resource Allocation Scenarios

In this subsection, we discuss and evaluate multiple scenarios of how DeepEdge operates to perform resource allocation with QoE maximization. Fig. 6 depicts the scenarios of resource allocation for multiple heterogeneous applications. The first scenario has 100 IoT users which run the emergency response application with high QoS requirements. The resource requests of the emergency response application are sent to the RAM in the controller. Each request is processed through the two-stage DRL by selecting the most appropriate QoS class α_ς and allocating edge resources accordingly. In the second scenario (application heterogeneity), it is assumed that each one of 200 IoT users runs two applications (emergency response and personal identification), which lets the controller treat all the IoT users the same. The RAM here receives requests from the same user but for multiple applications. It recognizes the application type ς, identifies the applications' priority β_ς, and analyzes their QoS requirements. Then, it enforces the application QoS class adaptation starting with the lower priority application. For example, the QoS class of the personal identification application, which has the lowest priority, will be adapted first through proper selection of its α_ς. The two-stage DRL allocates the resources for both applications with the goal of maximizing the QoE in (3). In the third scenario (user and application heterogeneity), we present two evaluation examples. First, there are 300 heterogeneous IoT users of which 100 users run emergency response, 100 users run health monitoring, and 100 users run two applications, emergency response and health care monitoring. In the second example, there are 400 IoT users of which 100 users run emergency response, 100 users health monitoring, 100 users personal identification, and 100 users run the three applications simultaneously. All the users report their requests along with the QoS requirements of the applications to the RAM at the controller. All the requests are sorted according to the user index i and application type ς. Then, the RAM allocates resources to these applications with consideration of the application priority β_ς and the resource availability at the edge. These parameters are exploited by the two-stage DRL to adapt the QoS class α_{i,ς} and allocate resources accordingly, with the goal of maximizing the joint QoE for all users and the satisfaction of their applications. Table V presents the specifications of the three scenarios, the QoS metric requirements, and the average metrics achieved by DeepEdge for each application. We observe that DeepEdge always maintains the QoS metrics below the specified thresholds, even in the most complicated setting of the third scenario.

Moreover, QoE is evaluated with consideration of the different scenarios presented in Table V to demonstrate DeepEdge's capability to tackle the heterogeneity of IoT applications in resource allocation. The QoE function derived in (2) is exploited as an evaluation metric to demonstrate the merit of the proposed two-stage DRL against other DRL schemes: the DQN-based scheme (AD) [31] and the actor-critic scheme (DR-Learning) [32]. However, the QoE function for the AD and DR-Learning schemes is calculated using the quality score of the application latency only (not including the quality scores for PLR and PER), as it is the only QoS metric they considered as an optimization goal. The average QoE is plotted in Fig. 7, which shows that DeepEdge outperforms both schemes as they lack the capability of handling multiple applications
TABLE VI
EVALUATIONS OF RUNTIME FOR ALL SCHEMES IN DIFFERENT SCENARIOS
[27] M. Khayyat, I. A. Elgendy, A. Muthanna, A. S. Alshahrani, S. Alharbi, and A. Koucheryavy, "Advanced deep learning-based computational offloading for multilevel vehicular edge-cloud computing networks," IEEE Access, vol. 8, pp. 137052–137062, 2020.
[28] H. Peng and X. Shen, "Deep reinforcement learning based resource management for multi-access edge computing in vehicular networks," IEEE Trans. Netw. Sci. Eng., vol. 7, no. 4, pp. 2416–2428, Oct.–Dec. 2020.
[29] Q. Qi et al., "Knowledge-driven service offloading decision for vehicular edge computing: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 5, pp. 4192–4203, May 2019.
[30] X. Xiong, K. Zheng, L. Lei, and L. Hou, "Resource allocation based on deep reinforcement learning in IoT edge computing," IEEE J. Sel. Areas Commun., vol. 38, no. 6, pp. 1133–1146, Jun. 2020.
[31] J. Wang, L. Zhao, J. Liu, and N. Kato, "Smart resource allocation for mobile edge computing: A deep reinforcement learning approach," IEEE Trans. Emerg. Topics Comput., vol. 9, no. 3, pp. 1529–1541, Jul.–Sep. 2021.
[32] Y. Wei, F. R. Yu, M. Song, and Z. Han, "Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor-critic deep reinforcement learning," IEEE Internet Things J., vol. 6, no. 2, pp. 2061–2073, Apr. 2019.
[33] H. Zhang, W. Wu, C. Wang, M. Li, and R. Yang, "Deep reinforcement learning-based offloading decision optimization in mobile edge computing," in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), 2019, pp. 1–7.
[34] H. Meng, D. Chao, and Q. Guo, "Deep reinforcement learning based task offloading algorithm for mobile-edge computing systems," in Proc. 4th Int. Conf. Math. Artif. Intell. (ICMAI), 2019, pp. 90–94.
[35] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, "Performance optimization in mobile-edge computing via deep reinforcement learning," 2018, arXiv:1804.00514.
[36] T. Yang, Y. Hu, M. C. Gursoy, A. Schmeink, and R. Mathar, "Deep reinforcement learning based resource allocation in low latency edge computing networks," in Proc. Int. Symp. Wireless Commun. Syst. (ISWCS), Lisbon, Portugal, Aug. 2018, pp. 1–5.
[37] Y. Xiao, M. Noreikis, and A. Ylä-Jääski, "QoS-oriented capacity planning for edge computing," in Proc. IEEE Int. Conf. Commun. (ICC), 2017, pp. 1–6.
[38] Z. Ye, S. Mistry, A. Bouguettaya, and H. Dong, "Long-term QoS-aware cloud service composition using multivariate time series analysis," IEEE Trans. Services Comput., vol. 9, no. 3, pp. 382–393, May/Jun. 2016.
[39] R. Mahmud, S. Srirama, K. Ramamohanarao, and R. Buyya, "Quality of experience (QoE)-aware placement of applications in Fog computing environments," J. Parallel Distrib. Comput., vol. 132, no. 3, pp. 190–203, Oct. 2019.
[40] Y. Lu, M. Motani, and W.-C. Wong, "A QoE-aware resource distribution framework incentivizing context sharing and moderate competition," IEEE/ACM Trans. Netw., vol. 24, no. 3, pp. 1364–1377, Jun. 2016.
[41] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[42] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[43] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. 33rd Int. Conf. Mach. Learn., Jun. 2016, pp. 1928–1937.
[44] S. Basso, M. Meo, A. Servetti, and J. De Martin, "Estimating packet loss rate in the access through application-level measurements," in Proc. ACM SIGCOMM Workshop Meas. Stack (W-MUST), 2012, pp. 7–12.
[45] B. Han and S. Lee, "Efficient packet error rate estimation in wireless networks," in Proc. 3rd Int. Conf. Testbeds Res. Infrastruct. Develop. Netw. Commun., 2007, pp. 1–9.
[46] K. Nagin, A. Kassis, D. Lorenz, K. Barabash, and E. Raichstein, "Estimating client QoE from measured network QoS," in Proc. 12th ACM Int. Conf. Syst. Storage (SYSTOR), 2019, p. 188.
[47] T. Hoßfeld, P. E. Heegaard, L. Skorin-Kapov, and M. Varela, "Fundamental relationships for deriving QoE in systems," in Proc. 11th Int. Conf. Qual. Multimedia Exp. (QoMEX), 2019, pp. 1–6.
[48] M. Tokic, "Adaptive ε-greedy exploration in reinforcement learning based on value differences," in Advances in Artificial Intelligence (Lecture Notes in Computer Science), vol. 6359. Heidelberg, Germany: Springer, 2010.
[49] D. D. Hora, A. Asrese, V. Christophides, R. Teixeira, and D. Rossi, "Narrowing the gap between QoS metrics and Web QoE using above-the-fold metrics," in Passive and Active Measurement. Cham, Switzerland: Springer, 2018.
[50] A. D. Tijsma, M. M. Drugan, and M. A. Wiering, "Comparing exploration strategies for Q-learning in random stochastic mazes," in Proc. IEEE Symp. Series Comput. Intell. (SSCI), Dec. 2016, pp. 1–8.
[51] S. Marsland, Machine Learning: An Algorithmic Perspective. New York, NY, USA: CRC Press, 2015.
[52] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn. (ICML), Jun. 2010, pp. 807–814.
[53] K. Fukunaga and P. M. Narendra, "A branch and bound algorithm for computing k-nearest neighbors," IEEE Trans. Comput., vol. 100, no. 7, pp. 750–753, Jul. 1975.
[54] C. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3, pp. 279–292, 1992.
[55] J. Lin, "Reinforcement learning for robots using neural networks," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Rep. CMU-CS-93-103, 1993.
[56] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–15.

Ismail AlQerm (Member, IEEE) received the Ph.D. degree in computer science from the King Abdullah University of Science and Technology (KAUST) in 2017. He is a Postdoctoral Research Associate with the Department of Computer Science, University of Missouri–St. Louis. His research interests include edge computing, resource allocation in IoT networks, developing machine learning techniques for resource allocation in wireless networks, and software defined radio prototypes. He was among the recipients of the KAUST Provost Award. He is a member of ACM.

Jianli Pan (Senior Member, IEEE) received the M.S. degree in information engineering from the Beijing University of Posts and Telecommunications, China, and the M.S. and Ph.D. degrees from the Department of Computer Science and Engineering, Washington University at St. Louis, USA. He is currently an Associate Professor with the Department of Computer Science, University of Missouri–St. Louis, St. Louis, MO, USA. His current research interests include Internet of Things, edge computing, machine learning, cybersecurity, and smart energy. He is an Associate Editor for IEEE Communications Magazine and IEEE Access.