
IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, VOL. 18, NO. 4, DECEMBER 2021

DeepEdge: A New QoE-Based Resource Allocation Framework Using Deep Reinforcement Learning for Future Heterogeneous Edge-IoT Applications

Ismail AlQerm, Member, IEEE, and Jianli Pan, Senior Member, IEEE

Abstract—Edge computing is emerging to empower the future of Internet of Things (IoT) applications. However, due to the heterogeneity of applications, it is a significant challenge for the edge cloud to effectively allocate multidimensional limited resources (CPU, memory, storage, bandwidth, etc.) under the constraints of applications' Quality of Service (QoS) requirements. In this paper, we address the resource allocation problem in Edge-IoT systems by developing a novel framework named DeepEdge that allocates resources to heterogeneous IoT applications with the goal of maximizing users' Quality of Experience (QoE). To achieve this goal, we develop a novel QoE model that aligns the heterogeneous requirements of IoT applications with the available edge resources. The alignment is achieved through selection of the QoS requirement range that can be satisfied by the available resources. In addition, we propose a novel two-stage deep reinforcement learning (DRL) scheme that effectively allocates edge resources to serve the IoT applications and maximize the users' QoE. Unlike typical DRL, our scheme exploits deep neural networks (DNN) to improve action exploration by using the DNN to map the Edge-IoT state to a joint action that consists of resource allocation and QoS class selection. The joint action not only maximizes users' QoE and satisfies heterogeneous applications' requirements but also aligns the QoS requirements with the available resources. In addition, we develop a Q-value approximation approach to tackle the large state/action space problem of Edge-IoT. Evaluation shows that DeepEdge brings considerable improvements in terms of QoE, latency, and application tasks' success ratio in comparison to existing resource allocation schemes.

Index Terms—Resource allocation, DeepEdge, Edge-IoT, deep reinforcement learning (DRL), quality of experience (QoE).

Manuscript received April 10, 2021; revised August 23, 2021; accepted October 19, 2021. Date of publication October 29, 2021; date of current version December 9, 2021. The work was supported by National Science Foundation (NSF) CNS core grant No. 1909520. The associate editor coordinating the review of this article and approving it for publication was H. Lutfiyya. (Corresponding author: Ismail AlQerm.) The authors are with the Department of Computer Science, University of Missouri–St. Louis, St. Louis, MO 63121 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNSM.2021.3123959

I. INTRODUCTION

Growing Internet of Things (IoT) applications such as Google Home and Amazon Echo raise the demand for cloud computing platforms for data processing. However, it is very difficult for the existing centralized cloud computing model to scale with the projected large number of IoT devices and ubiquitous applications, due to the large amount of generated data to be sent over a relatively long distance between IoT devices and clouds. Edge or fog computing [1], [2], [3] is considered a potential approach to fulfill these applications' demands by moving more computing, storage, and intelligence resources to the edge, which would benefit IoT applications that are delay-sensitive, bandwidth/data intensive, or that require closer intelligence. We envision a future "Edge-IoT" environment where various IoT applications could use edge computing to fulfill their resource demands and performance requirements. To enable such a vision, there are some significant challenges to overcome. On the one hand, from the demand side, a massive number of IoT devices can run heterogeneous applications with various Quality of Service (QoS) requirements and different priorities. On the other hand, from the supply side, the edge clouds are expected to dynamically allocate multidimensional resources (CPU, storage, and bandwidth) at geospatially distributed points and different levels of the network hierarchy. This severely complicates the required resource allocation and scheduling algorithms. Most of the current edge computing research either focuses on resource allocation without paying attention to the QoS requirements of heterogeneous applications, or optimizes specific operations such as mobile offloading, migration, placement, chaining and orchestration [4], [5], [6].

In this paper, we develop a new Edge-IoT framework named DeepEdge using deep reinforcement learning (DRL) that allocates resources to heterogeneous IoT applications with the goal of maximizing users' Quality of Experience (QoE). Unlike the existing resource allocation schemes in Edge-IoT research, our proposed DeepEdge framework ensures IoT users' satisfaction with guaranteed heterogeneous applications' QoS and accounts for the dynamic resource availability at the edge in the resource allocation decisions. The paper has the following new contributions that align with DeepEdge's goals.
• We develop a novel QoE model that maps the applications' QoS requirements to a cumulative QoE score that reflects the IoT users' satisfaction. The developed QoE model is noteworthy as it supports adjustment of the acceptable ranges of QoS requirements to match the available resources at the edge. In addition, it specifies a certain weight for each QoS performance metric to emphasize its impact on the overall application performance.

• We propose a novel two-stage DRL scheme to fulfill the QoE model objectives by generating joint actions that include QoS class selection, which aligns applications' QoS requirements with the available resources, in addition to the resource allocation action. The scheme exploits deep neural networks (DNN) to map the Edge-IoT state information to resource allocation joint actions.
• The proposed DRL tackles the dimensionality problem in the heterogeneous Edge-IoT environment, where the size of the state and action space is large. It formulates the Q-value in a compact representation in which it is approximated as a function of a smaller set of variables.
• The proposed DRL scheme tackles the tradeoff between exploration and exploitation encountered in DRL action generation by ranking the actions according to their Q-values, which avoids the equal probability of action selection used in ε-greedy based exploration solutions [48].

The rest of the paper is organized as follows. The related work, its shortcomings, and the motivation for the QoE and DRL based resource allocation are presented in Section II. Section III describes the DeepEdge system architecture, system model and QoE optimization problem formulation. The two-stage DRL-based resource allocation scheme is illustrated in Section IV. Section V presents the performance evaluation, and the paper concludes in Section VI.

II. RELATED WORK AND MOTIVATION

In this section, the related work is discussed. In addition, we present the motivation for developing a QoE model that is backed by DRL for resource allocation.

A. Related Work

The potential benefits of edge computing in different network applications have been studied extensively in the recent literature. A large number of existing works have focused on edge computing either on allocation for specific applications, or on optimizing some operations such as offloading, migration, and orchestration [4], [5], [6]. For offloading, many schemes have been proposed to make offloading decisions that optimize energy consumption and delay performance [7], [8], [9], [10], [11]. Some of the proposals targeted allocation of edge resources, for example, the utilization of distributive game-theoretical approaches for resource allocation in "cloud-edge" multi-level networks [12]. The authors in [9] proposed an optimization framework for energy-efficient resource allocation, assuming that the network operator is aware of the complete information of all users' applications.

DRL has been employed for solving decision-making related problems in the context of edge computing, such as computation offloading [13], [14], [15], [16], management problems in vehicular networks [17], [18], [19], [20], [21] and edge resource allocation [22], [23]. For vehicular networks, DRL has been investigated to solve several problems including resource allocation [24] and computation offloading [25], [26], [27]. For instance, the work in [28] exploited DRL to solve the problem of edge resource management by leveraging hierarchical learning architectures. In [29], the authors proposed a knowledge-driven service offloading decision framework for vehicular networks in which the offloading decision was formulated for multiple tasks as a long-term planning problem solved by DRL. The authors in [30] proposed a resource allocation policy for the Edge-IoT system to improve the efficiency of resource utilization using deep Q-networks (DQN). The work in [31] proposed a DQN-based resource allocation scheme, which can allocate computing and network resources to reduce the average service time. In [32], a joint optimization solution solved by actor-critic DRL was proposed for allocation of resources in fog-enabled IoT systems. The work in [33] proposed a framework for edge offloading based on DRL with latency and power consumption minimization as optimization objectives. Task offloading in a single-user edge computing system was explored in [34], where DRL was exploited to optimize the trade-off between energy consumption and the slowdown of tasks in the processing queue. An online computation offloading scheme based on DQN was studied in [35] under random task arrivals. The work in [36] investigated strategies for the allocation of computational resources using DRL in edge computing networks.

Given the related work, none of the existing schemes considered awareness of multiple heterogeneous applications' demands and aligning them with the available resources at the edge. Heterogeneous IoT applications may have different requirements and characteristics. These requirements might not be fulfilled with the available resources at the edge at a certain time instant, given that the edge has limited computing power compared with the cloud, which has virtually unlimited computing power but relatively high latency. The problem of satisfying users' QoE and applications' demands for multiple heterogeneous applications in a dynamic IoT environment, with the ability to adjust applications' QoS requirements to fit the available resources, is not addressed. None of the proposed DRL schemes for resource allocation considered using DNN to diversify action generation rather than to approximate the value functions of reinforcement learning. In addition, the related work neither proposed an effective approach to tackle the problem of large state space in Edge-IoT nor effectively handled the tradeoff of exploration and exploitation in reinforcement learning. A series of typical Edge-IoT applications and their characteristics are summarized in Table I.

TABLE I: Edge-IoT Applications and Their Characteristics.


B. Motivation

1) Quality of Experience (QoE): QoS metrics have long been utilized as the performance optimization objective in resource allocation proposals [37], [38]. However, they do not capture the quality perceived by users, which may result in a waste of network resources. The QoE concept came into practice to enable a broader understanding of the impact of network performance and to complement traditional performance measurement. In contrast to QoS, QoE depends not only on the technical performance of the system but also on other factors such as contents, applications, user expectations and goals, and contexts of use. QoE is a more comprehensive evaluation, particularly for IoT application services, as it focuses on users' satisfaction reflected by application QoS guarantees through the maximization of certain quality scores. In this paper, motivated by the heterogeneous applications, which directly deal with the users' perception, QoE fits the resource allocation problem as an optimization goal since it can guarantee dynamic resource allocation with user satisfaction. Thus, we develop a novel QoE estimation model for edge resource allocation with heterogeneous IoT applications that has the following characteristics: 1) it monitors the Edge-IoT environment and gathers information including the QoS requirements of applications and edge resource availability; 2) it defines the QoS performance metrics that are associated with a certain IoT application and determines their impact on the application performance; 3) it supports the adjustment of the application's QoS to align with the available resources and boosts the achieved QoE.

There are some existing works utilizing the QoE concept in the edge computing context. For example, the authors in [39] proposed a QoE-aware application placement policy that prioritizes different application placement requests according to user expectations. In [40], a framework for edge computing resource distribution was proposed with crucial security and authentication components by which it ensures the delivery of users' QoE. However, none of these proposals considered the resource allocation problem for multiple heterogeneous applications, as each one focuses on a specific application, mostly HTTP video. Moreover, they do not tackle the situation when the QoS requirements of the application cannot be fulfilled by the available resources at a certain time.

2) Deep Reinforcement Learning: Reinforcement learning [41] such as Q-learning has become an active research area [42], [43]. It deals with agents that learn to make better decisions directly from experience through interacting with the environment. Recently, reinforcement learning was combined with deep learning techniques to develop DRL, which demonstrated significant impact on various applications such as video gaming, Computer Go, and data center cooling. DRL is well-suited for the resource allocation problem in Edge-IoT, given its large scale and dynamicity, for the following reasons: 1) Edge-IoT systems are dynamic in the context of resource demand, and resource availability varies over time, which makes it difficult to use numerical optimization to solve the resource allocation problem; DRL learns over time resource allocation actions that match the environment dynamics. 2) Resource allocation decisions made in the Edge-IoT context are highly repetitive; hence, they generate abundant training data for the DRL technique. 3) DRL is capable of modeling complex systems such as Edge-IoT systems, as various signals can be formulated as inputs to the DNN and the output strategy can be utilized in an online stochastic environment; with continuous learning, the learning agent becomes able to optimize specific tasks under varying conditions. 4) DRL does not require any prior knowledge of the system's behavior to learn a resource allocation policy. Moreover, it can support a variety of objectives just by using different reinforcement rewards.

III. SYSTEM DESCRIPTION

In this section, we present the proposed DeepEdge system model and architecture, the QoE model, and the QoE maximization problem formulation.

A. DeepEdge System Model and Architecture

The considered system model in this paper consists of multiple groups of IoT devices at one side of the network. These IoT devices demand resources from the edge of the network to support their applications in task processing. Each group of IoT devices runs different applications. These applications are assumed to be heterogeneous and may have distinct QoS requirements. In addition, the system model includes the edge at the other side of the network, which is considered the resource provider and manager of the Edge-IoT resource allocation. The resources are located at the edge servers, where computation, memory and other resources are available. The resource allocation process is managed by a controller located at the edge. It is a centralized component that receives IoT devices' requests and allocates resources at the edge servers using DRL integrated with QoE optimization.

The proposed DeepEdge architecture is presented in Fig. 1.

Fig. 1. DeepEdge Architecture.

The architecture consists of multiple components that work together to achieve resource allocation with maximum users' QoE. The architecture includes the IoT environment and the edge cloud. The IoT environment comprises multiple types of IoT devices: devices that run multiple applications, and multiple devices that run the same application, such as camera surveillance. The edge cloud consists of a certain number of edge servers that comprise virtual machines, memory, and computation resources. The IoT and edge computing sides update the controller implemented at the edge with their states. For example, the IoT side sends information about the QoS requirements of the IoT applications, and the edge computing servers provide information about the available resources, their location, and their current load. The controller incorporates a resource allocation manager (RAM), which runs the two-stage DRL scheme and decides the resource allocation policy that maximizes the QoE and adapts applications' QoS requirements to align with the available resources. The controller receives the application and edge server state information through the devices and servers modules, respectively. The QoE model is integrated with the controller to enforce its objectives and constraints to
achieve user satisfaction and efficient resource utilization in Edge-IoT with multiple heterogeneous applications.

B. The New QoE Model

The proposed QoE model aims to dynamically map the IoT applications' performance metrics, such as latency, into a cumulative quality score that evaluates the IoT user satisfaction. QoE is designated as the optimization objective that DeepEdge exploits to drive the resource allocation decisions. The QoE formulation consists of multiple QoS performance metrics that quantify the IoT application performance. A cumulative quality score is mapped to multiple quality scores, each corresponding to a QoS metric acceptable range. This range is tunable to enforce the QoS requirements to fit with the available resources. This brings a wide range of flexibility that can enhance the system resource allocation capability and maintain applications' operations uninterrupted. Moreover, the QoE model incorporates the following key attributes: 1) determination of the weights of QoS metrics according to their impact on the IoT application operation; 2) QoE estimation based on the achieved QoS performance metrics of multiple heterogeneous IoT applications and the application priority, which is determined according to the application type. For instance, applications with critical QoS requirements are given the highest priority.

Our QoE model is generic and can accommodate several IoT applications. We picked the following applications in this paper for demonstration purposes: emergency response, health monitoring, and personal identification. These applications comprise a broad range of QoS performance metrics, different priorities and various resource demands. Emergency response is latency sensitive with intensive data and high computation requirements, while health monitoring has low latency sensitivity and requires lower data and computation. Personal identification has the least priority with moderate data intensiveness, computation and latency requirements. We denote the application type by the index ς and the IoT user (device) that runs the application by index i ∈ N. The network model assumes that one device can run a single application or multiple applications, or that multiple devices run the same application. The QoE model considers latency (T), packet error rate (RE) and packet loss rate (RL) as QoS metrics for each application and assigns a certain weight w to each metric to describe its impact on the application QoS. A parameter called the application's QoS class ας is defined to represent the possible metric adjustment range. The range is evaluated using a quality score Φ, which quantifies the IoT user satisfaction and contributes to the cumulative quality score. ας is selected to achieve the alignment of the QoS requirements of the IoT application such that they are consistent with the application's priority and the available resources at the edge. The application's priority βς is ranked starting from 1 to indicate the highest priority and is assumed to be predetermined.

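To make the roles of the priority βς, the metric weights w, the QoS classes ας and the per-class quality score Φ concrete before Table II is discussed, the sketch below shows one possible in-code representation of these parameters. It is only an illustration: the numeric ranges and scores are placeholders and are not the entries of Table II.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class QoSClass:
    """One selectable QoS class: an acceptable range per metric and its quality score."""
    latency_ms: Tuple[float, float]   # acceptable latency range
    plr: Tuple[float, float]          # acceptable packet loss rate range
    per: Tuple[float, float]          # acceptable packet error rate range
    quality_score: float              # Phi contributed when this class is met

@dataclass
class AppProfile:
    priority: int                     # beta: 1 = highest priority
    weights: Dict[str, float]         # w for latency, PLR and PER
    classes: List[QoSClass]           # ordered classes; the index plays the role of alpha

# Illustrative placeholder values only -- not the actual entries of Table II.
profiles = {
    "emergency_response": AppProfile(
        priority=1,
        weights={"latency": 0.6, "plr": 0.2, "per": 0.2},
        classes=[
            QoSClass((0, 10), (0, 0.01), (0, 0.01), quality_score=10),
            QoSClass((10, 50), (0.01, 0.03), (0.01, 0.03), quality_score=6),
        ],
    ),
    "health_monitoring": AppProfile(
        priority=2,
        weights={"latency": 0.4, "plr": 0.3, "per": 0.3},
        classes=[
            QoSClass((0, 100), (0, 0.02), (0, 0.02), quality_score=10),
            QoSClass((100, 300), (0.02, 0.05), (0.02, 0.05), quality_score=6),
        ],
    ),
}
```

A resource allocation action would then select one class index (ας) per application, and the selected class's quality score feeds the cumulative score defined in (1).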

TABLE II: Sample Quality Score for Heterogeneous Applications With Various Requirements.

Table II presents the parameters of our QoE model, including βς, ας, w and the corresponding quality score Φ of each metric class, for the three heterogeneous applications considered in this model. The priority βς is specified based on how crucial the resource allocation is for the application. The performance metric weight w is selected to show how sensitive the application is to the corresponding metric. For example, emergency response is more sensitive to latency than to PLR or PER. The QoS class ας is set using DRL to maximize Φ and align with the resource availability at the edge. For instance, if the current resource request for a certain application at a certain time instant cannot be fulfilled due to lack of resources at the edge, the application's ας will be altered within certain ranges that maintain the application service and fit with the available resources. The metric classes indicated in Table II show examples of the metric ranges that correspond to a certain ας. The quality score Φ given in Table II shows how the selection of a different class ας affects the achieved QoE. All the presented values for metric ranges and Φ are for demonstration of the QoE model functionality and of how Φ is influenced by the selected ας. Moreover, the values of the metric ranges are tied to the specified metric weight. For instance, the high latency weight in emergency response causes its latency ranges of different classes to be lower than those of other applications.

Φ is mapped to the following metrics: latency (T), packet loss rate (RL) and packet error rate (RE). The latency T is calculated according to the link bandwidth, data size and the propagation medium. The packet loss rate RL is evaluated according to [44] as $R_L = \frac{MSS \cdot \eta}{goodput \cdot RTT}$, where MSS is the maximum segment size, η is a constant that incorporates the loss model and the acknowledgment strategy, goodput is the ratio of the delivered packets over the delivery completion time, and RTT is the round trip time. The packet error rate RE is found according to the estimation model in [45], which relies on the link characteristics found using statistics from two distinct types of probing messages. QoE combines user experience and expectation of the edge computing system with network performance. The performance of the edge system is typically evaluated by QoS metrics. Thus, it is necessary to have a qualitative relationship between QoS and QoE to be able to achieve a QoE control mechanism based on QoS with maximum efficacy [46], [47]. To achieve this, we use a generic formula to correlate the variation in QoE with the achieved QoS metrics, including latency, loss and error rates. The QoS metrics are represented by quality scores ΦT, ΦRL and ΦRE for latency, packet loss rate and packet error rate, respectively. Each of these scores is obtained based on the application type and the selected metric class ας, as indicated in Table II. For instance, if the resource allocation action selects ας as 1 for the emergency response application, which corresponds to the best range for all the QoS metrics, the quality score will be 10. The cumulative quality score achieved for each application with a certain amount of allocated resources is calculated as follows,

$\Phi_\varsigma = \sum_{i}\sum_{j}\sum_{r} x_{r,i,j}\left(w_1 \Phi_T + w_2 \Phi_{R_L} + w_3 \Phi_{R_E}\right)$    (1)

where xr,i,j is the resource allocation indicator with r as a resource type (CPU, memory, etc.), i is the index of the IoT device running the application, j is the index of the edge server providing the resources, and w is the weight of the performance metric. The cumulative quality score captures the impact of each of the QoS metrics on the overall performance. If the QoS metrics are below minimum thresholds, the cumulative quality score Φς will be compromised. The proposed definition of QoE reflects all its impacting parameters, including the cumulative normalized score Φς, the metric class (ας) and the priority (βς). It maps the relationship between Φς and QoE according to the applications' characteristics, since the applications' requirements vary from one type of application to the other. Therefore, QoE for multiple applications is modeled using an exponential mapping function of the quality score Φς as follows,

$\mathrm{QoE} = \sum_{\varsigma} \vartheta \left( \frac{e^{\Phi_\varsigma - \alpha_\varsigma} + e^{-\Phi_\varsigma + \alpha_\varsigma}}{e^{\alpha_\varsigma} + \beta_\varsigma} + 1 \right)$    (2)

where ϑ is a scaling constant selected for the mapping function. The definition in (2) is a non-linear exponential monotonic mapping function, which suits our model as the performance metrics considered cannot be scaled uniformly, i.e., an equal perceived performance difference does not correspond to an equal numerical difference in the Φς score. The considered QoS metrics, including latency, PLR and PER, have an exponential interdependency with user QoE in the proposed Edge-IoT system. For example, when the QoE value is high, any variation in these metrics will heavily impact the QoE. However, considerable variation in these QoS metrics will not exhibit significant impact if the QoE is low. Thus, an exponential mapping function is able to capture the impact of QoS metrics on QoE, specifically for sensitive applications such as emergency response. In addition, different experiments in the literature demonstrated that exponential mapping outperforms other mapping functions such as linear or logarithmic [49].

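As a minimal numerical sketch of (1) and (2), the snippet below computes one application's cumulative quality score and its QoE contribution. The allocation indicators, metric weights, per-metric quality scores and the scaling constant ϑ are illustrative assumptions, not values taken from the paper.

```python
import math

def cumulative_quality_score(x, w, phi_T, phi_RL, phi_RE):
    """Eq. (1): sum over r, i, j of x[r][i][j] * (w1*Phi_T + w2*Phi_RL + w3*Phi_RE)."""
    per_unit = w[0] * phi_T + w[1] * phi_RL + w[2] * phi_RE
    return sum(x[r][i][j] * per_unit
               for r in range(len(x))
               for i in range(len(x[r]))
               for j in range(len(x[r][i])))

def qoe_term(phi, alpha, beta, theta=1.0):
    """One application's term in Eq. (2); theta plays the role of the scaling constant."""
    numerator = math.exp(phi - alpha) + math.exp(-phi + alpha)
    return theta * (numerator / (math.exp(alpha) + beta) + 1.0)

# Toy example: one resource type, one device, one server; all values are made up.
x = [[[1.0]]]                 # x[r][i][j] = 1 -> one unit of resource r allocated
w = (0.6, 0.2, 0.2)           # weights for the latency, PLR and PER quality scores
phi = cumulative_quality_score(x, w, phi_T=10, phi_RL=10, phi_RE=10)
print(qoe_term(phi, alpha=1, beta=1))   # QoE contribution of this application
```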

C. Problem Formulation

In this section, we formulate the resource allocation optimization problem with the goal of maximizing the QoE in (2) with consideration of all applications. QoE is rewarded when applications' QoS requirements are satisfied, and thereby user satisfaction is achieved through high QoE. The resource allocation optimization problem for QoE maximization is formulated as:

$\max_{(x_{r,i,j}),\,\alpha_\varsigma} \ \mathrm{QoE}$    (3)
s.t. $x_{r,i,j} \ge 0, \quad \forall j, r$    (4)
$\sum_{j}\sum_{r} x_{r,i,j} \le C_j, \quad \forall i$    (5)
$T^{\varsigma} \le T^{max}$    (6)
$R_E^{\varsigma} \le R_E^{max}$    (7)
$R_L^{\varsigma} \le R_L^{max}$    (8)

The edge server's capacity Cj is defined in constraint (5) to confirm that the allocated resources cannot exceed the server capacity. Equations (6), (7) and (8) are the constraints for the QoS metrics T, RE and RL, respectively, to guarantee that they will not exceed the maximum thresholds. Note that ς is used to indicate the performance metric achieved for a certain application. The definition of QoE in (2) is derived as a function of the quality scores for each QoS metric Φ and the resource allocation factor x. The quality scores correspond to the user satisfaction according to the achieved below-threshold QoS metric values, as in constraints (6) to (8). Technically, we try to maximize the achievable QoE for each user under the condition that all other users running different applications achieve maximum QoE. The mutual interest in each resource unit at a certain server for all the users causes the optimization problem to be coupled across them. Moreover, the constraint in (5) makes the optimization problem coupled across all the resources of each edge server. Thus, the convexity of the QoE cannot be guaranteed and, hence, the optimization function in (3) becomes non-convex.

The formulated QoE optimization problem comprises the allocation of resources xr,i,j and the selection of the most appropriate QoS class ας. This is the core of the decision-making problem solved by DRL in the next section.

IV. TWO-STAGE DEEP REINFORCEMENT LEARNING SCHEME FOR RESOURCE ALLOCATION IN EDGE-IOT

In this section, we illustrate the two-stage DRL scheme built to allocate resources from the edge to the IoT applications. First, the rationale and an overview of the scheme are presented. Then, the two stages of DRL are illustrated.

A. Scheme Rationale and Overview

The DRL for resource allocation is implemented in the RAM module in the controller of DeepEdge. Despite the fact that DRL has the potential to solve the resource allocation problem in the Edge-IoT domain, diversity in action exploration remains a major challenge for DRL in such an environment with a large state/action space and sparse reward values. It is infeasible to rely on a simple look-up table of state/action pairs and Q-values, i.e., it is necessary to approximate the Q-value to minimize the complexity of the scheme and account for state/action dimensionality. The sparse reward values can lead the DRL to a sub-optimal resource allocation policy. In addition, it is necessary to utilize multi-dimensional data and analyze it to determine the best resource allocation policy. The multi-dimensional data comprises edge server resource availability, IoT applications' resource demands, and applications' QoS requirements. To tackle these challenges and leverage the multi-dimensional data for resource allocation, we build a novel two-stage DRL scheme that has the following merits: 1) It exploits a DNN to enhance action exploration by mapping the Edge-IoT system state to joint actions of resource allocation and QoS class selection. The Q-value of DRL is approximated as a function of a smaller set of variables to tackle the large state/action space of the Edge-IoT environment. This distinguishes our scheme from typical DRL schemes, which utilize a DNN to approximate the value function using temporal difference and train the DNN accordingly. 2) The actions generated during exploration are ranked according to their Q-values to avoid the equal probability of action selection. This ranking is used to select the DNN training data and balance exploration and exploitation of actions. The balance is achieved using an effective action selection probability, which is varied as a graded function of the Q-value using the Boltzmann distribution [50] such that the best action is given the highest selection probability. 3) It exploits information about the QoS requirements of the heterogeneous applications, the resource demands, and the resource availability in action generation. 4) The scheme generates joint actions including resource allocation with a certain QoS class.

Our DRL scheme consists of two stages: 1) action exploration and evaluation; 2) action exploitation and DNN training. In the first stage, we employ a DNN to generate joint resource allocation and QoS class selection actions. After the generation of the joint actions, reinforcement learning is engaged to evaluate the joint actions and select the ones that have the maximum Q-value, which is defined based on the achieved QoE described earlier in the system model. In the second stage, the joint actions with the highest Q-value during exploration are exploited and stored in a replay memory. The memory is used to train the DNN and update its parameters such that the actions generated in the next iteration are improved. The overview of the two-stage DRL mechanism is shown in Fig. 2.

Fig. 2. DeepEdge two-stage DRL scheme overview.

More specifically, in the first stage, the scheme generates joint actions based on the DNN's current action policy πθt, parameterized by θt, which is the weight that connects the hidden neurons in the DNN. Then, the actions generated during exploration are evaluated using the proposed approximated Q-value.

In the second stage, the best joint actions are selected among the generated actions with a certain action selection probability. The state and the corresponding selected joint action with the highest Q-value, (St, Xt*), are added to a replay memory. The action policy of the DNN is updated by fetching a batch of training samples to train the DNN. After training, the DNN updates its weighting parameter θt to θt+1 and the action policy πθt+1. The new action policy πθt+1 will be exploited in the next iteration to generate joint actions at t + 1. Such a reinforcement learning iteration allows the DNN to continuously improve the quality of the actions generated.

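The interaction between the two stages can be summarized by the skeleton below. It is a simplified sketch only: the environment interface, the DNN policy and the approximated Q-value are placeholder objects assumed for illustration, and details such as the KNN discretization and the Adam training on the cross-entropy loss are abstracted behind method calls rather than implemented as in DeepEdge.

```python
import random
from collections import deque

def two_stage_drl(env, dnn, q_approx, epochs=1000, train_interval=10, batch_size=32):
    """Skeleton of the two-stage scheme: explore/evaluate, then exploit/train.

    env, dnn and q_approx are assumed interfaces (not defined in the paper):
    env.observe/discretize/apply, dnn.generate_actions/train, q_approx.value.
    """
    replay_memory = deque(maxlen=1000)      # stores (state, best joint action) pairs
    for t in range(epochs):
        state = env.observe()               # demands, QoS requirements, capacities
        # Stage 1: the DNN maps the state to candidate joint actions (allocation +
        # QoS class); candidates are discretized (e.g., via KNN) and scored with the
        # approximated Q-value.
        candidates = env.discretize(dnn.generate_actions(state))
        scored = [(q_approx.value(state, a), a) for a in candidates]
        # Stage 2: exploit the best-ranked action, store it, and periodically retrain
        # the DNN from the replay memory so the next round of exploration improves.
        best_q, best_action = max(scored, key=lambda pair: pair[0])
        env.apply(best_action)
        replay_memory.append((state, best_action))
        if t % train_interval == 0 and len(replay_memory) >= batch_size:
            dnn.train(random.sample(list(replay_memory), batch_size))
    return dnn
```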

B. Deep Reinforcement Learning Stages

In this section, we illustrate the two stages of DRL for resource allocation joint action generation.

1) First Stage (Action Exploration and Evaluation): In this stage, the DNN receives the Edge-IoT state St information at time t, defined as the IoT applications' resource demand yi, the QoS requirements and the resources available at the edge, $S_t = \{y_i, R_L^{max}, R_E^{max}, T^{max}, C_j\}$. According to the current action policy, denoted as πθt : {St} → Xt, a set of joint actions is generated by the DNN and denoted by a mapping fθt as follows,

$X_t = f_{\theta_t}(S_t)$    (9)

where Xt = {Xkt, k = 1, 2, . . . , K}, and Xkt = {xtr,i,j, αtς} is the kth entry of Xt. Each entry in Xt is a joint action and is assumed to be continuous. The universal approximation theorem states that if the hidden layers have a large number of hidden neurons and a proper activation function is applied at the neurons, they are sufficient to approximate any continuous mapping f [51]. We exploit ReLU as the activation function [52] of the hidden layers, where the output b and input v of a neuron are related by b = max{v, 0}. In the output layer, we use the sigmoid activation function b = 1/(1 + e^{-v}). It is necessary to map the set of joint actions Xt to a discrete action set such that the actions can be evaluated by the reinforcement learning Q-value function. We employ the typical K-nearest-neighbors (KNN) algorithm [53] to do the mapping. After obtaining the candidate discrete joint actions from KNN, the performance of these actions is evaluated using reinforcement learning. The action evaluation is conducted based on the QoE optimization objective defined in (3).

We assume that the Edge-IoT environment evolves as a discrete-time Markov decision process (DTMDP). The maximization problem in (3) falls within the domain of a DTMDP. In order to find the optimal action policy, we define a DTMDP that associates an action with every Edge-IoT state, a state transition and a reward function. The state transitions and actions occur at discrete time epochs. The DeepEdge controller monitors the Edge-IoT state St in the current epoch t and generates discrete joint actions Xt, which are found using the DNN. A reward is generated for each joint action Xt at the end of the epoch. The reward function Rt is selected to be the QoE defined in (2). The formal expression for the DTMDP is given as (S, X, T, R), where T : S × X × S′ → [0, 1] is a state transition probability function. Ultimately, the objective of DRL integrated with the DTMDP is to find an optimal joint action Xt that maximizes the QoE in (2). The Q-value of reinforcement learning, which is exploited to evaluate the joint action, is defined as the current expected reward plus a future discounted reward as follows,

$Q^*(S_t, X_t) = \mathbb{E}\left[ R(S_t, X_t) + \phi \max_{X' \in X_t} Q^*(S_t', X_t') \right]$    (10)

where ϕ ∈ (0, 1] is the discount factor. The optimal Q-value Q*(St, Xt) is updated by the change in the Q-value according to the transition from state St to state St′ under the action Xt at epoch t as follows,

$Q^{t+1}(S_t, X_t) = \left(1 - \mu^t\right) Q(S_t, X_t) + \mu^t \left[ R(S_t, X_t) + \phi \max_{X' \in X} Q(S_t', X_t') \right]$    (11)

where μ ∈ [0, 1] is the learning rate. Reinforcement learning is a stochastic approximation method that solves the Bellman optimality equation associated with the DTMDP. It does not require a state transition probability model, as it converges with probability one to a solution if $\sum_{t=1}^{\infty} \mu^t$ is infinite, $\sum_{t=1}^{\infty} (\mu^t)^2$ is finite, and all state/action pairs are visited infinitely often [54].

One of the main shortcomings of using the Q-value for action evaluation in the dynamic Edge-IoT environment is the large state space. It is not feasible to use state/action tables and find the corresponding Q-value in such an environment for action evaluation. Thus, it is necessary to approximate the Q-value. This approximation reduces the complexity of the system and enhances its convergence. We therefore approximate the Q-value as a function of a smaller set of variables, in which the Q-value utilizes a countable state space S* via the function $\tilde{Q} : S^* \times X$. This function is referred to as a function approximator. The vector ρ = {ρp}, p = 1, . . . , P, is exploited to approximate the Q-value by minimizing the difference metric between Q*(St, Xt) and $\tilde{Q}(S_t, X_t, \rho)$ for all (St, Xt) ∈ S* × X. Thus, the approximated value is formalized as $\tilde{Q}(S_t, X_t, \rho) = \sum_{p=1}^{P} \rho_p \psi_p(S_t, X_t) = \rho\, \psi^{T}(S_t, X_t)$, where T denotes the transpose operator and the vector ψ(St, Xt) = [ψp(St, Xt)], p = 1, . . . , P, with a scalar function ψp(St, Xt) that is identified as the basis function (BF) over S* × X, and ρp (p = 1, . . . , P) are the associated weights. We use the Stochastic Gradient Descent (SGD) method to update the weights. The Q-value update rule in (11) is redefined as follows,

$\rho_{t+1} \psi^{T}(S_t, X_t) = \left(1 - \mu^t\right) \rho_t \psi^{T}(S_t, X_t) + \mu^t \left[ R(S_t, X_t) + \phi \max_{X' \in X} \rho_t \psi^{T}(S_t', X_t') \right] \psi(S_t, X_t)$    (12)

where the gradient is a vector of partial derivatives with respect to the elements of ρt.

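A compact sketch of the linear function approximator ρψ^T(St, Xt) and of an SGD-style weight update in the spirit of (12) is given below. The basis functions are a toy choice and the numeric state/action encoding is an assumption; the update shown is the standard semi-gradient temporal-difference step that such a linear approximator typically uses, not the paper's exact implementation.

```python
import numpy as np

class ApproximatedQ:
    """Q(s, a) ~ rho . psi(s, a): linear approximation over P basis functions."""

    def __init__(self, num_basis, learning_rate=0.1, discount=0.9):
        self.rho = np.zeros(num_basis)   # weights rho_1 .. rho_P
        self.mu = learning_rate          # mu: learning rate
        self.phi = discount              # phi: discount factor

    def basis(self, state, action):
        """Toy basis functions psi_p(s, a): bias, raw features and their squares.

        num_basis must not exceed 1 + 2 * (state dim + action dim).
        """
        z = np.concatenate([np.asarray(state, float), np.asarray(action, float)])
        return np.concatenate([[1.0], z, z ** 2])[: len(self.rho)]

    def value(self, state, action):
        return float(self.rho @ self.basis(state, action))

    def update(self, state, action, reward, next_state, next_actions):
        """Semi-gradient TD step that moves rho toward the bootstrapped target."""
        target = reward + self.phi * max(self.value(next_state, a) for a in next_actions)
        psi = self.basis(state, action)
        td_error = target - self.value(state, action)
        self.rho += self.mu * td_error * psi
```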

2) Second Stage (Action Exploitation and DNN Training): The action with the highest Q-value, Xt*, must be exploited among the other actions of state St and added to the replay memory to train the DNN. The replay memory is populated with the state/action pairs that have the highest Q-values over a certain number of iterations. The action exploitation is accomplished through determination of the action policy πςt, which is defined as the probability of selecting action Xt at state St. It corresponds to the set of actions with the highest Q-value. The attainment of this policy is tied to resolving the exploration vs. exploitation tradeoff. Exploration aims to look for new joint actions so that the scheme does not only utilize the actions known to achieve high Q-values. Exploitation is the process of using the good actions available. The most common method to balance exploration and exploitation is ε-greedy selection [48], where ε is the portion of the time that a learning agent takes a randomly selected action instead of taking the action that is most likely to maximize its reward given the actions available. However, ε-greedy selects equally among the available actions, i.e., the worst action is as likely to be chosen as the best one. In order to overcome this issue, we develop a new method in which the action selection probabilities are varied as a graded function of the Q-value. The best joint action is given the highest selection probability, while the others are ranked according to their Q-values. The Boltzmann distribution [50] is adopted to achieve this ranking. The action selection probability at epoch t is given as follows,

$\pi_\varsigma^*(S_t, X_t) = \frac{e^{Q(S_t, X_t)/\tau}}{\sum_{X' \in X} e^{Q(S_t, X')/\tau}}$    (13)

where τ is a positive parameter. When τ takes a high value, the action probabilities are nearly equal; when τ takes a low value, there is a big difference in selection probabilities between actions with different Q-values. This action selection probability is updated after the Q-value approximation as follows,

$\pi_\varsigma^*(S_t, X_t) = \frac{e^{\rho_t \psi^{T}(S_t, X_t)/\tau}}{\sum_{X' \in X} e^{\rho_t \psi^{T}(S_t, X')/\tau}}$    (14)

The selected state/action pairs are added to the memory at each epoch and utilized later to train the DNN. This improves the upcoming joint actions that will be generated by the DNN in future epochs. To achieve this, DeepEdge maintains an initially empty memory of limited capacity. At the t-th epoch, a new training sample (St, Xt*) is added to the memory. When the memory is full, the newly generated data sample replaces the oldest one. The experience replay technique [42], [55] is utilized to train the DNN using the stored data samples. After a certain number of epochs, when there is enough data to train the DNN, we randomly select a group of training data samples {(Sυ, Xυ*)|υ ∈ Υt} from the memory, where Υ is the set of selected time indices. The DNN parameter θt is updated using the Adam algorithm [56], which targets minimization of the average cross-entropy loss L(θt) defined as follows,

$L(\theta_t) = -\frac{1}{|\Upsilon_t|} \sum_{\upsilon \in \Upsilon_t} \left[ (X_\upsilon^*)^{T} \log f_{\theta_t}(S_\upsilon) + (1 - X_\upsilon^*)^{T} \log\left(1 - f_{\theta_t}(S_\upsilon)\right) \right]$    (15)

where |Υt| is the size of Υt, T denotes the transpose operator, and the log function is the element-wise logarithm operation for a vector. We start the training step when the number of samples is larger than half of the memory size. Eventually, the DNN learns the best joint action for each state (St, Xt*). Thus, it becomes smarter and continuously improves its produced joint actions.

The two-stage DRL procedure for resource allocation is presented in Algorithm 1. The algorithm acquires the Edge-IoT state information, which includes the QoS requirements, resource demand and edge servers' resource capacity. It starts by initializing the DNN with a certain parameter θ. The DNN generates the joint actions. The output of the DNN is converted to a discrete format and then received by the approximated reinforcement learning to evaluate the actions generated by the DNN. The actions with the highest Q-value are exploited according to the probability in (14) and used to populate the dedicated memory of the DNN. After a certain number of epochs, a sample of state/action pairs is fetched from the memory and used for DNN training and for updating θ using the Adam algorithm.

Algorithm 1: Two-Stage Deep Reinforcement Learning Algorithm to Solve Resource Allocation in DeepEdge
Require: Network state St, which includes the QoS requirements of the application, the resource demands, and the edge servers' resource capacity at each epoch t
Ensure: Joint action for resource allocation and QoS metric class Xt = {Xt*, αςt}
1: BEGIN
2: Initialize the DNN with random parameters θt and an empty replay memory
3: Set the iteration number m and the training interval Ω
4: for t = 1 to m do
5:   Generate a set of joint actions Xt = fθt(St)
6:   Use KNN to convert the continuous set of actions into a discrete set
7:   Run the approximated reinforcement learning to evaluate the actions for resource allocation, which must satisfy Xt* = maxX ρψT(St, Xt)
8:   Exploit actions according to (14)
9:   Update the memory by adding (St, Xt*)
10:  if Ω = 1 then
11:    Uniformly select a group of data samples {(Sυ, Xυ)|υ ∈ Υt} from the memory
12:    Train the DNN with {(Sυ, Xυ)|υ ∈ Υt} and update θt using the Adam algorithm
13:  end if
14: end for
15: END

The complexity of the proposed two-stage DRL is found based on the number of edge servers J, the number of available resources of a certain type r, and the number of devices that demand the resources N. The implementation of the DRL algorithm considers different applications and scenarios. It associates action generation for each device with the available resources and edge servers. The computational complexity of the action exploration stage of the DRL is O(JN^r) operations. The complexity of the exploitation and training stage is O(mΩ), according to the number of epochs m and the training interval Ω. The memory requirement to store the samples for DNN training is N^(r·J). Exploration and exploitation are achieved with the merit of the approximated Q-value, $O(\tilde{Q}_{\theta_t}(S_t, X_t, \rho))$, instead of the typical Q-value in traditional Q-learning. The computational complexity of our proposed two-stage DRL is acceptable given the achieved performance and in comparison with traditional Q-learning, which has an exponential computational complexity of O(N^(J·r)). Traditional Q-learning may only achieve the maximum achievable QoE by searching all possible combinations of states, actions and rewards. Consequently, it requires more operations and its computational complexity escalates in an exponential pattern.

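Referring back to the selection probabilities in (13) and (14), the short sketch below ranks candidate joint actions with a Boltzmann (softmax) distribution over their Q-values. In DeepEdge the Q-values would be the approximated ρtψ^T(St, Xt) scores; the numbers used here are made up for illustration.

```python
import numpy as np

def boltzmann_probabilities(q_values, tau=1.0):
    """Eq. (13)/(14): pi(a) = exp(Q(s,a)/tau) / sum_a' exp(Q(s,a')/tau).

    A large tau gives nearly uniform probabilities; a small tau strongly favors
    the highest-Q action, but unlike epsilon-greedy it never picks uniformly
    at random among all actions.
    """
    q = np.asarray(q_values, dtype=float)
    scaled = (q - q.max()) / tau          # subtract the max for numerical stability
    exp_q = np.exp(scaled)
    return exp_q / exp_q.sum()

# Made-up Q-values for four candidate joint actions.
probs = boltzmann_probabilities([2.0, 1.5, 0.2, -1.0], tau=0.5)
action_index = np.random.choice(len(probs), p=probs)
print(probs, action_index)
```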

V. PERFORMANCE EVALUATION

We evaluate the performance of the proposed DeepEdge for resource allocation in Edge-IoT with respect to the average application latency, the achieved users' QoE, and the average application task success ratio.

TABLE III: System Parameters.

A. Evaluation Setup

We consider a network that consists of 10 edge servers uniformly distributed in the network. Each server is equipped with a 3-core CPU, where the CPU cycle frequency of each core is 3 × 10^9 cycles per second. The frame length is 600 symbols, where the time of one symbol is 4.5 μs. The blocklengths of the uplinks are all assumed to be equal to 200 symbols. The number of IoT users is assumed as N ∈ [100, 400], randomly distributed within the network. The bandwidth available for sharing is set to 10 MHz. Applications' latency requirements and data sizes, as well as the corresponding CPU cycles, are determined by the specific IoT application type. We consider the three applications described in the system model and their corresponding QoS requirements. The DRL parameters and the rest of the simulation parameters are presented in Table III. The application task data size is set as a uniform distribution over [2, 8] MB, and the corresponding CPU cycles are variable.

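For reference, the setup described above could be captured in a small configuration sketch like the one below. Values not stated in the text (such as the random seed) are arbitrary choices, and this is not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is an arbitrary choice, not from the paper

# Illustrative reconstruction of the evaluation setup described above.
setup = {
    "num_edge_servers": 10,
    "cpu_cores_per_server": 3,
    "cpu_cycles_per_core_hz": 3e9,
    "frame_length_symbols": 600,
    "symbol_time_s": 4.5e-6,
    "uplink_blocklength_symbols": 200,
    "bandwidth_hz": 10e6,
}

num_users = int(rng.integers(100, 401))                  # N in [100, 400]
task_data_size_mb = rng.uniform(2, 8, size=num_users)    # uniform [2, 8] MB per task
```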

B. Application Latency Evaluation

We evaluate the performance of DeepEdge in terms of the average encountered latency for a certain application by varying the number of IoT users starting from 100 users, where 50% of the users run the evaluated application and 50% run the other two applications, with 25% of the users for each. Figs. 3, 4 and 5 present the average latency for the three applications, emergency response, health monitoring and personal identification, respectively, with a variable number of IoT users. The achieved latency is compared to the resource allocation schemes DQN-based (AD) [31] and actor-critic (DR-Learning) [32] presented in the related work. We increase the number of users from 100 to 400 in steps of 100 to show the change in latency. The results show that our proposed DeepEdge achieves the best result in terms of application latency compared with the other schemes. Moreover, the latency is maintained low in comparison to the other schemes even with a large number of users involved.

Fig. 3. Latency for emergency response application.
Fig. 4. Latency for health monitoring application.
Fig. 5. Latency for personal identification application.

In addition, we compare the performance of DeepEdge in terms of latency against the optimal exhaustive search resource allocation for the three applications: emergency response, health monitoring and personal identification. Exhaustive search requires searching through all the possible resource allocation possibilities. It is impractical in the considered Edge-IoT applications, where the search becomes complicated and consumes significant time as the system scale grows in terms of the numbers of IoT devices, edge servers and edge resources. Table IV presents the recorded latency for 20 devices, which is a small number given the mentioned applications. We only notice a minimal difference in the latency of the different applications between DeepEdge and the optimal scheme given the small number of devices.

TABLE IV: Evaluations of DeepEdge vs Optimal Exhaustive Search.

C. Evaluation of Various DeepEdge Resource Allocation Scenarios

In this subsection, we discuss and evaluate multiple scenarios of how DeepEdge operates to perform resource allocation with QoE maximization. Fig. 6 depicts the scenarios of resource allocation for multiple heterogeneous applications.

Fig. 6. DeepEdge resource allocation scenarios.

The first scenario has 100 IoT users which run the emergency response application with high QoS requirements. The resource requests of the emergency response application are sent to the RAM in the controller. The request is processed through the two-stage DRL by selecting the most appropriate QoS class ας and allocating edge resources accordingly. In the second scenario (application heterogeneity), it is assumed that each one of 200 IoT users runs two applications (emergency response and personal identification), which lets the controller treat all the IoT users the same. The RAM here receives requests from the same user but for multiple applications. It recognizes the application type ς, identifies the applications' priorities βς and analyzes their QoS requirements. Then, it enforces the application QoS class adaptation starting with the lower priority application. For example, the QoS class of the personal identification application, which has the lowest priority, will be adapted first through proper selection of its ας. The two-stage DRL allocates the resources for both applications with the goal of maximizing the QoE in (3). In the third scenario (user and application heterogeneity), we present two evaluation examples. In the first example, there are 300 heterogeneous IoT users, of which 100 users run emergency response, 100 users run health monitoring, and 100 users run two applications, emergency response and health monitoring. In the second example, there are 400 IoT users, of which 100 users run emergency response, 100 users run health monitoring, 100 users run personal identification, and 100 users run the three applications simultaneously. All the users report their requests along with the QoS requirements of their applications to the RAM at the controller. All the requests are sorted according to the user index i and application type ς. Then, the RAM allocates resources to these applications with consideration of the application priority βς and the resource availability at the edge. These parameters are exploited by the two-stage DRL to adapt the QoS class αi,ς and allocate resources accordingly, with the goal of maximizing the joint QoE for all users and the satisfaction of their applications. Table V presents the specifications of the three scenarios, the QoS metric requirements and the average metrics achieved by DeepEdge for each application. We observe that DeepEdge always maintains the QoS metrics below the specified thresholds even in the most complicated setting of the third scenario.

TABLE V: Evaluations of DeepEdge Resource Allocation Scenarios.

Moreover, QoE is evaluated with consideration of the different scenarios presented in Table V to demonstrate DeepEdge's capability to tackle the heterogeneity of IoT applications in resource allocation. The QoE function derived in (2) is exploited as an evaluation metric to demonstrate the merit of the proposed two-stage DRL against other DRL schemes: the DQN-based scheme (AD) [31] and the actor-critic scheme (DR-Learning) [32]. However, the QoE function for the AD and DR-Learning schemes is calculated using the quality score of the application latency only (not including the quality scores for PLR and PER), as it is the only QoS metric they considered as an optimization goal. The average QoE is plotted in Fig. 7.

Fig. 7. QoE for multiple applications scenarios.


TABLE VI
E VALUATIONS OF RUNTIME FOR A LL S CHEMES
IN D IFFERENT S CENARIOS

Fig. 9. Average task success ratio vs. arrival rate.

Fig. 8. Average task success ratio convergence.

running on a large number of devices. We notice that, in the first scenario, the other schemes achieve comparable performance, as only one application is running. The QoE decreases as the number of users increases, which is expected since the competition between users for resources increases. However, the drop of QoE in DeepEdge as the number of users increases with a variety of applications is not significant in comparison to the other schemes.
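To make the scoring concrete, the following minimal Python sketch illustrates how a cumulative QoE value of this kind can be assembled from per-metric quality scores for latency, PLR, and PER. The linear score mapping, the weights, the field names, and the priority-weighted aggregation are illustrative assumptions only; they are not the exact form of the QoE function in (2).

```python
# Minimal sketch of a cumulative QoE score built from per-metric quality
# scores. The linear mapping, weights, and record fields are assumptions
# used for illustration, not the paper's exact QoE model in (2).

def quality_score(measured, threshold):
    """Map a QoS metric to [0, 1]: 1 when far below its threshold, 0 when violated."""
    if measured >= threshold:
        return 0.0
    return 1.0 - measured / threshold

def app_qoe(latency, plr, per, req, weights=(0.5, 0.25, 0.25)):
    """Cumulative QoE of one application given its QoS class requirements `req`."""
    scores = (
        quality_score(latency, req["latency"]),
        quality_score(plr, req["plr"]),
        quality_score(per, req["per"]),
    )
    return sum(w * s for w, s in zip(weights, scores))

def joint_qoe(apps):
    """Priority-weighted average QoE over all running applications (the optimization goal)."""
    total_priority = sum(a["priority"] for a in apps)
    return sum(a["priority"] * app_qoe(a["latency"], a["plr"], a["per"], a["req"])
               for a in apps) / total_priority

# Scoring only the latency term, i.e., weights=(1.0, 0.0, 0.0), mirrors how the
# AD and DR-Learning baselines are evaluated for the comparison in Fig. 7.
```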
In addition, we compare the system runtime under each scenario's settings for all schemes. Table VI presents the runtime in seconds for each scheme in each scenario, corresponding to the achieved QoE in Fig. 7. We notice that DeepEdge records the lowest runtime in comparison to the other resource allocation schemes in all scenarios, with a considerable difference in the most complicated scenario with 400 devices, thanks to the enhanced design of the developed two-stage DRL.

TABLE VI. Evaluations of runtime for all schemes in different scenarios.

D. Task Success Ratio

Another evaluation factor considered in this paper is the task success ratio, which is the ratio of the application's tasks with satisfied QoS requirements to the total number of running application tasks. We adopt the settings of the second scenario in Table V. Fig. 8 presents the average task success ratio of the proposed DeepEdge with an average application resource request rate of 0.5. It is observed that the performance improves gradually with learning, as the system becomes familiar with the environment and capable of making better resource allocation decisions. Moreover, we evaluate the task success ratio against variable task arrival rates in Fig. 9. It shows that DeepEdge outperforms the other allocation schemes and maintains the success ratio above 0.9 regardless of the increase in the task arrival rate, whereas the other schemes' success ratios fall dramatically as the task arrival rate grows.

Fig. 8. Average task success ratio convergence.
Fig. 9. Average task success ratio vs. arrival rate.
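The metric itself is straightforward to compute; the short sketch below states it directly, with hypothetical task records and thresholds used purely for illustration.

```python
# Minimal sketch of the task success ratio: the fraction of an application's
# tasks whose QoS requirements were all satisfied. The task record fields and
# the thresholds below are illustrative, not DeepEdge's actual data structures.

def task_satisfied(task, req):
    """A task succeeds only if every tracked QoS metric meets its requirement."""
    return (task["latency"] <= req["latency"]
            and task["plr"] <= req["plr"]
            and task["per"] <= req["per"])

def task_success_ratio(tasks, req):
    """Ratio of satisfied tasks to the total number of running tasks."""
    if not tasks:
        return 0.0
    return sum(task_satisfied(t, req) for t in tasks) / len(tasks)

# Example: three tasks evaluated against a 50 ms / 1% / 1% requirement.
req = {"latency": 50.0, "plr": 0.01, "per": 0.01}
tasks = [
    {"latency": 20.0, "plr": 0.002, "per": 0.001},
    {"latency": 35.0, "plr": 0.004, "per": 0.003},
    {"latency": 80.0, "plr": 0.002, "per": 0.001},  # violates the latency bound
]
print(task_success_ratio(tasks, req))  # -> 0.666...
```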
E. Convergence and Training Evaluation and Discussion

We evaluate the performance of the DNN utilized in DeepEdge in terms of its training losses. The evaluation shows the training quality of the DNN in DeepEdge as the resource allocation proceeds. We plot the training loss rate of our proposed DRL in Fig. 10 and compare it to the other allocation schemes. The training loss rate gradually decreases and stabilizes at around 0.04. We clearly notice that the DRL developed in DeepEdge converges faster and with a lower training loss rate compared to the DQN in AD [31] and the actor-critic in DR-Learning [32]. The convergence speed of DeepEdge is evaluated in terms of the achieved normalized QoE with respect to the number of epochs, as in Fig. 11. We observe that the moving average QoE of DeepEdge gradually converges to the maximum. Specifically, the achieved average QoE exceeds 0.98, and the variance gradually decreases to zero as the number of iterations grows. We adopt the settings of the second scenario in Table IV for the evaluation of the training losses and convergence.

Fig. 10. Normalized training loss rate.
Fig. 11. Normalized QoE for DeepEdge.
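The moving-average curves reported here can be reproduced with a simple smoothing pass over the per-epoch measurements; the sketch below is one such rendering, with an illustrative smoothing factor rather than the exact averaging used for Figs. 10 and 11.

```python
# Minimal sketch of the convergence curves: an exponentially weighted moving
# average of a per-epoch metric such as the normalized QoE (Fig. 11) or the
# training loss (Fig. 10). The smoothing factor beta is an illustrative choice.

def moving_average(values, beta=0.9):
    """Return the bias-corrected EWMA curve of a per-epoch metric."""
    curve, avg = [], 0.0
    for t, v in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * v
        curve.append(avg / (1 - beta ** t))  # bias correction for early epochs
    return curve

# Example: a QoE trace that improves during training and then stabilizes.
qoe_per_epoch = [0.60, 0.72, 0.81, 0.90, 0.95, 0.97, 0.98, 0.99, 0.99, 0.99]
print(moving_average(qoe_per_epoch)[-1])  # approaches the converged value
```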

The evaluation of DeepEdge shows that it outperforms the other resource allocation schemes. The reason is the consideration of multiple heterogeneous applications in the proposed QoE model, which aims to guarantee IoT users' satisfaction through the fulfillment of the different applications' QoS requirements. The alignment of the IoT applications' requirements with the available resources at the edge also makes a tremendous contribution to the achieved performance. In addition, the developed two-stage DRL adds the following advantages to DeepEdge. 1) It benefits from historical actions to foster the framework's experience. 2) It generates joint actions and enhances the diversity of actions at the exploration stage using the DNN. 3) The Q-value approximation reduces the complexity of the system, which can be noticed in the convergence speed in comparison to the other schemes.
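These three mechanisms can be pictured with a short sketch. The snippet below is a minimal, illustrative rendering rather than the DeepEdge implementation: a replay buffer stands in for the reuse of historical allocation decisions, an epsilon-greedy selector for the exploration over DNN-generated joint (QoS class, resource allocation) actions, and a one-step bootstrapped target for the Q-value approximation that the DNN performs. The q_values function, the joint action set, and the state encoding are placeholders.

```python
# Illustrative sketch of the mechanisms listed above; not DeepEdge's code.
import random
from collections import deque

class ReplayBuffer:
    """Keeps historical (state, action, reward, next_state) allocation decisions (point 1)."""
    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        return random.sample(self.memory, min(batch_size, len(self.memory)))

def select_joint_action(state, joint_actions, q_values, epsilon=0.1):
    """Epsilon-greedy choice over joint (QoS class, resource allocation) actions (point 2)."""
    if random.random() < epsilon:
        return random.choice(joint_actions)                      # explore for action diversity
    return max(joint_actions, key=lambda a: q_values(state, a))  # exploit the approximator

def q_target(reward, next_state, joint_actions, q_values, gamma=0.9):
    """One-step bootstrapped target; a DNN approximates Q instead of a full table (point 3)."""
    best_next = max(q_values(next_state, a) for a in joint_actions)
    return reward + gamma * best_next
```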
VI. CONCLUSION

Edge computing has come into practice as a potential solution to tackle the resource demands of IoT applications in a fast manner. Resource allocation in the context of edge computing becomes important, as there can be many heterogeneous IoT applications competing for limited resources at the edge. This paper has tackled the resource allocation problem in the Edge-IoT environment in a way that fulfills the IoT applications' requirements and maximizes IoT users' satisfaction. We developed the DeepEdge framework, which comprises a novel QoE model to quantify the IoT users' satisfaction based on the QoS requirements of their applications. DeepEdge employs a novel two-stage DRL scheme, which learns by reinforcement a resource allocation policy that maximizes users' QoE and tunes the application requirements to align with the available edge resources. Moreover, DeepEdge exploits a DNN to generate joint actions and utilizes historical allocation decisions to improve its generated actions and expedite the system convergence. Evaluation results demonstrate DeepEdge's capability in optimizing users' QoE and keeping the task success ratio at the maximum.

REFERENCES

[1] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016.
[2] J. Pan and J. McElhannon, "Future edge cloud and edge computing for Internet of Things applications," IEEE Internet Things J., vol. 5, no. 1, pp. 439–449, Feb. 2018.
[3] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, "Fog computing and its role in the Internet of Things," in Proc. 1st MCC Workshop Mobile Cloud Comput., 2012, pp. 13–16.
[4] X. Chen, L. Jiao, W. Li, and X. Fu, "Efficient multi-user computation offloading for mobile-edge cloud computing," IEEE/ACM Trans. Netw., vol. 24, no. 5, pp. 2795–2808, Oct. 2016.
[5] S. Wang, R. Urgaonkar, M. Zafer, T. He, K. Chan, and K. K. Leung, "Dynamic service migration in mobile edge-clouds," in Proc. IEEE IFIP Netw. Conf. (IFIP Networking), 2015, pp. 1–9.
[6] R. Ranjan, B. Benatallah, S. Dustdar, and M. P. Papazoglou, "Cloud resource orchestration programming: Overview, issues, and directions," IEEE Internet Comput., vol. 19, no. 5, pp. 46–56, Sep./Oct. 2015.
[7] L. Liu, Z. Chang, X. Guo, S. Mao, and T. Ristaniemi, "Multi-objective optimization for computation offloading in fog computing," IEEE Internet Things J., vol. 5, no. 1, pp. 283–294, Feb. 2018.
[8] X. Sun and N. Ansari, "Latency aware workload offloading in the cloudlet network," IEEE Commun. Lett., vol. 21, no. 7, pp. 1481–1484, Jul. 2017.
[9] S. Sardellitti, G. Scutari, and S. Barbarossa, "Joint optimization of radio and computational resources for multicell mobile-edge computing," IEEE Trans. Signal Inf. Process. Netw., vol. 1, no. 2, pp. 89–103, Jun. 2015.
[10] X. Lyu, H. Tian, C. Sengul, and P. Zhang, "Multiuser joint task offloading and resource optimization in proximate clouds," IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3435–3447, Apr. 2017.
[11] X. Chen, W. Li, S. Lu, Z. Zhou, and X. Fu, "Efficient resource allocation for on-demand mobile-edge cloud computing," IEEE Trans. Veh. Technol., vol. 67, no. 9, pp. 8769–8780, Sep. 2018.
[12] H. Zhang, Y. Xiao, S. Bu, D. Niyato, F. R. Yu, and Z. Han, "Computing resource allocation in three-tier IoT fog networks: A joint optimization approach combining Stackelberg game and matching," IEEE Internet Things J., vol. 4, no. 5, pp. 1204–1215, Oct. 2017.
[13] L. Huang, X. Feng, C. Zhang, L. Qian, and Y. Wu, "Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing," Digit. Commun. Netw., vol. 5, no. 1, pp. 10–17, 2019.
[14] J. Li, H. Gao, T. Lv, and Y. Lu, "Deep reinforcement learning based computation offloading and resource allocation for MEC," in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), 2018, pp. 1–6.
[15] F. D. Vita, D. Bruneo, A. Puliafito, G. Nardini, A. Virdis, and G. Stea, "A deep reinforcement learning approach for data migration in multi-access edge computing," in Proc. ITU Kaleidoscope Mach. Learn. 5G Future (ITU K), 2018, pp. 1–8.
[16] J. Wang, J. Hu, G. Min, W. Zhan, Q. Ni, and N. Georgalas, "Computation offloading in multi-access edge computing using a deep sequential model based on reinforcement learning," IEEE Commun. Mag., vol. 57, no. 5, pp. 64–69, May 2019.
[17] M. Li, J. Gao, L. Zhao, and X. Shen, "Deep reinforcement learning for collaborative edge computing in vehicular networks," IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 4, pp. 1122–1135, Dec. 2020.
[18] Y. Liu, H. Yu, S. Xie, and Y. Zhang, "Deep reinforcement learning for offloading and resource allocation in vehicle edge computing and networks," IEEE Trans. Veh. Technol., vol. 68, no. 11, pp. 11158–11168, Nov. 2019.
[19] G. Qiao, S. Leng, S. Maharjan, Y. Zhang, and N. Ansari, "Deep reinforcement learning for cooperative content caching in vehicular edge computing and networks," IEEE Internet Things J., vol. 7, no. 1, pp. 247–257, Jan. 2020.
[20] Q. Luo, C. Li, T. H. Luan, and W. Shi, "Collaborative data scheduling for vehicular edge computing via deep reinforcement learning," IEEE Internet Things J., vol. 7, no. 10, pp. 9637–9650, Oct. 2020.
[21] Q. Qi and Z. Ma, "Vehicular edge computing via deep reinforcement learning," 2019, arXiv:1901.04290.
[22] D. Zeng, L. Gu, S. Pan, J. Cai, and S. Guo, "Resource management at the network edge: A deep reinforcement learning approach," IEEE Netw., vol. 33, no. 3, pp. 26–33, May/Jun. 2019.
[23] X. Liu, Z. Qin, and Y. Gao, "Resource allocation for edge computing in IoT networks via reinforcement learning," in Proc. IEEE Int. Conf. Commun. (ICC), Shanghai, China, 2019, pp. 1–6.
[24] J. Zhao, M. Kong, Q. Li, and X. Sun, "Contract-based computing resource management via deep reinforcement learning in vehicular fog computing," IEEE Access, vol. 8, pp. 3319–3329, 2020.
[25] Z. Ning, P. Dong, X. Wang, J. Rodrigues, and F. Xia, "Deep reinforcement learning for vehicular edge computing: An intelligent offloading system," ACM Trans. Intell. Syst. Technol., vol. 10, no. 6, p. 60, 2019.
[26] W. Zhan, C. Luo, J. Wang, G. Min, and H. Duan, "Deep reinforcement learning-based computation offloading in vehicular edge computing," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Waikoloa, HI, USA, 2019, pp. 1–6.

[27] M. Khayyat, I. A. Elgendy, A. Muthanna, A. S. Alshahrani, S. Alharbi, and A. Koucheryavy, "Advanced deep learning-based computational offloading for multilevel vehicular edge-cloud computing networks," IEEE Access, vol. 8, pp. 137052–137062, 2020.
[28] H. Peng and X. Shen, "Deep reinforcement learning based resource management for multi-access edge computing in vehicular networks," IEEE Trans. Netw. Sci. Eng., vol. 7, no. 4, pp. 2416–2428, Oct.–Dec. 2020.
[29] Q. Qi et al., "Knowledge-driven service offloading decision for vehicular edge computing: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 5, pp. 4192–4203, May 2019.
[30] X. Xiong, K. Zheng, L. Lei, and L. Hou, "Resource allocation based on deep reinforcement learning in IoT edge computing," IEEE J. Sel. Areas Commun., vol. 38, no. 6, pp. 1133–1146, Jun. 2020.
[31] J. Wang, L. Zhao, J. Liu, and N. Kato, "Smart resource allocation for mobile edge computing: A deep reinforcement learning approach," IEEE Trans. Emerg. Topics Comput., vol. 9, no. 3, pp. 1529–1541, Jul.–Sep. 2021.
[32] Y. Wei, F. R. Yu, M. Song, and Z. Han, "Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor-critic deep reinforcement learning," IEEE Internet Things J., vol. 6, no. 2, pp. 2061–2073, Apr. 2019.
[33] H. Zhang, W. Wu, C. Wang, M. Li, and R. Yang, "Deep reinforcement learning-based offloading decision optimization in mobile edge computing," in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), 2019, pp. 1–7.
[34] H. Meng, D. Chao, and Q. Guo, "Deep reinforcement learning based task offloading algorithm for mobile-edge computing systems," in Proc. 4th Int. Conf. Math. Artif. Intell. (ICMAI), 2019, pp. 90–94.
[35] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, "Performance optimization in mobile-edge computing via deep reinforcement learning," 2018, arXiv:1804.00514.
[36] T. Yang, Y. Hu, M. C. Gursoy, A. Schmeink, and R. Mathar, "Deep reinforcement learning based resource allocation in low latency edge computing networks," in Proc. Int. Symp. Wireless Commun. Syst. (ISWCS), Lisbon, Portugal, Aug. 2018, pp. 1–5.
[37] Y. Xiao, M. Noreikis, and A. Ylä-Jääski, "QoS-oriented capacity planning for edge computing," in Proc. IEEE Int. Conf. Commun. (ICC), 2017, pp. 1–6.
[38] Z. Ye, S. Mistry, A. Bouguettaya, and H. Dong, "Long-term QoS-aware cloud service composition using multivariate time series analysis," IEEE Trans. Services Comput., vol. 9, no. 3, pp. 382–393, May/Jun. 2016.
[39] R. Mahmud, S. Srirama, K. Ramamohanarao, and R. Buyya, "Quality of experience (QoE)-aware placement of applications in fog computing environments," J. Parallel Distrib. Comput., vol. 132, no. 3, pp. 190–203, Oct. 2019.
[40] Y. Lu, M. Motani, and W.-C. Wong, "A QoE-aware resource distribution framework incentivizing context sharing and moderate competition," IEEE/ACM Trans. Netw., vol. 24, no. 3, pp. 1364–1377, Jun. 2016.
[41] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[42] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[43] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. 33rd Int. Conf. Mach. Learn., Jun. 2016, pp. 1928–1937.
[44] S. Basso, M. Meo, A. Servetti, and J. De Martin, "Estimating packet loss rate in the access through application-level measurements," in Proc. ACM SIGCOMM Workshop Meas. Stack (W-MUST), 2012, pp. 7–12.
[45] B. Han and S. Lee, "Efficient packet error rate estimation in wireless networks," in Proc. 3rd Int. Conf. Testbeds Res. Infrastruct. Develop. Netw. Commun., 2007, pp. 1–9.
[46] K. Nagin, A. Kassis, D. Lorenz, K. Barabash, and E. Raichstein, "Estimating client QoE from measured network QoS," in Proc. 12th ACM Int. Conf. Syst. Storage (SYSTOR), 2019, p. 188.
[47] T. Hoßfeld, P. E. Heegaard, L. Skorin-Kapov, and M. Varela, "Fundamental relationships for deriving QoE in systems," in Proc. 11th Int. Conf. Qual. Multimedia Exp. (QoMEX), 2019, pp. 1–6.
[48] M. Tokic, "Adaptive ε-greedy exploration in reinforcement learning based on value differences," Advances in Artificial Intelligence (Lecture Notes in Computer Science), vol. 6359. Heidelberg, Germany: Springer, 2010.
[49] D. D. Hora, A. Asrese, V. Christophides, R. Teixeira, and D. Rossi, "Narrowing the gap between QoS metrics and Web QoE using above-the-fold metrics," in Passive and Active Measurement. Cham, Switzerland: Springer, 2018.
[50] A. D. Tijsma, M. M. Drugan, and M. A. Wiering, "Comparing exploration strategies for Q-learning in random stochastic mazes," in Proc. IEEE Symp. Series Comput. Intell. (SSCI), Dec. 2016, pp. 1–8.
[51] S. Marsland, Machine Learning: An Algorithmic Perspective. New York, NY, USA: CRC Press, 2015.
[52] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn. (ICML), Jun. 2010, pp. 807–814.
[53] K. Fukunaga and P. M. Narendra, "A branch and bound algorithm for computing k-nearest neighbors," IEEE Trans. Comput., vol. 100, no. 7, pp. 750–753, Jul. 1975.
[54] C. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3, pp. 279–292, 1992.
[55] J. Lin, "Reinforcement learning for robots using neural networks," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Rep. CMU-CS-93-103, 1993.
[56] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–15.

Ismail AlQerm (Member, IEEE) received the Ph.D. degree in computer science from the King Abdullah University of Science and Technology (KAUST) in 2017. He is a Postdoctoral Research Associate with the Department of Computer Science, University of Missouri–St. Louis. His research interests include edge computing, resource allocation in IoT networks, developing machine learning techniques for resource allocation in wireless networks, and software defined radio prototypes. He was among the recipients of the KAUST Provost Award. He is a member of ACM.

Jianli Pan (Senior Member, IEEE) received the M.S. degree in information engineering from the Beijing University of Posts and Telecommunications, China, and the M.S. and Ph.D. degrees from the Department of Computer Science and Engineering, Washington University at St. Louis, USA. He is currently an Associate Professor with the Department of Computer Science, University of Missouri–St. Louis, St. Louis, MO, USA. His current research interests include Internet of Things, edge computing, machine learning, cybersecurity, and smart energy. He is an Associate Editor for IEEE Communication Magazine and IEEE ACCESS.
