

Dynamic Scheduling for Stochastic Edge-Cloud Computing Environments Using A3C Learning and Residual Recurrent Neural Networks

Shreshth Tuli, Shashikant Ilager, Kotagiri Ramamohanarao, and Rajkumar Buyya, Fellow, IEEE

Abstract—The ubiquitous adoption of Internet-of-Things (IoT) based applications has resulted in the emergence of the Fog computing paradigm, which allows seamlessly harnessing both mobile-edge and cloud resources. Efficient scheduling of application tasks in such environments is challenging due to constrained resource capabilities, mobility factors in IoT, resource heterogeneity, network hierarchy, and stochastic behaviors. Existing heuristics and Reinforcement Learning based approaches lack generalizability and quick adaptability, and thus fail to tackle this problem optimally. They are also unable to utilize temporal workload patterns and are suitable only for centralized setups. However, Asynchronous Advantage Actor-Critic (A3C) learning is known to quickly adapt to dynamic scenarios with less data, and Residual Recurrent Neural Networks (R2N2) are known to quickly update model parameters. Thus, we propose an A3C based real-time scheduler for stochastic Edge-Cloud environments that allows decentralized learning concurrently across multiple agents. We use the R2N2 architecture to capture a large number of host and task parameters together with temporal patterns to provide efficient scheduling decisions. The proposed model is adaptive and able to tune different hyper-parameters based on the application requirements. We explicate our choice of hyper-parameters through sensitivity analysis. Experiments conducted on a real-world dataset show significant improvements in energy consumption, response time, Service-Level-Agreement violations and running cost, by 14.4, 7.74, 31.9, and 4.64 percent, respectively, when compared to state-of-the-art algorithms.

Index Terms—Edge computing, cloud computing, deep reinforcement learning, task scheduling, recurrent neural network, asynchronous advantage actor-critic

Shreshth Tuli is with the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, School of Computing and Information Systems, University of Melbourne, Parkville, VIC 3010, Australia, and also with the Department of Computer Science and Engineering, Indian Institute of Technology, New Delhi, Delhi 110016, India. E-mail: [email protected].

Shashikant Ilager, Kotagiri Ramamohanarao, and Rajkumar Buyya are with the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, School of Computing and Information Systems, University of Melbourne, Parkville, VIC 3010, Australia. E-mail: [email protected], {kotagiri, rbuyya}@unimelb.edu.au.

Manuscript received 2 Apr. 2020; revised 26 July 2020; accepted 12 Aug. 2020. Date of publication 17 Aug. 2020; date of current version 3 Feb. 2022. (Corresponding author: Shreshth Tuli.) Digital Object Identifier no. 10.1109/TMC.2020.3017079

1 INTRODUCTION

THE advancements in the Internet of Things (IoT) have resulted in a massive amount of data being generated with enormous volume and rate. Applications that access this data, analyze it and trigger actions based on stated goals require adequate computational infrastructure to satisfy the requirements of users [1]. Due to increased network latency, traditional cloud-centric IoT application deployments fail to provide a quick response to many time-critical applications such as health-care, emergency response, and traffic surveillance [2]. Consequently, the emerging Edge-Cloud is a promising computing paradigm that provides a low-latency response to this new class of IoT applications [3], [4], [5]. Here, along with the remote cloud, the edge of the network has limited computational resources to provide a quick response to time-critical applications.

The resources at the edge of the network are constrained due to cost and feasibility factors [6]. Efficient utilization of Edge resources, to accommodate a greater number of applications and to simultaneously maximize their Quality of Service (QoS), is extremely necessary. To achieve this, ideally, we need a scheduler that efficiently manages workloads and the underlying resources. However, scheduling in the Edge computing paradigm is exceptionally challenging due to many factors. Primarily, due to heterogeneity, the computational servers of the remote cloud and the local edge nodes differ significantly in terms of their capacity, speed, response time, and energy consumption. Moreover, machines can also be heterogeneous within the cloud and edge layers. Besides, due to the mobility factor in the Edge paradigm, bandwidth continuously changes between the data source and computing nodes, which requires continual dynamic optimization to meet the application requirements. Furthermore, the Edge-Cloud environment is stochastic in many aspects, such as the tasks' arrival rate, duration and resource requirements, which further makes the scheduling problem challenging. Therefore, dynamic task scheduling that efficiently utilizes the multi-layer resources in stochastic environments becomes crucial to save energy and cost and simultaneously improve the QoS of applications.

The existing task or job scheduling algorithms in Edge-Cloud environments have been dominated by heuristics or rule-based policies [7], [8], [9], [10], [11], [12]. Although heuristics usually work well in general cases, they do not account for the dynamic contexts driven by both workloads and composite computational paradigms like Edge-Cloud.

Furthermore, they fail to adapt to continuous changes in the system [13], which are common in Edge-Cloud environments [14]. To that end, Reinforcement Learning (RL) based scheduling is a promising avenue for dynamic optimization of the system [13], [15]. RL solutions are more accurate, as the models are built from actual measurements, and they can identify complex relationships between different interdependent parameters. Recent works have explored different value-based RL techniques to optimize several aspects of Resource Management Systems (RMS) in distributed environments [16], [17], [18], [19]. Such methods store a Q value function, in a table or using a neural network, for each state of the edge-cloud environment, where the Q value is the expected cumulative reward in the RL setup [20]. Tabular value-based RL methods face the problem of limited scalability [21], [22], [23], for which researchers have proposed various deep learning based methods like Deep Q-Learning (DQN) [24], [25], [26], which use a neural network to approximate the Q value. However, previous studies have shown that such value-based RL techniques are not suitable for highly stochastic environments [27], which makes them perform poorly in Edge-Cloud deployments. A limited number of works leverage policy gradient methods [28], but they optimize only a single QoS parameter and do not use asynchronous updates for faster adaptability in highly stochastic environments. Moreover, prior works do not exploit the temporal patterns in workload, network and node behaviours to further improve scheduling decisions. Furthermore, these works use a centralized scheduling policy, which is not suitable for decentralized or hierarchical environments. Hence, this work maps and solves the scheduling problem in stochastic edge-cloud environments using asynchronous policy gradient methods that recognize temporal patterns using recurrent neural networks and continuously adapt to the dynamics of the system to yield better results.

In this regard, we propose a deep policy gradient based scheduling method to capture the complex dynamics of workloads and the heterogeneity of resources. To continuously improve over the dynamic environment, we use the asynchronous policy gradient reinforcement learning method called Asynchronous Advantage Actor-Critic (A3C). A3C, proposed by Mnih et al. [27], is a policy gradient method for directly updating a stochastic policy that runs multiple actor-agents asynchronously, each agent having its own neural network. The agents are trained in parallel and periodically update a global network that holds the shared parameters. After each update, the agents reset their parameters to those of the global network and continue their independent exploration and training until they update themselves again. This method allows quick exploration of a larger state-action space [27] and enables models to rapidly adapt to stochastic environments. Moreover, it allows us to run multiple models asynchronously on different edge or cloud nodes in a decentralized fashion without a single point of failure. Using this, we propose a learning model based on a Residual Recurrent Neural Network (R2N2). The R2N2 model is capable of accurately identifying highly nonlinear patterns across the different features of the input and exploiting the temporal workload and node patterns, with the residual layers increasing the speed of learning [29]. Moreover, the proposed scheduling model can be tuned to optimize the required QoS metrics based on the application demands using the adaptive loss function proposed in this work. Minimizing this loss function through policy learning helps achieve highly optimized scheduling decisions. Unlike heuristics, the proposed framework can adapt to new requirements, as it continuously improves the model by tuning parameters based on new observations. Furthermore, the policy gradient enables our model to quickly adapt the allocation policy in response to dynamic workloads, host behaviour and QoS requirements, compared to traditional DQN methods. Experimental results using an extended version of the iFogSim toolkit [30] with elements of CloudSim 5.0 [31] show the superiority of our model against existing heuristics and previously proposed RL models. Our methodology achieves significant improvements in several critical metrics such as energy, response time, Service Level Agreement (SLA) violations [8] and cost, among others.

In summary, the key contributions of this paper are:

- We design an architectural system model for data-driven deep reinforcement learning based scheduling in Edge-Cloud environments.
- We outline a generic asynchronous learning model for scheduling in decentralized environments.
- We propose a policy gradient based reinforcement learning method (A3C) for stochastic dynamic scheduling.
- We demonstrate a Residual Recurrent Neural Network (R2N2) based framework for exploiting temporal patterns for scheduling in a hybrid Edge-Cloud setup.
- We show the superiority of the proposed solution through extensive simulation experiments and compare the results against several baseline policies.

The rest of the paper is organized as follows. Section 2 describes the system model and formulates the problem specification. Section 3 explains a generic policy gradient based learning model. Section 4 explains the proposed A3C-R2N2 model for scheduling in Edge-Cloud environments. The performance evaluation of the proposed method is presented in Section 5. Relevant prior works are discussed in Section 6. Conclusions and future directions are presented in Section 7.

2 SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we describe the system model and the interaction between the various components that enable adaptive reinforcement-based scheduling. In addition, we describe the workload model and the problem formulation.

Fig. 1. System model.

2.1 System Model

In this work, we assume that the underlying infrastructure is composed of both edge and cloud nodes. An overview of the system model is shown in Fig. 1. The edge-cloud environment consists of distributed heterogeneous resources in the network hierarchy, from the edge of the network to the multi-hop remote cloud. The computing resources act as hosts for various application tasks. These hosts can vary significantly in their compute power and response times. The edge devices are closer to the users and hence provide much lower response times, but they are resource-constrained with limited computation capability. On the other hand, cloud resources (Virtual Machines), located several hops away from the users, have a much higher response time. However, cloud nodes are resource-enriched, with increased computational capabilities that can process multiple tasks concurrently.

The infrastructure is controlled by a Resource Management System (RMS), which consists of Scheduling, Migration and Resource Monitoring Services. The RMS receives tasks with their QoS and SLA requirements from IoT devices and users. It schedules the new tasks and also periodically decides whether existing tasks need to be migrated to new hosts based on the optimization objectives. The tasks' CPU, RAM, bandwidth and disk requirements, together with their expected completion times or deadlines, affect the decision of the RMS. This effect is simulated using a stochastic task generator known as the Workload Generation Module (WGM), following a dynamic workload model for task execution described in the next subsection.

In our model, the Scheduler and Migration services interact with a Deep Reinforcement Learning Module (DRLM), which suggests a placement decision for each task (on hosts) to the former services. Instead of a single scheduler, we run multiple schedulers with separate partitions of tasks and nodes. These schedulers can run on a single node or on separate edge-cloud nodes [27]. As shown in prior works [27], [32], having multiple actors learn parameter updates in an asynchronous fashion allows the computational load to be distributed among different hosts, enabling faster learning within the limits of resource-constrained edge devices. Thus, in our system, we assume all edge and cloud nodes accumulate local gradients for their schedulers and add and synchronize the gradients of all such hosts to update their models individually. Our policy learning model is part of the DRLM, with each scheduler holding a separate copy of the global neural network, which allows asynchronous updates. Another vital component of the RMS is the Constraint Satisfaction Module (CSM), which checks whether a suggestion from the DRLM is valid in terms of constraints such as whether a task is already in migration or the target host is running at full capacity. The importance and detailed functionality of the CSM are explained in Section 3.2.

2.2 Workload Model

As described before, task generation is stochastic and each task has a dynamic workload. Based on changing user demands and the mobility of IoT devices, the computation and bandwidth requirements of the tasks change with time. As done in prior works [8], [30], we divide the execution time into scheduling intervals of equal duration. The scheduling intervals are numbered based on their order of occurrence, as shown in Fig. 2. The $i$th scheduling interval is denoted $SI_i$; it starts at time $t_i$ and continues until the beginning of the next interval, i.e., $t_{i+1}$. In each $SI_i$, the active tasks are those being executed on the hosts, denoted $a_i$. At the beginning of $SI_i$, the set of tasks that complete is denoted $l_i$ and the new tasks sent by the WGM are denoted $n_i$. The tasks $l_i$ leave the system and the new tasks $n_i$ are added to it. Thus, at the beginning of interval $SI_i$, the set of active tasks is $a_i = a_{i-1} \cup n_i \setminus l_i$.

Fig. 2. Dynamic task workload model.

2.3 Problem Formulation

The problem we consider is to optimize the performance of the scheduler in the edge-cloud environment described in Section 2.1 under the dynamic workload described in Section 2.2. The performance of the scheduler is quantified by a metric denoted $Loss$, defined for each scheduling interval; the lower the value of $Loss$, the better the scheduler. We denote the loss of interval $SI_i$ as $Loss_i$.

In the edge-cloud environment, the set of hosts is denoted $Hosts$ and its enumeration as $[H_0, H_1, \ldots, H_n]$. We assume that the maximum number of hosts at any instant of the execution is $n$. We also denote the host assigned to a task $T$ as $\{T\}$. We define our scheduler as a mapping from the state of the system to an action, which consists of host allocations for new tasks and migration decisions for active tasks. The state of the system at the beginning of $SI_i$, denoted $State_i$, consists of the parameter values of $Hosts$, the remaining active tasks of the previous interval ($a_{i-1} \setminus l_i$) and the new tasks ($n_i$). The scheduler has to decide, for each task in $a_i$ ($= a_{i-1} \cup n_i \setminus l_i$), the host to be allocated or migrated to, which we denote as $Action_i$ for $SI_i$. However, not all tasks may be migratable. Let $m_i \subseteq a_{i-1} \setminus l_i$ be the migratable tasks. Thus, $Action_i = \{h \in Hosts \text{ for task } T \mid T \in m_i \cup n_i\}$, which is a migration decision for tasks in $m_i$ and an allocation decision for tasks in $n_i$. The scheduler, denoted $Model$, is thus a function $State_i \to Action_i$. The $Loss_i$ of an interval depends on the allocation of tasks to hosts, i.e., the $Action_i$ chosen by the $Model$. Hence, for an optimal $Model$, the problem can be formulated as in Equation (1):

$$\begin{aligned}
\underset{Model}{\text{minimize}} \quad & \sum_i Loss_i \\
\text{subject to} \quad & \forall\, i,\; Action_i = Model(State_i) \\
& \forall\, i\;\forall\, T \in m_i \cup n_i,\; \{T\} \leftarrow Action_i(T). \qquad (1)
\end{aligned}$$
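To make the interval algebra above concrete, the following minimal Python sketch (ours, not part of the paper's implementation) applies the task-set update $a_i = a_{i-1} \cup n_i \setminus l_i$; the task identifiers are hypothetical.

    def next_active_tasks(prev_active, new_tasks, leaving):
        # a_i = (a_{i-1} union n_i) \ l_i at the start of interval SI_i
        return (prev_active | new_tasks) - leaving

    # Example: three tasks are active, one completes, two arrive.
    a_prev = {"t1", "t2", "t3"}
    a_i = next_active_tasks(a_prev, new_tasks={"t4", "t5"}, leaving={"t2"})
    assert a_i == {"t1", "t3", "t4", "t5"}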

A symbol table for ease of reference and a Venn diagram of the various task sets are given in Table 1 and Fig. 3, respectively.

TABLE 1
Symbol Table

Symbol            Meaning
$SI_i$            $i$th scheduling interval
$a_i$             Active tasks in $SI_i$
$l_i$             Tasks leaving at the beginning of $SI_i$
$n_i$             New tasks received at the beginning of $SI_i$
$Hosts$           Set of hosts in the Edge-Cloud Datacenter
$n$               Number of hosts in the Edge-Cloud Datacenter
$H_i$             $i$th host in an enumeration of $Hosts$
$T_i^S$           $i$th task in an enumeration of $S$
$\{T\}$           Host assigned to task $T$
$FV_i^S$          Feature vector corresponding to $S$ at $SI_i$
$m_i$             Migratable tasks in $a_i$
$Action_i^{PG}$   Scheduling decision at the start of $SI_i$
$Loss_i^{PG}$     Loss function for the model at the start of $SI_i$

Fig. 3. Venn diagram of various task sets.

3 REINFORCEMENT LEARNING MODEL

We now propose a Reinforcement Learning model for the problem statement described in Section 2.3, suitable for policy gradient learning. First, we present the input and output specifications of the Neural Network, and then describe the modeling of $Loss_i$ (from Equation (1)) in our model.

3.1 Input Specification

The input of the scheduler $Model$ is $State_i$, which consists of the parameters of the hosts, including the utilization and capacity of CPU, RAM, bandwidth and disk [16]. It also includes the power characteristics, cost per unit time, Million Instructions per Second (MIPS), response time, and the number of tasks allocated to each host. Different hosts have different computational power (CPU), memory capacity (RAM) and I/O availability (disk and bandwidth). As tasks in an edge-cloud setup impose compute, memory and I/O limitations, such parameters are crucial for scheduling decisions. Moreover, allowing multiple tasks to be placed on a small cluster of hosts can ensure low energy usage (hibernating the hosts with no tasks). A host with higher I/O capacity (disk read/write speeds) can allow I/O-intensive tasks to complete quickly and prevent SLA violations. All these parameters are defined for all hosts in a feature vector denoted $FV_i^{Hosts}$, as shown in Fig. 4a. The tasks in $a_i$ are segregated into two disjoint sets: $n_i$ and $a_{i-1} \setminus l_i$. The former consists of parameters like task CPU, RAM, bandwidth and disk requirements. The latter additionally contains the index of the host assigned in the previous interval. The feature vectors of these sets of tasks are denoted $FV_i^{n_i}$ and $FV_i^{a_{i-1} \setminus l_i}$, as shown in Figs. 4b and 4c, respectively. Thus, $State_i$ becomes $(FV_i^{Hosts}, FV_i^{a_{i-1} \setminus l_i}, FV_i^{n_i})$, which is the input of the model.

Fig. 4. Matrix representation of model inputs.

3.2 Output Specification

At the beginning of interval $SI_i$, the model needs to provide a host assignment for each task in $a_i$ based on the input $State_i$. The output, denoted $Action_i$, is a host assignment for each new task $\in n_i$ and a migration decision for the remaining active tasks from the previous interval $\in a_{i-1} \setminus l_i$. This assignment must be valid in terms of the feasibility constraints: each task that is migrated must be migratable to the new host (we denote the migratable tasks as $m_i \subseteq a_i$), i.e., it is not already under migration. Moreover, when a host $h$ is allocated to a task $T$, then after allocation $h$ should not get overloaded, i.e., $h$ is suitable for $T$. Thus, we describe $Action_i$ through Equation (2) such that for the interval $SI_i$, $\forall\, T \in n_i \cup m_i,\; \{T\} \leftarrow Action_i(T)$:

$$Action_i = \begin{cases} h \in Hosts & \forall\, t \in n_i \\ h_{new} \in Hosts & \forall\, t \in m_i \text{ if } t \text{ is to be migrated} \end{cases}$$
$$\text{subject to } Action_i \text{ being suitable for } t \;\;\forall\, t \in n_i \cup m_i. \qquad (2)$$

However, developing a model that provides a constrained output is computationally difficult [33]; hence, we use an alternative, unconstrained definition of the model action and compensate for the constraints in the objective function. In the unconstrained formulation, the output is a priority list of hosts for each task. Thus, for task $T_j^{a_i}$ we have a list of hosts $[H_{j_0}, H_{j_1}, \ldots, H_{j_n}]$ in decreasing order of allocation preference. For a neural network, the output is a vector of allocation preferences over the hosts for every task: rather than specifying a single host per task, the model provides a ranked list of hosts. We denote this unconstrained model action for the policy gradient setup as $Action_i^{PG}$, as shown in Fig. 5.

This unconstrained action cannot be used directly for updating the task allocation to hosts. We need to select the most preferred host that is suitable for each task, and migration applies only to those tasks that are migratable. Converting $Action_i^{PG}$ to $Action_i$ is straightforward, as shown in Equation (3). For $Action_i(T_j^{a_i})$, if $T_j^{a_i} \in a_{i-1} \setminus l_i$ and is not migratable, then it is not migrated. Otherwise, $T_j^{a_i}$ is allocated to the highest-ranked host that is suitable. By the conversion of Equation (3), $Action_i$ always obeys the constraints specified in Equation (2) and hence is used for the model update:

$$\begin{aligned}
Action_i(T_j^{a_i}) = H_{j_k} \;\mid\; & T_j^{a_i} \in m_i \cup n_i \;\wedge\; H_{j_k} \text{ is suitable for } T_j^{a_i} \\
\wedge\; & \forall\, l < k,\; H_{j_l} \in Action_i^{PG}(T_j^{a_i}),\; H_{j_l} \text{ is not suitable for } T_j^{a_i}. \qquad (3)
\end{aligned}$$
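To make the Equation (3) conversion concrete, here is a small Python sketch of how a constraint-satisfaction step could walk each task's preference list. The is_migratable and is_suitable predicates stand in for the CSM checks described above; they, and all names here, are our assumptions rather than the paper's code.

    from typing import Callable, Dict, List, Optional

    def convert_action(pref_lists: Dict[str, List[str]],        # Action_i^PG
                       current_host: Dict[str, Optional[str]],  # {T}; None for new tasks
                       is_migratable: Callable[[str], bool],
                       is_suitable: Callable[[str, str], bool]) -> Dict[str, str]:
        """Builds the constrained Action_i from the unconstrained Action_i^PG."""
        action = {}
        for task, ranked_hosts in pref_lists.items():
            if current_host[task] is not None and not is_migratable(task):
                action[task] = current_host[task]  # non-migratable tasks stay in place
                continue
            for host in ranked_hosts:              # highest-preference suitable host wins
                if is_suitable(task, host):
                    action[task] = host
                    break
        return action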

Fig. 5. Matrix representation of model output: $Action_i^{PG}$.

Additionally, we define a penalty for the unconstrained action, as in Equation (4). This captures two aspects: (1) the host allocation penalty, i.e., for each task, the number of hosts that were given higher preference but could not be allocated to it; and (2) the migration penalty, i.e., the fraction of tasks that the model wanted to migrate but which could not be migrated. The first addend in Equation (4) captures the host allocation penalty and the second addend captures the migration penalty; this penalty guides the learning model to make decisions that respect the constraints in Equation (2). It is used in the loss function defined in Section 3.3. Thus, we define the penalty as

$$Penalty_{i+1} = \frac{\sum_{t \in a_i} k \;\mid\; H_k = Action_i(t) \,\wedge\, H_k \in Action_i^{PG}(t)}{|a_i| \cdot n} \;+\; \frac{\sum_{t \in a_{i-1} \setminus l_i} \mathbf{1}\big(t \notin m_i \,\wedge\, Action_i^{PG}(t) \neq \{t\}\big)}{|a_i|}. \qquad (4)$$

Hence, the output $Action_i^{PG}$ is first processed by the CSM to generate $Action_i$ and $Penalty_{i+1}$. To update the parameters of the model at the beginning of $SI_i$, we incorporate both $Loss_i$ and $Penalty_i$, as described in the next subsection.
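The following sketch mirrors our reading of Equation (4); the variable names and the exact indicator test are assumptions based on the textual description of the two penalty terms.

    def penalty(pref_lists, action, current_host, migratable, num_hosts):
        # First term: for every task, the rank index of the host actually
        # chosen, i.e., how many higher-preference hosts were unsuitable,
        # normalized by |a_i| * n.
        skipped = sum(pref_lists[t].index(h) for t, h in action.items())
        alloc_pen = skipped / (len(pref_lists) * num_hosts)
        # Second term: fraction of tasks the model preferred to move although
        # they were not migratable.
        blocked = sum(1 for t, ranked in pref_lists.items()
                      if current_host.get(t) is not None
                      and t not in migratable
                      and ranked[0] != current_host[t])
        return alloc_pen + blocked / len(pref_lists)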
3.3 Loss Function

In our learning model, we want the model to be optimal in reducing $Loss_i$ in each interval, and hence the cumulative loss. We also want our model, which is a mapping from $State_i$ to $Action_i$, to adapt to the dynamically changing state. For this, we now define $Loss_i$, which acts as the metric for the parameter update of the model. First, we define various metrics (normalized to $[0,1]$) that help us define $Loss_i$.

1) Average Energy Consumption (AEC) is defined for any interval as the energy consumption of the infrastructure (including all edge and cloud hosts) normalized by the maximum power of the environment. Edge and cloud nodes may have different energy sources, such as energy-harvesting devices for the edge and the mains supply for the cloud [34]. Thus, we multiply the energy consumed by a host $h \in Hosts$ by a factor $\alpha_h \in [0, 1]$, which can be set for edge and cloud nodes as per the user requirements and deployment strategy. The power is normalized as
$$AEC_i^{Hosts} = \frac{\sum_{h \in Hosts} \alpha_h \int_{t_i}^{t_{i+1}} P_h(t)\, dt}{\sum_{h \in Hosts} \alpha_h P_h^{max} (t_{i+1} - t_i)}, \qquad (5)$$
where $P_h(t)$ is the power function of host $h$ over time, and $P_h^{max}$ is the maximum possible power of $h$.

2) Average Response Time (ART) is defined for an interval $SI_i$ as the average response time of all leaving tasks ($l_{i+1}$) in that interval, normalized by the maximum response time observed up to the current interval, as shown in Equation (6). The task response time is the sum of the response time of the host on which the task is scheduled and the task execution time. Hence,
$$ART_i = \frac{\sum_{t \in l_{i+1}} \mathrm{Response\ Time}(t)}{|l_{i+1}| \cdot \max_i \max_{t \in l_i} \mathrm{Response\ Time}(t)}. \qquad (6)$$

3) Average Migration Time (AMT) is defined for an interval $SI_i$ as the average migration time of all active tasks ($a_i$) in that interval, normalized as shown in Equation (7):
$$AMT_i = \frac{\sum_{t \in a_i} \mathrm{Migration\ Time}(t)}{|a_i| \cdot \max_i \max_{t \in l_i} \mathrm{Response\ Time}(t)}. \qquad (7)$$

4) Cost (C) is defined for an interval $SI_i$ as the total cost incurred during that interval, as shown in Equation (8):
$$Cost_i = \frac{\sum_{h \in Hosts} \int_{t_i}^{t_{i+1}} C_h(t)\, dt}{\sum_{h \in Hosts} C_h^{max} (t_{i+1} - t_i)}, \qquad (8)$$
where $C_h(t)$ is the cost function of host $h$ over time, and $C_h^{max}$ is the maximum cost per unit time for host $h$.

5) Average SLA Violations (SLAV) is defined for an interval $SI_i$ as the average number of SLA violations in that interval over the leaving tasks ($l_{i+1}$), as shown in Equation (9). $SLA(t)$ of a task $t$ is defined in [8] as the product of two metrics: (i) SLA violation time per active host and (ii) performance degradation due to migrations. Thus,
$$SLAV_i = \frac{\sum_{t \in l_{i+1}} SLA(t)}{|l_{i+1}|}. \qquad (9)$$

To minimize the above metrics, as done in various prior works [16], [35], we define $Loss_i$ as a convex combination of these metrics for interval $SI_{i-1}$:
$$\begin{aligned}
Loss_i = \;& \alpha \cdot AEC_{i-1} + \beta \cdot ART_{i-1} + \gamma \cdot AMT_{i-1} + \delta \cdot Cost_{i-1} + \epsilon \cdot SLAV_{i-1} \\
& \text{such that } \alpha, \beta, \gamma, \delta, \epsilon \geq 0 \;\wedge\; \alpha + \beta + \gamma + \delta + \epsilon = 1. \qquad (10)
\end{aligned}$$

Based on different user QoS requirements and application settings, different values of the hyper-parameters $(\alpha, \beta, \gamma, \delta, \epsilon)$ may be required. For energy-sensitive applications [36], [37], [38], we need to optimize energy even though other metrics might be compromised; then the loss would have $\alpha = 1$ and the rest 0. For response-time-sensitive applications like healthcare monitoring or traffic management [39], the loss would have $\beta = 1$ and the rest 0. Similarly, different applications require different sets of hyper-parameter values.
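As a worked example of Equation (10), the sketch below combines the normalized metrics under user-chosen weights; the numeric inputs are made up for illustration.

    def interval_loss(aec, art, amt, cost, slav, alpha, beta, gamma, delta, eps):
        # Convex combination of the normalized metrics of Equations (5)-(9);
        # the weights are non-negative and sum to 1.
        assert min(alpha, beta, gamma, delta, eps) >= 0
        return alpha * aec + beta * art + gamma * amt + delta * cost + eps * slav

    # An energy-sensitive deployment (alpha = 1, rest 0) scores only AEC:
    print(interval_loss(0.3, 0.5, 0.2, 0.4, 0.1, 1, 0, 0, 0, 0))  # -> 0.3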

Now, for the Neural Network model, we also need to include the penalty, because the output described in Section 3.2 is unconstrained, as done in other works [40], [41]. By including the penalty defined in Equation (4), the model updates its parameters not only to minimize $Loss_i$ but also to satisfy the constraints described in Equation (2). Thus, we define the loss for the Neural Network as shown in Equation (11):
$$Loss_i^{PG} = Loss_i + Penalty_i. \qquad (11)$$

3.4 Model Update

Having defined the input-output specifications and the loss function, we now define the procedure to update the $Model$ after every scheduling interval. A summary of the interaction and model update for the transition from interval $SI_{i-1}$ to interval $SI_i$ is shown in Fig. 6. We consider an episode to contain $n$ scheduling intervals. At the beginning of every scheduling interval $SI_i$, the WGM sends new tasks to the Scheduling and Migration Service (SMS). Then, the SMS and WGM send $State_i$ to the DRLM, which includes the feature vectors of the hosts, the remaining active tasks from the previous interval ($a_{i-1} \setminus l_i$) and the new tasks ($n_i$). The RMS also sends $Loss_i$ to the DRLM, and the CSM sends $Penalty_i$ based on the decision $Action_{i-1}^{PG}$. The model then generates $Action_i^{PG}$ and updates its parameters based on Equation (11); the action is sent to the CSM. The CSM converts $Action_i^{PG}$ to $Action_i$ and sends it to the RMS. It also calculates and stores $Penalty_{i+1}$ for the next interval $SI_{i+1}$. The RMS allocates the new tasks ($n_i$) and migrates the remaining tasks from the previous interval ($a_{i-1} \setminus l_i$) based on the $Action_i$ received from the CSM. This updates $a_{i-1}$ to $a_i$ as $a_i \leftarrow a_{i-1} \cup n_i \setminus l_i$. The tasks in $a_i$ execute for the interval $SI_i$, and the cycle repeats for the next interval $SI_{i+1}$.

Fig. 6. Learning model.

4 STOCHASTIC DYNAMIC SCHEDULING USING POLICY GRADIENT LEARNING

The complete framework works as follows. At the beginning of every scheduling interval: (1) the RMS receives the task requests, including task parameters like computation, bandwidth and SLA requirements. (2) These requirements and the host characteristics from the Resource Monitoring Service are used by the DRL model to predict the next scheduling decisions. (3) The Constraint Satisfaction Module derives the feasible migration and scheduling decision from the output of the DRL model. (4) For the new tasks, the RMS informs the user/IoT device to send its request directly to the corresponding edge/cloud device scheduled for this task. (5) The loss function is calculated for the DRL model and its parameters are updated. The formulation and learning model described in Section 3 are generic to any policy-based RL model. The model, which is a function from $State_i$ to $Action_i^{PG}$, is assumed to be the theoretically best function for minimizing $Loss_i^{PG}$. Many prior works try to model this function using a Q-table or a neural-network function approximator [16], [24], [26], giving a deterministic policy that is unable to adapt in stochastic settings. Our approach instead approximates the policy itself and optimizes it using policy gradient methods, with $Loss_i^{PG}$ as the signal to update the network.

4.1 Neural Network Architecture

To approximate the function from $State_i$ to $Action_i^{PG}$ for every interval $SI_i$, we use an R2N2 network. The advantage of an R2N2 network is its ability to capture complex temporal relationships between the inputs and outputs. The architecture, with the layer description used for the proposed work, is shown in Fig. 7. A single network is used to predict both the policy (actor head) and the cumulative loss after the current interval (critic head).

Fig. 7. Neural network architecture.

The R2N2 network has 2 fully connected layers followed by 3 recurrent layers with skip connections. The 2-dimensional input is first flattened and then passed through the dense layers. The output of the last recurrent layer is sent to the two network heads. The actor head output is of size $10^4$, which is reshaped to a 2-dimensional $100 \times 100$ matrix. This means that the model can manage at most 100 tasks and 100 hosts. This is done for a fair comparison with other methods that have been tested on similar settings [8], [16]; for a larger system, the network must be changed accordingly. Finally, softmax is applied across the second dimension so that all values are in $[0, 1]$ and the values in each row sum to 1. This output (say $O$) can be interpreted as a probability map, where $O_{jk}$ represents the probability with which task $T_j^{a_i}$ should be assigned to host $H_k$, the $k$th host in an enumeration of $Hosts$. The output of the critic head is a single scalar that signifies the value function, i.e., the cumulative loss starting from the next interval ($CLoss_{i+1}^{PG}$).
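A minimal PyTorch sketch of such an actor-critic network is given below. The hidden width, the placement of the ReLU activations, and the use of one residual connection around each GRU are our assumptions; Fig. 7 is the authoritative description of the paper's architecture.

    import torch
    import torch.nn as nn

    class R2N2(nn.Module):
        def __init__(self, in_features: int, hidden: int = 128,
                     n_tasks: int = 100, n_hosts: int = 100):
            super().__init__()
            self.dense = nn.Sequential(
                nn.Linear(in_features, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            # Three single-layer GRUs, so a residual connection can be
            # added around each one individually.
            self.grus = nn.ModuleList(nn.GRU(hidden, hidden, batch_first=True)
                                      for _ in range(3))
            self.actor = nn.Linear(hidden, n_tasks * n_hosts)   # policy head
            self.critic = nn.Linear(hidden, 1)                  # value head
            self.n_tasks, self.n_hosts = n_tasks, n_hosts

        def forward(self, state):                # state: (batch, seq, in_features)
            x = self.dense(state)
            for gru in self.grus:
                out, _ = gru(x)
                x = x + out                      # residual skip connection
            last = x[:, -1]                      # features at the current interval
            probs = torch.softmax(              # rows sum to 1 across hosts
                self.actor(last).view(-1, self.n_tasks, self.n_hosts), dim=-1)
            return probs, self.critic(last)     # probability map O and value

    # Example forward pass on a dummy pre-processed state sequence.
    model = R2N2(in_features=64)
    O, v = model(torch.randn(1, 12, 64))
    print(O.shape, v.shape)   # torch.Size([1, 100, 100]) torch.Size([1, 1])

The residual connections keep the gradient path through the recurrent stack short, which is the stated motivation for the skip connections in the text that follows.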

The recurrent layers are formed using Gated Recurrent Units (GRUs) [42], which model the temporal aspects of the task and host characteristics, including the tasks' CPU, RAM and bandwidth requirements and the hosts' CPU, RAM and bandwidth capacities. Although the GRU layers help in making informed scheduling decisions by modeling the temporal characteristics, they increase the training complexity due to the large number of network parameters. This is addressed by the skip connections between these layers, which enable faster gradient propagation.

4.2 Pre-Processing and Output Conversion

The input to the model for interval $SI_i$ is $State_i$, a 2-dimensional vector comprising $FV_i^{Hosts}$, $FV_i^{n_i}$ and $FV_i^{a_{i-1} \setminus l_i}$. Among these vectors, all elements of the first two are continuous, but the host index in each row of $FV_i^{a_{i-1} \setminus l_i}$ is a categorical value. Hence, the host indices are converted to one-hot vectors of size $n$ and all feature vectors are concatenated. After this, each element of the concatenated vector is normalized based on the minimum and maximum values of its feature and clipped to $[0, 1]$. We denote the feature of element $e$ as $f_e$, and the minimum and maximum values of feature $f$ as $min_f$ and $max_f$, respectively. These minimum and maximum values are calculated on a sample dataset using two heuristic-based scheduling policies: Local Regression (LR) for task allocation and Maximum Migration Time (MMT) for task selection, as described in [8]. The feature-wise standardization is then done based on Equation (12):
$$e = \begin{cases} 0 & \text{if } max_{f_e} = min_{f_e} \\ \min\!\big(1, \max\!\big(0, \frac{e - min_{f_e}}{max_{f_e} - min_{f_e}}\big)\big) & \text{otherwise.} \end{cases} \qquad (12)$$

This pre-processed input is sent to the R2N2 model, which flattens it and passes it through the dense layers. The generated output $O$ is converted to $Action_i^{PG}$ by first producing, for each task, the list of hosts $SortedHosts_i$ sorted by decreasing probability in $O_i$. Then $Action_i^{PG}(T_k^{m_i \cup n_i}) \leftarrow SortedHosts_k \;\;\forall\, k \in \{1, 2, \ldots, |m_i \cup n_i|\}$.

4.3 Policy Learning

To learn the weights and biases of the R2N2 network, we use the back-propagation algorithm with $Loss_i^{PG}$ as the reward signal. For the current model, we use an adaptive learning rate starting from $10^{-2}$ and reduced to $1/10$th when the absolute sum of the change in the reward over the last ten iterations is less than 0.1. Using $Loss_i^{PG}$ as the reward, we perform Automatic Differentiation [43] to update the network parameters. We accumulate the gradients of the local networks at all edge nodes asynchronously and update the global network parameters periodically, as described in [27]. The gradient accumulation rule after the $i$th scheduling interval is given by Equation (13), similar to the one in [27]. Here $\theta$ denotes the global network parameters and $\theta'$ the local parameters (only one gradient is set because a single network carries both heads):
$$\begin{aligned}
d\theta \leftarrow\; & -\,\alpha\, \nabla_{\theta'} \log \pi(State_i; \theta')\,\big(Loss_i^{PG} + CLoss_{i+1}^{Pred}\big) \\
& +\, \alpha\, \nabla_{\theta'} \big(Loss_i^{PG} + CLoss_{i+1}^{Pred} - CLoss_i^{Pred}\big)^2. \qquad (13)
\end{aligned}$$

The log term in Equation (13) specifies the direction of change of the parameters; the $(Loss_i^{PG} + CLoss_{i+1}^{Pred})$ term is the predicted cumulative loss in this episode starting from $State_i$. To minimize it, the gradients are proportional to this quantity, with a minus sign so as to reduce the total loss. The second gradient term is the Mean Square Error (MSE) of the predicted cumulative loss against the cumulative loss after a one-step look-ahead. The output $Action_i^{PG}$ is converted to $Action_i$ by the CSM and sent to the RMS every scheduling interval. Thus, for each interval there is a forward pass of the R2N2 network. For back-propagation, we use an episode size of 12: we save the experience of the previous episode to compute and accumulate gradients, and we update the model parameters after every 12 intervals. For large batch sizes, parameter updates are slower; for small ones, the gradient accumulation does not generalize and has high variance. Accordingly, empirical analysis yielded an optimal episode size of 12. As described in Section 5.1, the experimental setup has a scheduling interval of 5 minutes, so back-propagation is performed every hour of simulation time (after 12 intervals).

A summary of the model update and scheduling with back-propagation is shown in Algorithm 1. To decide the best possible scheduling decision for each interval, we iteratively pre-process and send the interval state to the R2N2 model, together with the loss and penalty, to update the network parameters. This allows the model to adapt on-the-fly to the environment, user and application-specific requirements.

Algorithm 1. Dynamic Scheduling
Inputs:
1: Number of scheduling intervals N
2: Batch size B
Begin
3: for interval index i from 1 to N do
4:   if i > 1 and i % B == 0 then
5:     Use Loss_i^PG = Loss_i + Penalty_i in RL Model for back-propagation
6:   end if
7:   send PREPROCESS(State_i) to RL Model
8:   probabilityMap <- output of RL Model for State_i
9:   (Action_i, Penalty_{i+1}) <- CONSTRAINTSATISFACTIONMODULE(probabilityMap)
10:  Allocate new tasks and migrate existing tasks based on Action_i
11:  Execute tasks in edge-cloud infrastructure for interval SI_i
12: end for
End

Complexity Analysis. The complexity of Algorithm 1 depends on multiple steps. The pre-processing of the input state is $O(ab)$, where $a \times b$ is the maximum size of a feature vector among $FV_i^{Hosts}$, $FV_i^{n_i}$ and $FV_i^{a_{i-1} \setminus l_i}$. To generate $Action_i$ and $Penalty_i$, the CSM takes $O(n^2)$ time for $n$ hosts and tasks, based on Equations (3) and (4). As the feature vectors have a higher cardinality than the number of hosts or tasks, $O(ab)$ dominates $O(n^2)$. Therefore, discarding the forward pass and back-propagation (as they are performed on Graphics Processing Units, GPUs [44]), the total time complexity for $N$ scheduling intervals is $O(abN)$.
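To ground the Section 4.3 update rule, the single-process sketch below performs one Equation (13)-style step over a saved episode of experiences. In the actual A3C setup the accumulated gradients would be applied to the shared global parameters rather than to a local optimizer, and all shapes and names here are assumptions; the model is taken to return a (policy map, critic value) pair as in the earlier network sketch.

    import torch

    def a3c_step(model, optimizer, states, chosen, loss_pg):
        """One local update over a saved episode of experiences.

        states:  (episode, seq, features) pre-processed State_i tensors
        chosen:  (episode, n_tasks) long tensor of host indices picked by the CSM
        loss_pg: (episode,) observed Loss_i^PG values (a cost: lower is better)
        """
        optimizer.zero_grad()
        probs, values = model(states)           # (E, T, H) policy map, (E, 1) critic
        values = values.squeeze(-1)             # CLoss^Pred_i for each interval
        # One-step look-ahead cost target: Loss_i^PG + CLoss^Pred_{i+1}.
        next_values = torch.cat([values[1:], values[-1:]]).detach()
        target = loss_pg + next_values
        log_pi = torch.log(
            probs.gather(-1, chosen.unsqueeze(-1)).squeeze(-1) + 1e-8).sum(-1)
        actor_term = (log_pi * target.detach()).mean()   # push policy off costly actions
        critic_term = ((target - values) ** 2).mean()    # MSE term of Equation (13)
        (actor_term + critic_term).backward()            # accumulate local gradients
        optimizer.step()   # in A3C this would be applied to the global network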

5 PERFORMANCE EVALUATION

In this section, we describe the experimental setup, evaluation metrics and dataset, and give a detailed analysis of the results, comparing our model with several baseline algorithms.

5.1 Experimental Setup

To evaluate the proposed Deep Learning based scheduling framework, we developed a simulation environment by extending elements of the iFogSim [30] and CloudSim [31] toolkits, which already have resource monitoring services built in. As described in Section 4.3, the execution of the simulation was divided into equal-length scheduling intervals. The interval size was chosen to be 5 minutes, the same as in other works [8], [16], [24], for a fair comparison with the baseline algorithms. The tasks, named Cloudlets in iFogSim nomenclature, are generated by the WGM based on the Bitbrain dataset [45]. We extended the modules of iFogSim and CloudSim to support parameters like the response time, cost and power of edge nodes. We also created new modules to simulate the mobility of IoT devices using bandwidth variations and delayed task execution, and to interact with the deep learning software. Additional software for the Constraint Satisfaction Module, input pre-processing and output conversion was developed.

The loss function is calculated based on the host and task monitoring services in CloudSim. The penalty is calculated by the CSM and sent to the DRLM for the model parameter update. We now describe in more detail the dataset, the task generation and duration implementation, the hosts' configuration and the evaluation metrics.

5.1.1 Dataset

In the simulation environment, the tasks (cloudlets) are assigned to Virtual Machines (VMs), which are then allocated to hosts. For the current setting of tasks on the edge-cloud environment, we consider a bijection from cloudlets to VMs by allocating the $i$th created Cloudlet to the $i$th created VM, and we discard the VM when the corresponding Cloudlet is completed. The dynamic workload is generated for the cloudlets based on the real-world, open-source Bitbrain dataset [45].1

The Bitbrain dataset [45] contains real traces of resource consumption metrics of business-critical workloads hosted on Bitbrain infrastructure. The data includes logs of over 1,000 VMs hosted on two types of machines. We chose this dataset because it represents real-world infrastructure usage patterns, which is useful for constructing precise input feature vectors for learning models. The dataset contains workload information for each time-stamp (separated by 5 minutes), including the number of requested CPU cores, CPU usage in terms of MIPS, requested RAM, and Network (receive/transmit) and Disk (read/write) bandwidth characteristics. These categories of workload data constitute the feature values of $FV_i^{n_i}$ and $FV_i^{a_{i-1} \setminus l_i}$, where the latter also contains the index of the host allocated in the previous scheduling interval. The CPU, RAM, network bandwidth and disk characteristics of a random node and its trace in the Bitbrain dataset are highly volatile, as shown in Fig. 8.

Fig. 8. Bitbrain dataset characteristics.

We divide the dataset into two partitions of 25 and 75 percent of the VM workloads. The larger partition is used for training the R2N2 network, and the smaller partition is used for testing the network, the sensitivity analysis and the comparison with related works.

5.1.2 Task Generation and Duration Configuration

In the proposed work, we consider a dynamic task generation model. Prior work [8] does not consider a dynamic task generation environment, which is not close to a real-world setting. At the beginning of every interval, the WGM sends $n_i$ new tasks, where $|n_i|$ is normally distributed as $\mathcal{N}(\mu_n, \sigma_n^2)$. Each task $t \in n_i$ has an execution duration drawn from $\mathcal{N}(\mu_t, \sigma_t^2)$ seconds. In our setting, we keep 100 hosts and at most 100 tasks in the system, scheduled by 10 actor-agents (schedulers). In our simulation environment we set $(\mu_{n_i}, \sigma_{n_i}) = (12, 5)$ for the number of new tasks and $(\mu_t, \sigma_t) = (1800, 300)$ seconds for the task durations. At task-creation time, with $|a_{i-1} \setminus l_i|$ tasks already active, we only create $\min(100 - |a_{i-1} \setminus l_i|,\; \mathcal{N}(\mu_{n_i}, \sigma_{n_i}^2))$ tasks, so that $|a_i|$ does not exceed 100. This limit is required because the input size of the R2N2 network has a prefixed upper limit, which in our case is 100.

1. The Bitbrain dataset can be downloaded from: http://gwa.ewi.tudelft.nl/datasets/gwa-t-12-bitbrains
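The following sketch reproduces the Section 5.1.2 task generator: Gaussian arrivals and durations, capped so the active-task count never exceeds the network's 100-task input limit. The use of numpy's default_rng and the seed are our choices; the paper's WGM runs inside the extended iFogSim environment.

    import numpy as np

    rng = np.random.default_rng(42)
    MU_N, SIGMA_N = 12, 5        # new tasks per interval ~ N(12, 5^2)
    MU_T, SIGMA_T = 1800, 300    # task duration in seconds ~ N(1800, 300^2)
    CAP = 100                    # prefixed input-size limit of the R2N2 network

    def generate_tasks(active_count: int):
        n_new = int(round(rng.normal(MU_N, SIGMA_N)))
        n_new = max(0, min(CAP - active_count, n_new))  # keep |a_i| <= 100
        durations = rng.normal(MU_T, SIGMA_T, size=n_new).clip(min=0)
        return durations

    print(len(generate_tasks(active_count=95)))  # at most 5 tasks can be admitted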

5.1.3 Hosts - Edge and Cloud Nodes

The infrastructure considered in our studies is a heterogeneous edge-cloud environment. Unlike prior work [16], [24], [25], [26], we consider both resource-constrained edge devices close to the user, with low response times, and resource-abundant cloud nodes with much higher response times. In our settings, the response time of edge nodes is 1 ms and that of cloud nodes is 10 ms, based on empirical studies using the Ping utility in an existing edge-cloud framework, namely FogBus [4].

Moreover, the environment is heterogeneous, with a diverse range of computational capabilities across the edge and cloud hosts. A summary of the CPU, RAM, network and other capacities, together with the cost model, is given in Table 2; there are 25 instances of each host type in the environment. The cost model for the cloud layer is based on the Microsoft Azure IaaS cloud service. The cost per hour (in US Dollars) is calculated from the costs of machines of similar configuration offered by Microsoft Azure in South-East Australia.2 For the edge nodes, the cost is based on the energy consumed by the edge node. As per the targeted environment convention, we choose resource-constrained machines at the edge (Intel i3 and Intel i5) and powerful rack servers as cloud nodes (Intel Xeon). The power consumption of the respective machines, averaged over the different SPEC benchmarks [46], is shown in Table 2. These power values are averages over this specific benchmark suite; the power consumption of hosts also depends on RAM, disk and bandwidth usage characteristics, which are provided to the model by the underlying CloudSim simulator. In the execution environment, we use the host capacities (CPU, RAM, network bandwidth, etc.) and the current usage to form the feature vector $FV_i^{Hosts}$ for the $i$th scheduling interval. For the experiments, we keep the testing simulation duration at 1 day, which equals 288 scheduling intervals.

TABLE 2
Configuration of Hosts in the Experiment Setup

2. Microsoft Azure pricing calculator for South-East Australia: https://azure.microsoft.com/en-au/pricing/calculator/

5.2 Evaluation Metrics

To evaluate the efficacy of the proposed A3C-R2N2 scheduler, we consider the following metrics, motivated by prior works [4], [16], [35]: energy is paramount in resource-constrained edge-cloud environments; real-time tasks require low response times; service level agreements are crucial for time-critical tasks; and low execution cost is required for budget-constrained task execution.

1) Total Energy Consumption, given as $\sum_{h \in Hosts} \int_{t_i}^{t_{i+1}} P_h(t)\, dt$ accumulated over the complete simulation duration.
2) Average Response Time, given as $\frac{\sum_{t \in l_{i+1}} \mathrm{Response\ Time}(t)}{|l_{i+1}|}$.
3) SLA Violations, given as $\frac{\sum_i SLAV_i \cdot |l_{i+1}|}{\sum_i |l_{i+1}|}$, where $SLAV_i$ is defined by Equation (9).
4) Total Cost, given as $\sum_i \sum_{h \in Hosts} \int_{t_i}^{t_{i+1}} C_h(t)\, dt$.

Other metrics of importance include the Average Task Completion Time; the Total Number of Completed Tasks, together with the fraction of tasks completed within the expected execution time (based on the requested MIPS); the Number of Task Migrations in each interval; and the Total Migration Time per interval. The task completion time is defined as the sum of the average task scheduling time, the task execution time and the response time of the host on which the task ran during its last scheduling interval.
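Before moving to the baselines, here is a short sketch of how the above aggregates could be computed from per-interval logs; the input structures are our assumptions, not the simulator's API.

    def total_energy(interval_energy):
        # Sum of the integral of P_h(t) over all hosts and intervals (metric 1).
        return sum(interval_energy)

    def avg_response_time(response_times):
        # Mean response time over all leaving tasks (metric 2).
        return sum(response_times) / len(response_times)

    def sla_violations(slav_per_interval, leaving_counts):
        # Interval SLAV values weighted by the number of leaving tasks (metric 3).
        return (sum(s * c for s, c in zip(slav_per_interval, leaving_counts))
                / sum(leaving_counts))

    def total_cost(interval_cost):
        # Sum of the integral of C_h(t) over all hosts and intervals (metric 4).
        return sum(interval_cost)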

5.3 Baseline Algorithms

We evaluate the performance of our proposed algorithm against the following baseline algorithms; the reasons for choosing them are described in Section 6. Multiple heuristics have been proposed in [8] for dynamic scheduling. These are combinations of different sub-heuristics for sub-problems such as host overload detection and task/VM selection, and we selected the best of those heuristics. All of these variants use the Best Fit Decreasing (BFD) heuristic to identify the target host. Furthermore, we also compare our results against two types of standard RL approaches that are widely used in the literature.

- LR-MMT: schedules workloads dynamically based on the Local Regression (LR) and Minimum Migration Time (MMT) heuristics for overload detection and task selection, respectively (details in [8]).
- MAD-MC: schedules workloads dynamically based on the Median Absolute Deviation (MAD) and Maximum Correlation (MC) heuristics for overload detection and task selection, respectively (details in [8]).
- DDQN: a standard Deep Q-Learning based RL approach; many works have used this technique, including [16], [25], [26]. We implement the optimized Double DQN technique.
- DRL (REINFORCE): a policy gradient based REINFORCE method with a fully connected neural network [28].

Note that we implement these algorithms adapted to our problem and compare the results. The RL models used for comparison with our proposed model use a state representation identical to the $State_i$ defined in Section 3.1, for a fair comparison. An action is a change from one state to another in the state space. As in [24], the DQN network is updated using the Bellman Equation [47] with the reward defined as $Loss_i^{PG}$. The REINFORCE method is implemented without asynchronous updates or a recurrent network.

5.4 Analysis of Results

In this subsection, we provide the experimental results obtained using the setup and dataset described in Section 5.1, and we discuss and compare the results based on the evaluation metrics specified in Section 5.2. We first analyze the sensitivity of the hyper-parameters $(\alpha, \beta, \gamma, \delta, \epsilon)$ on model learning and how they affect the different metrics. We then analyze the variation of scheduling decisions for different hyper-parameter values and show how the combined optimization of the different evaluation metrics provides better results. We also compare the fraction of scheduling time in the total execution time while varying the number of layers of the R2N2 network. Based on the above analysis, we find the optimal R2N2 network and hyper-parameter values to compare with the baseline algorithms described in Section 5.3. All model learning is done over 10 days of simulation time, and testing is done over 1 day of simulation time using a disjoint set of workloads from the dataset.

5.4.1 Sensitivity Analysis of Hyper-Parameters

We first provide experimental results in Fig. 9 for different hyper-parameter values and show how changing the loss function to optimize only one metric of interest changes the learned network and yields different values of the evaluation metrics; these experiments were carried out for a single day of simulation. To visualize the output probability map of the R2N2 network, we display it using a color map depicting the probabilities (0 to 1) of allocating tasks to hosts, as described in Section 4.2.

Fig. 9. Comparison of models trained with different loss functions.

When $\alpha = 1$ (rest 0), the R2N2 network solely optimizes the average energy consumption, and hence we call it the Energy Minimizing Network (EMN). The total energy consumed across the simulation is lowest for this network, as shown in Fig. 9a. As low-energy devices (edge nodes) consume the least energy and also have the lowest cost, energy is highly correlated with cost, and hence the Cost Minimizing Network (CMN, $\delta = 1$) also has a very low total energy consumption. As shown in Fig. 10, for the same $State_i$, the probability maps, and hence the allocations, are similar for both networks. Similarly, Fig. 9d shows that CMN has the lowest cost and the next lowest cost is achieved by EMN.

The graph in Fig. 9b shows that the Response Time Minimizing Network (RTMN, $\beta = 1$) has the lowest average response time and tries to place most of the tasks on edge nodes, as also shown in Fig. 11a. Moreover, this network does not differentiate among the edge nodes in terms of their CPU loads, because all edge nodes have the same response time, and hence it gives almost the same probability to every edge node for each task. The SLA Violation Minimizing Network (SLAVMN, $\epsilon = 1$) also has a low response time, as the number of SLA violations is directly related to the response time of tasks. However, SLA violations also depend on the completion time of tasks, and since the average task completion time of RTMN is very high, the SLA violations of RTMN are much higher than those of SLAVMN, as shown in Fig. 9c. The fraction of SLA violations is lowest for SLAVMN and next lowest for the Migration Time Minimizing Network (MTMN, $\gamma = 1$). The SLAVMN network also sends tasks to edge nodes like RTMN, but it additionally considers the task execution times and CPU loads to distribute tasks more evenly, as shown in Fig. 11b.

When only the average migration time is optimized, the average task completion time is lowest, as shown in Fig. 9e. However, the SLA violations are not the lowest, as this network does not try to minimize the response time of tasks, as shown in Fig. 9b. Moreover, the number of completed tasks is highest for this network, as shown in Fig. 9f. Still, the fraction of tasks completed within the expected time is highest for SLAVMN. Figs. 9g and 9h show that the number of task migrations and the migration time are lowest for MTMN. As compared in Fig. 12, the number of migrations for a sample of 30 initial tasks is 7 for EMN and 0 for MTMN.

Fig. 10. Probability map for EMN and CMN showing similarity and positive correlation.

Fig. 11. Probability map for RTMN and SLAVMN showing that the former does not distinguish among edge nodes but SLAVMN does.

Fig. 14. Scalability of A3C-R2N2.

Fig. 12. Probability maps showing that MTMN has lesser migrations than with 3 recurrent layers and hyper-parameter values given
EMN. by Equation (14).

Optimizing each of the evaluation metrics independently shows that the R2N2 based network can adapt and update its parameters to learn the dependence among tasks and hosts so as to reduce the metric of interest, which may be energy, response time, etc. However, for the optimum network, we use a combination of all metrics. This combined optimization leads to a much lower value of the loss and a much better network. This is because optimizing along only one variable might reach a local optimum, and since the loss is a highly non-linear function over the hyper-parameter space, combined optimization leads to a much better network [48]. Based on empirical evaluation of each combination and block coordinate descent [49] for minimizing the loss, the optimum values of the hyper-parameters are given by Equation (14):

(α, β, γ, δ, ε) = (0.4, 0.16, 0.174, 0.135, 0.19).    (14)
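A minimal sketch of this combined objective, assuming (consistent with Equation (14)) that the loss is a weighted sum of the five normalized metric estimates; the function and argument names are illustrative placeholders, not the paper's implementation:

```python
import torch

# Hyper-parameter values from Equation (14).
ALPHA, BETA, GAMMA, DELTA, EPSILON = 0.4, 0.16, 0.174, 0.135, 0.19

def combined_loss(aec, art, amt, cost, slav):
    """Weighted sum of (assumed normalized) metric estimates:
    average energy consumption, average response time, average
    migration time, total cost and SLA-violation fraction."""
    return (ALPHA * aec + BETA * art + GAMMA * amt
            + DELTA * cost + EPSILON * slav)

# Example with dummy differentiable scalar tensors.
loss = combined_loss(*(torch.rand(()) for _ in range(5)))
```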
5.4.2 Sensitivity Analysis of the Number of Layers

Now that we have the optimum values of the hyper-parameters, we analyze the scheduling overhead with the number of recurrent layers of the R2N2 network. The scheduling overhead is calculated as the ratio of the time taken for scheduling to the total execution duration in terms of simulation time. As shown in Fig. 13, the value of the loss function decreases with an increase in the number of layers of the neural network. This is expected because as the number of layers increases, so does the number of parameters, and thus the ability of the network to fit more complex functions improves. The scheduling overhead depends on the system on which the simulation is run; for the current experiments, the system used an Intel i7-7700K CPU and an Nvidia GTX 1070 GPU (8 GB of graphics RAM). As shown in the figure, there is an inflection point at 3 recurrent layers because an R2N2 network with 4 or more such layers could not fit in the GPU graphics RAM. Based on the available simulation infrastructure, for the comparison with baseline algorithms we use the R2N2 network with 3 recurrent layers and the hyper-parameter values given by Equation (14).

Fig. 13. Loss and scheduling overhead with number of recurrent layers.

5.4.3 Scalability Analysis

We now show how the A3C-R2N2 model scales with the number of actor agent hosts in the setup. As discussed in Section 2, we have multiple edge-cloud nodes in the environment which run the policy learning described in Section 4.3. However, the number of such agents affects the time to train the Actor-Critic network. We define the time taken by n agents to reduce the loss value to 2.5 as Time_n. The speedup corresponding to a system with n actors is then calculated as S_n = Time_1 / Time_n, and the efficiency of a system with n agents is defined as E_n = S_n / n [50]. Fig. 14 shows how the speedup and efficiency of the model vary with the number of agent nodes. As shown, the speedup increases with n; however, the efficiency reduces as n increases, in a piece-wise linear fashion. There is a sudden drop in efficiency when the number of agents is increased from 1, because of the communication delay between agents, which leads to slower model updates. The drop increases again after 20 hosts due to the addition of GPU-less agents beyond 20 hosts. Thus, having agents run only on a CPU significantly reduces the efficiency of the proposed architecture. For our experiments, we keep all active edge-cloud hosts (100 in our case) as actor agents in the A3C learning, for faster convergence and a worst-case overhead comparison. In such a case, the speedup is 34.3 and the efficiency is 0.37.

Fig. 14. Scalability of A3C-R2N2.
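A small sketch of these two definitions; the timing values below are hypothetical placeholders, not measurements from the paper:

```python
def speedup_and_efficiency(times):
    """Compute S_n = Time_1 / Time_n and E_n = S_n / n, where
    times[n] is the time for n agents to reach a loss of 2.5."""
    t1 = times[1]
    speedup = {n: t1 / t for n, t in times.items()}
    efficiency = {n: s / n for n, s in speedup.items()}
    return speedup, efficiency

# Hypothetical timings (hours) for 1, 20 and 100 agents.
s, e = speedup_and_efficiency({1: 70.0, 20: 5.0, 100: 2.0})
```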
5.4.4 Evaluation With Baseline Algorithms

Having obtained the empirically best set of hyper-parameter values and number of layers, and having discussed the scalability aspects of the model, we now compare our policy gradient based reinforcement learning model with the baseline algorithms described in Section 5.3. The graphs in Fig. 16 provide results for 1 day of simulation time with a scheduling interval of 5 minutes on the Bitbrain dataset.

Fig. 16a shows that among the baseline algorithms, DDQN and REINFORCE have the least energy consumption, but the A3C-R2N2 model has even lower energy consumption, 14.4 and 15.8 percent lower than REINFORCE and DDQN respectively. The main reason behind this is that the A3C-R2N2 network is able to adapt to the task workload behavior quickly, which allows a resource-hungry task to be scheduled to a powerful machine. Moreover, the presence of the Average Energy Consumption (AEC) metric of all the edge-cloud nodes within the loss function forces the model to take energy-efficient scheduling decisions. This results in a minimum number of active hosts, with the remaining hosts in stand-by mode to conserve energy (utilizing this feature of CloudSim).
Moreover, Fig. 16b shows that among all the scheduling policies, A3C-R2N2 provides the least average response time, which is 7.74 percent lower than that of the REINFORCE policy, the best among the baseline algorithms. This is because the A3C-R2N2 model explicitly takes as input whether a node is an edge or cloud node and allocates tasks without multiple migrations, with the Average Migration Time (AMT) embedded in the loss function. As shown in Fig. 16c, the A3C-R2N2 model has the least number of SLA violations, 31.9 percent lower than the REINFORCE policy. This again is due to reduced migrations and intelligent scheduling of tasks, which prevent a high loss value caused by SLA violations. As shown in Fig. 16d, the total cost of the data center is least for the A3C-R2N2 model, as it receives the cost model (cost per hour of consumption) for each host as a feature in FV_i^Hosts and can ensure that tasks are allocated to as few cloud VMs as possible to reduce cost. Compared to the best baseline (REINFORCE), the A3C-R2N2 model reduces cost by 4.64 percent.

Furthermore, the A3C-R2N2 model also considers the task completion times in the previous scheduling interval and the expected completion times of running tasks. For time-critical tasks, the A3C-R2N2 model allocates them to a powerful host machine and avoids migration to save migration time. This way, the A3C-R2N2 model can reduce the average completion time, as shown in Fig. 16e, where it is lower than REINFORCE by 17.53 percent. Also, as seen in Fig. 16f, the number of tasks completed and the fraction completed in the expected time are highest for the A3C-R2N2 model. As the number of migrations and the migration time severely affect the quality of response of the tasks, Figs. 16g and 16h show how the A3C-R2N2 model achieves the best metric values by having a low number of task migrations.

To compare the scheduling overhead of the R2N2 model with the baseline algorithms, we provide a comparative result in Fig. 15. As the R2N2 network needs to be updated every 1 hour of simulation time, the scheduling time is slightly higher than that of the other algorithms. Heuristic-based algorithms have very low scheduling overhead as they follow simple greedy approaches; the R2N2 model's overhead is higher by 0.002 percent than that of the RL model. Even though the scheduling overhead is higher than the baseline algorithms', it is not significantly large. Considering the performance improvement brought by the R2N2 model, this overhead is negligible and makes the R2N2 model a better scheduler than the heuristics or traditional RL based techniques for Edge-Cloud environments with stochastic workloads.

Fig. 15. Overheads.

Fig. 16. Comparison of the deep learning model with prior heuristic-based works.

5.5 Summary of Insights

The R2N2 model works better than the baseline algorithms because it can sense and adapt to the dynamically changing environment, unlike the heuristic-based policies, which use a representative technique for making scheduling decisions and are prone to jumping to erroneous conclusions due to their limited adaptability. Compared to the DDQN approach, the asynchronous policy gradient allows the R2N2 model to quickly change the scheduling policy based on changes in network, workload and device characteristics, allowing the model to quickly adapt to dynamically changing scenarios. Fig. 17 shows scheduling decisions, classified as edge or cloud, for different approaches over time for a sample task under a response time minimization goal. For a task that has

low resource requirements, it is better to schedule it on a low-latency edge node rather than the cloud. Only when the task becomes resource intensive is it optimal to send it to the cloud, as it may otherwise slow down the edge node. The REINFORCE-Dense model is unable to exploit temporal patterns, such as the increasing resource utilization of a task across previous scheduling decisions, to optimally decide the task allocation. This not only leads to a higher frequency of sub-optimal decisions but also increases migration time.

Fig. 17. Allocation timeline.

Fig. 18. Convergence comparison.

Considering these points, the A3C-R2N2 strategy can adapt to non-stationary targets and approximate and learn the parameters much faster and more precisely than the traditional RL based approaches, as shown in Fig. 18. Fig. 18 also shows that the loss value is much lower for the A3C-R2N2 model than for the REINFORCE-Dense model. The average loss value in the last 1 hour of a full-day experiment is 2.78 for REINFORCE-Dense and 1.12 (nearly a 60 percent reduction in loss value) for the proposed model. To summarize, earlier works did not model temporal aspects using neural networks due to the slower training of recurrent layers like the GRU. However, modern advancements in residual connections, together with the proposed formulation, allow faster propagation of gradients, providing a solution to the slow training problem.
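The following is a minimal PyTorch sketch of this idea: GRU layers stacked with residual skip connections so that gradients can bypass the recurrent transformations during backpropagation. It is an illustration only, not the exact R2N2 architecture of Section 4; the layer count and feature dimension are hypothetical.

```python
import torch
import torch.nn as nn

class ResidualGRUStack(nn.Module):
    """Stacked GRU layers with residual (skip) connections,
    in the spirit of R2N2 [29]."""

    def __init__(self, dim, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.GRU(dim, dim, batch_first=True) for _ in range(num_layers)])

    def forward(self, x):
        for gru in self.layers:
            out, _ = gru(x)
            x = x + out  # residual path eases gradient flow
        return x

# Example: a batch of 8 sequences, 12 time steps, 64 features each.
y = ResidualGRUStack(dim=64)(torch.randn(8, 12, 64))
```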
et al. [53] explored Resource Management with DDQN. They
apply the DRL to scheduling jobs on multiple resources and
6 RELATED WORK analyze the reasons for achieving high gain compared to
Several studies [7], [8], [9], [10], [11], [12], [55] have proposed state-of-the-art heuristics. As described before, these Q-learn-
different types of heuristics for the scheduling applications ing based algorithms lack the ability to quickly adapt in sto-
in Edge-Cloud environment. Each of these studies focuses chastic environments. Mao et al. [28] and Rjoub et al. [54] also
on optimizing different parameters for a specific set of appli- explored DRL (REINFORCE) based scheduling for edge only
cations. Some of the works are applied to Cloud systems, environments. They only consider response time as a metric
while others are for Edge-Cloud environments. It is well and also do not exploit asynchronous or recurrent networks
known that heuristics work for generic cases and fail to to optimize model adaptability and robustness.
respond to the dynamic changes in environments. However, A summary of the comparison of relevant works with our
a learning-based model can adapt and improve over time by work over different parameters is shown in Table 3. We con-
tuning its parameters according to new observations. sider that the scheduler is dynamic if the optimization is car-
Predictive optimizations have been studied by [16], [17], ried dynamically for active tasks and new tasks that arrive in
[18], [19], [24], [25], [26], [52] in many of the recent works. the system continuously. Stochastic workload is defined by
These works use different Machine Learning (ML) and Deep changing tasks arrival rates and resource consumption char-
Learning (DL) techniques to optimize the Resource Manage- acteristics. The definitions for remaining parameters are self
ment System (RMS). Deep Neural Networks (DNN) and explanatory. For the sake of brevity, instead of comparing to
Deep Reinforcement Learning (DRL) approaches have been all the heuristics based work in the table, we compare our
widely used in this regard. In most of these works, optimiz- work to [8] and [12] which act as some of the baseline algo-
ing energy is a primary objective. Bui et al. [51] studied a pre- rithm in our experiments. The existing RL based solutions
dictive optimization framework for energy efficiency of use Q-learning models [16], [24], [25] and are focused on
cloud computing. They predict the resource utilization of the optimizing the specific parameters such as energy or cost,

TABLE 3
Comparison of Related Works With Different Parameters

wherein we compare our approach with DDQN [53] and DRL (REINFORCE) [54]. All these baseline methods are adapted to be used in the proposed edge-cloud setup. However, in Edge-Cloud environments, the infrastructure is shared among a diverse set of users requiring different QoS for their respective applications. In such a case, the scheduling algorithm must be adaptive and able to tune automatically to application requirements. Our proposed framework can be optimized to achieve better efficiency with respect to different QoS parameters, as shown in Sections 4 and 5. Moreover, the Edge-Cloud environment brings heterogeneous complexity and stochastic workload behavior, which need to be modeled within the scheduling problem. We model these parameters efficiently in our model.

7 CONCLUSIONS AND FUTURE WORK
Efficiently utilizing edge and cloud resources to provide better QoS and response times in stochastic environments with dynamic workloads is a complex problem. This problem is complicated further by the heterogeneity of multi-layer resources and the difference in response times of devices in Edge-Cloud datacenters. Integrated usage of cloud and edge is a non-trivial problem, as resources and networks have completely different characteristics when users or edge nodes are mobile. Prior work not only fails to consider these differences between edge and cloud devices but also ignores the effect of stochastic workloads and dynamic environments. This work aims to provide an end-to-end real-time task scheduler for integrated edge and cloud computing environments. We propose a novel A3C-R2N2 based scheduler that can consider all important parameters of tasks and hosts to make scheduling decisions that provide better performance. Furthermore, A3C allows the scheduler to quickly adapt to dynamically changing environments using asynchronous updates, and R2N2 is able to quickly learn network weights while also exploiting temporal task/workload behaviors. Extensive simulation experiments using iFogSim and CloudSim on the real-world Bitbrain dataset show that our approach can reduce energy consumption by 14.4 percent, response time by 7.74 percent, SLA violations by 31.9 percent and cost by 4.64 percent. Moreover, our model has a negligible scheduling overhead of 0.002 percent compared to the existing baselines, which makes it a better alternative for dynamic task scheduling in stochastic environments.

As part of future work, we plan to implement this model in real edge-cloud environments. Implementation in real environments would require constant profiling of the CPU, RAM and disk requirements of new tasks. This can be done using exponential averaging of the requirement values in the current scheduling interval with the average computed in the previous interval. Further, the CPU, RAM, disk and bandwidth usage would have to be collected and synchronized across all A3C agents in the edge-cloud setup. Further to the scalability analysis, we also plan to conduct tests to check the scalability of the proposed framework with the number of hosts and tasks. The current model can schedule for a fixed number of edge nodes and tasks; however, upcoming scalable reinforcement learning models like IMPALA [56] can be investigated in the future. Moreover, we plan to investigate data privacy and security aspects in the future.
make scheduling decisions to provide better performance. in fog networks,” IEEE Trans. Mobile Comput., vol. 19, no. 9,
pp. 2062–2075, Sep. 2020.
Furthermore, A3C allows the scheduler to quickly adapt to [6] X. Chen, L. Jiao, W. Li, and X. Fu, “Efficient multi-user computa-
dynamically changing environments using asynchronous tion offloading for mobile-edge cloud computing,” IEEE/ACM
updates, and R2N2 is able to quickly learn network weights Trans. Netw., vol. 24, no. 5, pp. 2795–2808, Oct. 2016.
also exploiting the temporal task/workload behaviours. [7] O. Skarlat, M. Nardelli, S. Schulte, M. Borkowski, and P. Leitner,
“Optimized IoT service placement in the fog,” Service Oriented
Extensive simulation experiments using iFogSim and Cloud- Comput. Appl., vol. 11, no. 4, pp. 427–443, 2017.
Sim on real-world Bitbrain dataset show that our approach [8] A. Beloglazov and R. Buyya, “Optimal online deterministic algo-
can reduce energy consumption by 14.4 percent, response rithms and adaptive heuristics for energy and performance effi-
cient dynamic consolidation of virtual machines in cloud data
time by 7.74 percent, SLA violations by 31.9 percent and cost centers,” Concurrency Comput., Practice Experience, vol. 24, no. 13,
by 4.64 percent. Moreover, our model has a negligible sched- pp. 1397–1420, 2012.
uling overhead of 0.002 percent compared to the existing [9] X.-Q. Pham, N. D. Man, N. D. T. Tri, N. Q. Thai, and E.-N. Huh,
baseline which makes it a better alternative for dynamic task “A cost-and performance-effective approach for task scheduling
based on collaboration between cloud and fog computing,” Int. J.
scheduling in stochastic environments. Distrib. Sensor Netw., vol. 13, no. 11, pp. 1–16, 2017.
As part of future work, we plan to implement this model [10] A. Brogi and S. Forti, “QoS-aware deployment of IoT applications
in real edge-cloud environments. Implementation in real through the fog,” IEEE Internet Things J., vol. 4, no. 5, pp. 1185–1192,
Oct. 2017.
environments would require constant profiling CPU, RAM [11] T. Choudhari, M. Moh, and T.-S. Moh, “Prioritized task
and disk requirements of new tasks. This can be done using scheduling in fog computing,” in Proc. ACMSE Conf., 2018,
exponential averaging of requirement values in the current pp. 22:1–22:8.
[12] X.-Q. Pham and E.-N. Huh, “Towards task scheduling in a cloud-fog computing system,” in Proc. 18th Asia-Pacific Netw. Operations Manage. Symp., 2016, pp. 1–4.
[13] D. Jeff, “ML for system, system for ML, keynote talk in workshop on ML for systems, NIPS,” 2018. [Online]. Available: http://mlforsystems.org/
[14] S. Yi, C. Li, and Q. Li, “A survey of fog computing: Concepts, applications and issues,” in Proc. Workshop Mobile Big Data, 2015, pp. 37–42.
[15] G. Fox et al., “Learning everywhere: Pervasive machine learning for effective high-performance computation,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, 2019, pp. 422–429.
[16] D. Basu, X. Wang, Y. Hong, H. Chen, and S. Bressan, “Learn-as-you-go with Megh: Efficient live migration of virtual machines,” IEEE Trans. Parallel Distrib. Syst., vol. 30, no. 8, pp. 1786–1801, Aug. 2019.
[17] H. Li, K. Ota, and M. Dong, “Learning IoT in edge: Deep learning for the Internet of Things with edge computing,” IEEE Netw., vol. 32, no. 1, pp. 96–101, Jan./Feb. 2018.
[18] M. Xu, S. Alamro, T. Lan, and S. Subramaniam, “LASER: A deep learning approach for speculative execution and replication of deadline-critical jobs in cloud,” in Proc. 26th Int. Conf. Comput. Commun. Netw., 2017, pp. 1–8.
[19] Q. Zhang, M. Lin, L. T. Yang, Z. Chen, S. U. Khan, and P. Li, “A double deep Q-learning model for energy-efficient edge scheduling,” IEEE Trans. Services Comput., vol. 12, no. 5, pp. 739–749, Sep./Oct. 2019.
[20] R. S. Sutton et al., Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1998.
[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[22] M. Bowling, “Convergence problems of general-sum multiagent reinforcement learning,” in Proc. 17th Int. Conf. Mach. Learn., 2000, pp. 89–94.
[23] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.
[24] M. Cheng, J. Li, and S. Nazarian, “DRL-cloud: Deep reinforcement learning-based resource provisioning and task scheduling for cloud service providers,” in Proc. 23rd Asia South Pacific Des. Autom. Conf., 2018, pp. 129–134.
[25] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proc. 15th ACM Workshop Hot Topics Netw., 2016, pp. 50–56.
[26] Q. Zhang, M. Lin, L. T. Yang, Z. Chen, and P. Li, “Energy-efficient scheduling for real-time systems based on deep Q-learning model,” IEEE Trans. Sustain. Comput., vol. 4, no. 1, pp. 132–141, Jan.–Mar. 2019.
[27] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[28] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proc. 15th ACM Workshop Hot Topics Netw., 2016, pp. 50–56.
[29] B. Yue, J. Fu, and J. Liang, “Residual recurrent neural networks for learning sequential representations,” Information, vol. 9, no. 3, 2018, Art. no. 56.
[30] H. Gupta, A. Vahid Dastjerdi, S. K. Ghosh, and R. Buyya, “iFogSim: A toolkit for modeling and simulation of resource management techniques in the internet of things, edge and fog computing environments,” Softw.: Pract. Experience, vol. 47, no. 9, pp. 1275–1296, 2017.
[31] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. De Rose, and R. Buyya, “CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms,” Softw.: Pract. Experience, vol. 41, no. 1, pp. 23–50, 2011.
[32] Q. Qi and Z. Ma, “Vehicular edge computing via deep reinforcement learning,” 2018, arXiv:1901.04290.
[33] D. Pathak, P. Krahenbuhl, and T. Darrell, “Constrained convolutional neural networks for weakly supervised segmentation,” in Proc. Int. Conf. Comput. Vis., 2015, pp. 1796–1804.
[34] L. Roselli et al., “Review of the present technologies concurrently contributing to the implementation of the Internet of Things (IoT) paradigm: RFID, green electronics, WPT and energy harvesting,” in Proc. Top. Conf. Wireless Sensors Sensor Netw., 2015, pp. 1–3.
[35] S. Tuli et al., “HealthFog: An ensemble deep learning based smart healthcare system for automatic diagnosis of heart diseases in integrated IoT and fog computing environments,” Future Gener. Comput. Syst., vol. 104, pp. 187–200, 2020.
[36] S. Sarkar and S. Misra, “Theoretical modelling of fog computing: A green computing paradigm to support IoT applications,” IET Netw., vol. 5, no. 2, pp. 23–29, 2016.
[37] Z. Abbas and W. Yoon, “A survey on energy conserving mechanisms for the Internet of Things: Wireless networking aspects,” Sensors, vol. 15, no. 10, pp. 24818–24847, 2015.
[38] P. Kamalinejad, C. Mahapatra, Z. Sheng, S. Mirabbasi, V. C. Leung, and Y. L. Guan, “Wireless energy harvesting for the Internet of Things,” IEEE Commun. Mag., vol. 53, no. 6, pp. 102–108, Jun. 2015.
[39] A. M. Rahmani et al., “Exploiting smart e-Health gateways at the edge of healthcare Internet-of-Things: A fog computing approach,” Future Gener. Comput. Syst., vol. 78, pp. 641–658, 2018.
[40] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 22–31.
[41] R. Doshi, K.-W. Hung, L. Liang, and K.-H. Chiu, “Deep learning neural networks optimization using hardware cost penalty,” in Proc. IEEE Int. Symp. Circuits Syst., 2016, pp. 1954–1957.
[42] R. Dey and F. M. Salemt, “Gate-variants of gated recurrent unit (GRU) neural networks,” in Proc. IEEE 60th Int. Midwest Symp. Circuits Syst., 2017, pp. 1597–1600.
[43] A. Paszke et al., “Automatic differentiation in PyTorch,” NIPS-W, 2017.
[44] B. Li et al., “Large scale recurrent neural network on GPU,” in Proc. Int. Joint Conf. Neural Netw., 2014, pp. 4062–4069.
[45] S. Shen, V. van Beek, and A. Iosup, “Statistical characterization of business-critical workloads hosted in cloud datacenters,” in Proc. 15th IEEE/ACM Int. Symp. Cluster Cloud Grid Comput., 2015, pp. 465–474.
[46] SPEC, “Standard performance evaluation corporation,” 2018. [Online]. Available: https://www.spec.org/benchmarks.html
[47] C. J. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, no. 3/4, pp. 279–292, 1992.
[48] K. Miettinen, Nonlinear Multiobjective Optimization, vol. 12. Berlin, Germany: Springer, 2012.
[49] S. J. Wright, “Coordinate descent algorithms,” Math. Program., vol. 151, no. 1, pp. 3–34, 2015.
[50] D. L. Eager, J. Zahorjan, and E. D. Lazowska, “Speedup versus efficiency in parallel systems,” IEEE Trans. Comput., vol. 38, no. 3, pp. 408–423, Mar. 1989.
[51] D.-M. Bui, Y. Yoon, E.-N. Huh, S. Jun, and S. Lee, “Energy efficiency for cloud computing system based on predictive optimization,” J. Parallel Distrib. Comput., vol. 102, pp. 103–114, 2017.
[52] L. Huang, S. Bi, and Y. J. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks,” IEEE Trans. Mobile Comput., early access, Jul. 24, 2019, doi: 10.1109/TMC.2019.2928811.
[53] F. Li and B. Hu, “DeepJS: Job scheduling based on deep reinforcement learning in cloud data center,” in Proc. 4th Int. Conf. Big Data Comput., 2019, pp. 48–53.
[54] G. Rjoub, J. Bentahar, O. A. Wahab, and A. S. Bataineh, “Deep and reinforcement learning for automated task scheduling in large-scale cloud computing systems,” Concurrency Comput.: Pract. Experience, e5919, 2020. [Online]. Available: https://doi.org/10.1002/cpe.5919
[55] Z. Xiong, Y. Zhang, D. Niyato, R. Deng, P. Wang, and L.-C. Wang, “Deep reinforcement learning for mobile 5G and beyond: Fundamentals, applications, and challenges,” IEEE Veh. Technol. Mag., vol. 14, no. 2, pp. 44–52, Jun. 2019.
[56] L. Espeholt et al., “IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1407–1416.
