Dynamic Scheduling For Stochastic Edge-Cloud Computing Environments Using A3C Learning and Residual Recurrent Neural Networks
Abstract—The ubiquitous adoption of Internet-of-Things (IoT) based applications has resulted in the emergence of the Fog computing
paradigm, which allows seamlessly harnessing both mobile-edge and cloud resources. Efficient scheduling of application tasks in such
environments is challenging due to constrained resource capabilities, mobility factors in IoT, resource heterogeneity, network hierarchy, and
stochastic behaviors. Existing heuristics and Reinforcement Learning based approaches lack generalizability and quick adaptability, thus
failing to tackle this problem optimally. They are also unable to utilize the temporal workload patterns and are suitable only for centralized
setups. However, asynchronous-advantage-actor-critic (A3C) learning is known to adapt quickly to dynamic scenarios with less data, and residual recurrent neural networks (R2N2) can quickly update model parameters. Thus, we propose an A3C based real-time scheduler for
stochastic Edge-Cloud environments allowing decentralized learning, concurrently across multiple agents. We use the R2N2 architecture
to capture a large number of host and task parameters together with temporal patterns to provide efficient scheduling decisions. The
proposed model is adaptive and able to tune different hyper-parameters based on the application requirements. We explicate our choice of
hyper-parameters through sensitivity analysis. Experiments conducted on a real-world dataset show a significant improvement in energy consumption, response time, Service Level Agreement violations and running cost of 14.4, 7.74, 31.9, and 4.64 percent, respectively, when compared to state-of-the-art algorithms.
Index Terms—Edge computing, cloud computing, deep reinforcement learning, task scheduling, recurrent neural network, asynchronous
advantage actor-critic
Furthermore, they fail to adapt to continuous changes in the system [13], which is common in Edge-Cloud environments [14]. To that end, a Reinforcement Learning (RL) based scheduling approach is a promising avenue for dynamic optimization of the system [13], [15]. RL solutions are more accurate because the models are built from actual measurements, and they can identify complex relationships between different interdependent parameters. Recent works have explored different value-based RL techniques to optimize several aspects of Resource Management Systems (RMS) in distributed environments [16], [17], [18], [19]. Such methods store a Q value function, in a table or using a neural network, for each state of the edge-cloud environment, where the Q value is the expected cumulative reward in the RL setup [20]. Tabular value-based RL methods face the problem of limited scalability [21], [22], [23], for which researchers have proposed various deep learning based methods such as Deep Q Learning (DQN) [24], [25], [26], which use a neural network to approximate the Q value. However, previous studies have shown that such value-based RL techniques are not suitable for highly stochastic environments [27], which makes them perform poorly in Edge-Cloud deployments. The limited number of works that do leverage policy gradient methods [28] optimize only a single QoS parameter and do not use asynchronous updates for faster adaptability in highly stochastic environments. Moreover, prior works do not exploit temporal patterns in workload, network and node behaviours to further improve scheduling decisions. Furthermore, these works use a centralized scheduling policy, which is not suitable for decentralized or hierarchical environments. Hence, this work maps and solves the scheduling problem in stochastic edge-cloud environments using asynchronous policy gradient methods that can recognize temporal patterns using recurrent neural networks and continuously adapt to the dynamics of the system to yield better results.
In this regard, we propose a deep policy gradient based scheduling method to capture the complex dynamics of workloads and the heterogeneity of resources. To continuously improve over the dynamic environment, we use the asynchronous policy gradient reinforcement learning method called Asynchronous Advantage Actor Critic (A3C). A3C, proposed by Mnih et al. [27], is a policy gradient method for directly updating a stochastic policy that runs multiple actor-agents asynchronously, each agent having its own neural network. The agents are trained in parallel and periodically update a global network which holds shared parameters. After each update, the agents reset their parameters to those of the global network and continue their independent exploration and training until they update themselves again. This method allows quick exploration of a larger state-action space [27] and enables models to rapidly adapt to stochastic environments. Moreover, it allows us to run multiple models asynchronously on different edge or cloud nodes in a decentralized fashion without a single point of failure. Using this, we propose a learning model based on Residual Recurrent Neural Networks (R2N2). The R2N2 model is capable of accurately identifying the highly nonlinear patterns across different features of the input and exploiting the temporal workload and node patterns, with residual layers increasing the speed of learning [29]. Moreover, the proposed scheduling model can be tuned to optimize the required QoS metrics based on application demands using the adaptive loss function proposed in this work. To that end, minimizing this loss function through policy learning helps achieve highly optimized scheduling decisions. Unlike heuristics, the proposed framework can adapt to new requirements as it continuously improves the model by tuning parameters based on new observations. Furthermore, policy gradients enable our model to quickly adapt the allocation policy in response to dynamic workloads, host behaviour and QoS requirements, compared to traditional DQN methods. The experimental results, obtained using an extended version of the iFogSim Toolkit [30] with elements of CloudSim 5.0 [31], show the superiority of our model against existing heuristics and previously proposed RL models. Our proposed methodology achieves significant efficiency for several critical metrics such as energy, response time, Service Level Agreement (SLA) violations [8] and cost, among others.

In summary, the key contributions of this paper are:
- We design an architectural system model for data-driven deep reinforcement learning based scheduling in Edge-Cloud environments.
- We outline a generic asynchronous learning model for scheduling in decentralized environments.
- We propose a policy gradient based reinforcement learning method (A3C) for stochastic dynamic scheduling.
- We demonstrate a Residual Recurrent Neural Network (R2N2) based framework for exploiting temporal patterns for scheduling in a hybrid Edge-Cloud setup.
- We show the superiority of the proposed solution through extensive simulation experiments and compare the results against several baseline policies.

The rest of the paper is organized as follows. Section 2 describes the system model and formulates the problem. Section 3 explains a generic policy gradient based learning model. Section 4 explains the proposed A3C-R2N2 model for scheduling in Edge-Cloud environments. The performance evaluation of the proposed method is presented in Section 5. Relevant prior works are discussed in Section 6. Conclusions and future directions are presented in Section 7.
2 SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we describe the system model and the interaction between the various components that allow adaptive reinforcement-based scheduling. In addition, we describe the workload model and the problem formulation.

2.1 System Model

In this work, we assume that the underlying infrastructure is composed of both edge and cloud nodes. An overview of the system model is shown in Fig. 1. The edge-cloud environment consists of distributed heterogeneous resources in the network hierarchy, from the edge of the network to the multi-hop remote cloud. The computing resources act as hosts for various application tasks. These hosts can vary significantly in their compute power and response times. The edge devices are closer to the users and hence provide much lower response times, but are resource-constrained with limited computation capability. On the other hand, cloud resources (Virtual Machines), located several hops away from the users, provide much higher response times. However, cloud nodes are resource enriched, with increased computational capabilities that can process multiple tasks concurrently.

The infrastructure is controlled by a Resource Management System (RMS) which consists of Scheduling, Migration and Resource Monitoring Services. The RMS receives tasks with their QoS and SLA requirements from IoT devices and users. It schedules the new tasks and also periodically decides if existing tasks need to be migrated to new hosts based on the optimization objectives. The tasks' CPU, RAM, bandwidth, and disk requirements, together with their expected completion times or deadlines, affect the decision of the RMS. This effect is simulated using a stochastic task generator known as the Workload Generation Module (WGM), following a dynamic workload model for task execution described in the next subsection.
In our model, the Scheduler and Migration services interact with a Deep Reinforcement Learning Module (DRLM), which suggests a placement decision for each task (on hosts) to the former services. Instead of a single scheduler, we run multiple schedulers with separate partitions of tasks and nodes. These schedulers can be run on a single node or on separate edge-cloud nodes [27]. As shown in prior works [27], [32], having multiple actors learn parameter updates in an asynchronous fashion allows the computational load to be distributed among different hosts, allowing faster learning within the limits of resource-constrained edge devices. Thus, in our system, we assume that all edge and cloud nodes accumulate local gradients for their schedulers and add and synchronize the gradients of all such hosts to update their models individually. Our policy learning model is part of the DRLM, with each scheduler holding a separate copy of the global neural network, which allows asynchronous updates. Another vital component of the RMS is the Constraint Satisfaction Module (CSM), which checks if the suggestion from the DRLM is valid in terms of constraints such as whether a task is already in migration or the target host is running at full capacity. The importance and detailed functionality of the CSM are explained in Section 3.2.

2.2 Workload Model

In our workload model, the execution timeline is divided into scheduling intervals of equal duration. The scheduling intervals are numbered based on their order of occurrence as shown in Fig. 2. The $i$th scheduling interval is denoted $SI_i$, which starts at time $t_i$ and continues till the beginning of the next interval, i.e., $t_{i+1}$. In each $SI_i$, the active tasks are those that were being executed on the hosts, denoted as $a_i$. Also, at the beginning of $SI_i$, the set of tasks that get completed is denoted as $l_i$ and the new tasks that are sent by the WGM are denoted as $n_i$. The tasks $l_i$ leave the system and the new tasks $n_i$ are added to the system. Thus, at the beginning of the interval $SI_i$, the active task set is $a_i = (a_{i-1} \setminus l_i) \cup n_i$.
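As a concrete illustration of this interval bookkeeping, the short Python sketch below (with hypothetical task identifiers, not tied to the paper's implementation) computes the active set $a_i = (a_{i-1} \setminus l_i) \cup n_i$ from the previous active set, the completed tasks and the newly arrived tasks.

    # Illustrative sketch of the per-interval task bookkeeping; identifiers are hypothetical.
    def next_active_tasks(active_prev, completed, new_tasks):
        """a_i = (a_{i-1} minus l_i) union n_i: carried-over tasks plus newly arrived ones."""
        return (active_prev - completed) | new_tasks

    # Example: three tasks were active, one completed, two new tasks arrived.
    a_prev = {"t1", "t2", "t3"}
    l_i = {"t2"}
    n_i = {"t4", "t5"}
    a_i = next_active_tasks(a_prev, l_i, n_i)   # {"t1", "t3", "t4", "t5"}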
2.3 Problem Formulation

The problem that we consider is to optimize the performance of the scheduler in the edge-cloud environment described in Section 2.1 under the dynamic workload described in Section 2.2. The performance of the scheduler is quantified by a metric denoted as Loss, defined for each scheduling interval. The lower the value of Loss, the better the scheduler. We denote the loss of the interval $SI_i$ as $Loss_i$.

In the edge-cloud environment, the set of hosts is denoted as $Hosts$ and its enumeration as $[H_0, H_1, \ldots, H_n]$. We assume that the maximum number of hosts at any instant of the execution is $n$. We also denote the host assigned to a task $T$ as $\{T\}$. We define our scheduler as a mapping between the state of the system and an action, which consists of host allocations for new tasks and migration decisions for active tasks. The state of the system at the beginning of $SI_i$, denoted as $State_i$, consists of the parameter values of $Hosts$, the remaining active tasks of the previous interval ($a_{i-1} \setminus l_i$) and the new tasks ($n_i$). The scheduler has to decide, for each task in $a_i$ ($= (a_{i-1} \setminus l_i) \cup n_i$), the host to be allocated or migrated to, which we denote as $Action_i$ for $SI_i$. However, all tasks may not be migratable. Let $m_i \subseteq a_{i-1} \setminus l_i$ be the migratable tasks. Thus, $Action_i = \{h \in Hosts \text{ for task } T \mid T \in m_i \cup n_i\}$, which is a migration decision for tasks in $m_i$ and an allocation decision for tasks in $n_i$. Thus the scheduler, denoted as $Model$, is a function: $State_i \to Action_i$. The $Loss_i$ of an interval depends on the allocation of the tasks to hosts, i.e., on $Action_i$ produced by the $Model$. Hence, for an optimal $Model$, the problem can be formulated as described by Equation (1):

$$\begin{aligned} \underset{Model}{\text{minimize}} \quad & \sum_i Loss_i \\ \text{subject to} \quad & \forall\, i,\; Action_i = Model(State_i) \\ & \forall\, i\;\forall\, T \in m_i \cup n_i,\; \{T\} \leftarrow Action_i(T). \end{aligned} \qquad (1)$$

A symbol table for ease of meaning recall and a Venn diagram of the various task sets are given in Table 1 and Fig. 3, respectively.
TABLE 1
Symbol Table

Symbol            Meaning
$SI_i$            $i$th scheduling interval
$a_i$             Active tasks in $SI_i$
$l_i$             Tasks leaving at the beginning of $SI_i$
$n_i$             New tasks received at the beginning of $SI_i$
$Hosts$           Set of hosts in the Edge-Cloud Datacenter
$n$               Number of hosts in the Edge-Cloud Datacenter
$H_i$             $i$th host in an enumeration of $Hosts$
$T_i^S$           $i$th task in an enumeration of $S$
$\{T\}$           Host assigned to task $T$
$FV_i^S$          Feature vector corresponding to $S$ at $SI_i$
$m_i$             Migratable tasks in $a_i$
$Action_i^{PG}$   Scheduling decision at the start of $SI_i$
$Loss_i^{PG}$     Loss function for the model at the start of $SI_i$

Fig. 3. Venn diagram of various task sets.

3 REINFORCEMENT LEARNING MODEL

We now propose a Reinforcement Learning model for the problem statement described in Section 2.3, suitable for policy gradient learning. First, we present the input and output specifications of the neural network, and then we describe the modeling of $Loss_i$ (from Equation (1)) in our model.
3.1 Input Specification

The input of the scheduler $Model$ is $State_i$, which consists of the parameters of the hosts, including utilization and capacity of CPU, RAM, bandwidth, and disk [16]. It also includes the power characteristics, cost per unit time, Million Instructions per Second (MIPS), response time, and the number of tasks allocated to each host. Different hosts have different computational power (CPU), memory capacity (RAM) and I/O availability (disk and bandwidth). As tasks in an edge-cloud setup impose compute, memory and I/O limitations, such parameters are crucial for scheduling decisions. Moreover, allowing multiple tasks to be placed on a small cluster of hosts could ensure low energy usage (hibernating the ones with no tasks). A host with higher I/O capacity (disk read/write speeds) could allow I/O intensive tasks to be completed quickly and prevent SLA violations. All these parameters are defined for all hosts in a feature vector denoted as $FV_i^{Hosts}$, as shown in Fig. 4a. The tasks in $a_i$ are segregated into two disjoint sets: $n_i$ and $a_{i-1} \setminus l_i$. The former consists of parameters like task CPU, RAM, bandwidth, and disk requirements. The latter also contains the index of the host assigned in the previous interval. The feature vectors of these sets of tasks are denoted as $FV_i^{n_i}$ and $FV_i^{a_{i-1} \setminus l_i}$, as shown in Figs. 4b and 4c respectively. Thus, $State_i$ becomes $(FV_i^{Hosts}, FV_i^{a_{i-1} \setminus l_i}, FV_i^{n_i})$, which is the input of the model.

Fig. 4. Matrix representation of model inputs.
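To make the shape of this input concrete, the following Python sketch assembles the three feature matrices that form $State_i$. The dictionary keys and feature order are assumptions for illustration only; they mirror the host and task parameters listed above rather than the authors' actual code.

    import numpy as np

    # Illustrative only: one row per host/task, columns in an assumed feature order.
    HOST_FEATURES = ["cpu_util", "cpu_cap", "ram_util", "ram_cap", "bw_util", "bw_cap",
                     "disk_util", "disk_cap", "power", "cost_per_time", "mips",
                     "response_time", "num_tasks"]
    NEW_TASK_FEATURES = ["cpu_req", "ram_req", "bw_req", "disk_req"]
    OLD_TASK_FEATURES = NEW_TASK_FEATURES + ["prev_host_index"]

    def build_state(hosts, old_tasks, new_tasks):
        """Return State_i = (FV_Hosts, FV of a_{i-1} minus l_i, FV of n_i) as three matrices."""
        fv_hosts = np.array([[h[f] for f in HOST_FEATURES] for h in hosts])
        fv_old = np.array([[t[f] for f in OLD_TASK_FEATURES] for t in old_tasks])
        fv_new = np.array([[t[f] for f in NEW_TASK_FEATURES] for t in new_tasks])
        return fv_hosts, fv_old, fv_new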
3.2 Output Specification

At the beginning of the interval $SI_i$, the model needs to provide a host assignment for each task in $a_i$ based on the input $State_i$. The output, also denoted as $Action_i$, is a host assignment for each new task $\in n_i$ and a migration decision for the remaining active tasks from the previous interval $\in a_{i-1} \setminus l_i$. This assignment must be valid in terms of the feasibility constraints, such that each task which is migrated must be migratable to the new host (we denote the migratable tasks as $m_i \subseteq a_i$), i.e., it is not already under migration. Moreover, when a host $h$ is allocated to any task $T$, then after allocation $h$ should not get overloaded, i.e., $h$ is suitable for $T$. Thus, we describe $Action_i$ through Equation (2) such that for the interval $SI_i$, $\forall\, T \in n_i \cup m_i,\ \{T\} \leftarrow Action_i(T)$,

$$Action_i = \begin{cases} h \in Hosts & \forall\, t \in n_i \\ h_{new} \in Hosts & \forall\, t \in m_i \text{ if } t \text{ is to be migrated} \end{cases} \qquad (2)$$
$$\text{subject to: } Action_i \text{ is suitable for } t \;\; \forall\, t \in n_i \cup m_i.$$

However, developing a model that provides a constrained output is computationally difficult [33]; hence, we use an alternative, unconstrained definition of the model action and compensate for the constraints in the objective function. In the unconstrained formulation of the model action, the output is a priority list of hosts for each task. Thus, for task $T_j^{a_i}$, we have a list of hosts $[H_{j0}, H_{j1}, \ldots, H_{jn}]$ in decreasing order of allocation preference. For a neural network, the output could be a vector of allocation preferences over the hosts for every task. This means that rather than specifying a single host for each task, the model provides a ranked list of hosts. We denote this unconstrained model action for the policy gradient setup as $Action_i^{PG}$, as shown in Fig. 5.
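The conversion from such a ranked preference list to a feasible assignment, described next and formalized in Equation (3), can be sketched as follows. The helper `is_suitable` stands in for the CSM checks (capacity and migration status); all names are illustrative rather than the paper's API.

    # Hedged sketch: turn per-task ranked host lists into a feasible assignment.
    def to_feasible_action(ranked_hosts, tasks, current_host, migratable, is_suitable):
        action = {}
        for task in tasks:
            if task in current_host and task not in migratable:
                action[task] = current_host[task]        # active task under migration: keep its host
                continue
            for host in ranked_hosts[task]:              # highest allocation preference first
                if is_suitable(host, task):              # e.g., host would not get overloaded
                    action[task] = host
                    break
            else:
                action[task] = current_host.get(task)    # no suitable host found: fall back
        return action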
This unconstrained action cannot be used directly for updating the task allocation to hosts. We need to select the most preferable suitable host for each task, and to migrate only those tasks that are migratable. Converting $Action_i^{PG}$ to $Action_i$ is straightforward, as shown in Equation (3). For $Action_i(T_j^{a_i})$, if $T_j^{a_i} \in a_{i-1} \setminus l_i$ and is not migratable, then it is not migrated. Otherwise, $T_j^{a_i}$ will be allocated to the highest ranked host which is suitable. By the conversion of Equation (3),
cumulative loss starting from the next interval ($CLoss_{i+1}^{PG}$). The recurrent layers are formed using Gated Recurrent Units (GRUs) [42], which model the temporal aspects of the task and host characteristics, including tasks' CPU, RAM and bandwidth requirements and hosts' CPU, RAM and bandwidth capacities. Although the GRU layers help in taking an informed scheduling decision by modeling the temporal characteristics, they increase the training complexity due to the large number of network parameters. This is solved by using skip connections between these layers for faster gradient propagation.
4.2 Pre-Processing and Output Conversion

The input to the model for the interval $SI_i$ is $State_i$, which is a 2-dimensional vector comprising $FV_i^{Hosts}$, $FV_i^{n_i}$ and $FV_i^{a_{i-1} \setminus l_i}$. Among these vectors, the values of all elements of the first two are continuous, but the host index in each row of $FV_i^{a_{i-1} \setminus l_i}$ is a categorical value. Hence, the host indices are converted to a one-hot vector of size $n$ and all feature vectors are concatenated. After this, each element in the concatenated vector is normalized based on the minimum and maximum values of its feature and clipped between [0, 1]. We denote the feature of element $e$ as $f_e$, and the minimum and maximum values for feature $f$ as $min_f$ and $max_f$ respectively. These minimum and maximum values are calculated on a sample dataset using two heuristic-based scheduling policies: Local Regression (LR) for task allocation and Maximum Migration Time (MMT) for task selection, as described in [8]. Then, the feature-wise standardization is done based on Equation (12). Hence

$$e = \begin{cases} 0 & \text{if } max_{f_e} = min_{f_e} \\ \min\!\left(1, \max\!\left(0, \dfrac{e - min_{f_e}}{max_{f_e} - min_{f_e}}\right)\right) & \text{otherwise.} \end{cases} \qquad (12)$$
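A minimal sketch of this pre-processing, assuming the feature bounds have already been collected from the sample dataset, is shown below; it applies the one-hot encoding of the previous-host index and the clipped min-max scaling of Equation (12). The function names are illustrative.

    import numpy as np

    def one_hot(index, n):
        """Encode a categorical host index as a one-hot vector of size n."""
        v = np.zeros(n)
        v[int(index)] = 1.0
        return v

    def scale(value, f_min, f_max):
        """Clipped min-max scaling per Equation (12); constant features map to 0."""
        if f_max == f_min:
            return 0.0
        return float(np.clip((value - f_min) / (f_max - f_min), 0.0, 1.0))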
This pre-processed input is then sent to the R2N2 model, which flattens it and passes it through the Dense layers. The generated output $O$ is converted to $Action_i^{PG}$ by first generating the sorted list of hosts $SortedHosts_i$ in decreasing order of probability in $O_i$ for all $i$. Then, $Action_i^{PG}(T_k^{m_i \cup n_i}) \leftarrow SortedHosts_k\ \forall\, k \in \{1, 2, \ldots, |m_i \cup n_i|\}$.
4.3 Policy Learning

To learn the weights and biases of the R2N2 network, we use the back-propagation algorithm with $Loss_i^{PG}$ as the reward signal. For the current model, we use an adaptive learning rate starting from $10^{-2}$ and decrease it to $1/10$th of its value when the absolute sum of change in the reward for the last ten iterations is less than 0.1. Using $Loss_i^{PG}$ as the reward, we perform Automatic Differentiation [43] to update the network parameters. We accumulate the gradients of the local networks at all edge nodes asynchronously and update the global network parameters periodically, as described in [27]. The gradient accumulation rule after the $i$th scheduling interval is given by Equation (13), similar to the one in [27]. Here $\theta$ denotes the global network parameters and $\theta'$ the local parameters (only one gradient is set because of a single network with two heads). Thus

$$\begin{aligned} d\theta &\leftarrow d\theta - \alpha \nabla_{\theta'} \log\left[\pi(State_i; \theta')\right]\left(Loss_i^{PG} + CLoss_{i+1}^{Pred}\right) \\ d\theta &\leftarrow d\theta + \alpha \nabla_{\theta'} \left(Loss_i^{PG} + CLoss_{i+1}^{Pred} - CLoss_i^{Pred}\right)^2. \end{aligned} \qquad (13)$$
The log term in Equation (13) specifies the direction of change in the parameters, and the $(Loss_i^{PG} + CLoss_{i+1}^{Pred})$ term is the predicted cumulative loss in this episode starting from $State_i$. To minimize this, the gradients are proportional to this quantity and carry a minus sign to reduce the total loss. The second gradient term is the Mean Square Error (MSE) of the predicted cumulative loss with the cumulative loss after a one-step look-ahead. The output $Action_i^{PG}$ is converted to $Action_i$ by the CSM and sent to the RMS every scheduling interval. Thus, for each interval, there is a forward pass of the R2N2 network. For back-propagation, we use an episode size of 12; thus, we save the experience of the previous episode to find and accumulate gradients and update the model parameters after every 12 intervals. For large batch sizes, parameter updates are slower, and for small ones the gradient accumulation is not able to generalize and has high variance. Accordingly, empirical analysis resulted in an optimal episode size of 12. As described in Section 5.1, the experimental setup has a scheduling interval of 5 minutes, and hence back-propagation is performed every 1 hour of simulation time (after 12 intervals).
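In PyTorch-like code, one agent's asynchronous update following Equation (13) might look as follows. This is a hedged sketch under the assumption of a two-headed network (policy and cumulative-loss prediction); variable and function names are illustrative, not the authors' implementation.

    import torch

    def a3c_update(local_net, global_net, global_opt, log_prob, loss_pg, closs_pred, closs_next_pred):
        """One agent's asynchronous update: accumulate local gradients into the global network."""
        if not torch.is_tensor(loss_pg):
            loss_pg = torch.tensor(float(loss_pg))
        target = loss_pg + closs_next_pred.detach()        # Loss_i^PG + CLoss_{i+1}^Pred
        actor_loss = log_prob * target                     # penalize actions that lead to high loss
        critic_loss = (closs_pred - target) ** 2           # one-step look-ahead MSE for the critic
        (actor_loss + critic_loss).backward()

        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp.grad = lp.grad.clone() if lp.grad is not None else None
        global_opt.step()                                  # update shared parameters theta
        global_opt.zero_grad()
        local_net.zero_grad()
        local_net.load_state_dict(global_net.state_dict())  # resume from the shared parameters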
A summary of the model update and scheduling with back-propagation is shown in Algorithm 1. To decide the best possible scheduling decision for each scheduling interval, we iteratively pre-process and send the interval state to the R2N2 model, together with the loss and penalty, to update the network parameters. This allows the model to adapt on-the-fly to the environment, user and application specific requirements.

Algorithm 1. Dynamic Scheduling
Inputs:
1: Number of scheduling intervals N
2: Batch Size B
Begin
3: for interval index i from 1 to N do
4:   if i > 1 and i % B == 0 then
5:     Use Loss_i^PG = Loss_i + Penalty_i in RL Model for back-propagation
6:   end if
7:   Send PREPROCESS(State_i) to RL Model
8:   probabilityMap <- output of RL Model for State_i
9:   (Action_i, Penalty_{i+1}) <- CONSTRAINTSATISFACTIONMODULE(probabilityMap)
10:  Allocate new tasks and migrate existing tasks based on Action_i
11:  Execute tasks in edge-cloud infrastructure for interval SI_i
12: end for
End
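For readability, the control flow of Algorithm 1 can also be rendered in Python roughly as below; `env`, `rl_model` and `csm` are hypothetical stand-ins for the simulated infrastructure, the DRLM network and the Constraint Satisfaction Module, and only the call pattern mirrors the algorithm.

    def dynamic_scheduling(env, rl_model, csm, num_intervals, batch_size=12):
        """Sketch of Algorithm 1: schedule every interval, back-propagate every batch_size intervals."""
        penalty = 0.0
        for i in range(1, num_intervals + 1):
            if i > 1 and i % batch_size == 0:
                # Loss_i^PG = Loss_i + Penalty_i drives the back-propagation step
                rl_model.backpropagate(env.loss(i) + penalty)
            state = env.preprocessed_state(i)          # PREPROCESS(State_i)
            probability_map = rl_model(state)          # forward pass of the R2N2 network
            action, penalty = csm(probability_map)     # (Action_i, Penalty_{i+1})
            env.allocate_and_migrate(action)           # place new tasks, migrate active ones
            env.execute_interval(i)                    # run interval SI_i on the infrastructure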
Complexity Analysis. The complexity of Algorithm 1 depends on multiple steps. The pre-processing of the input state is $O(ab)$, where $a \times b$ is the maximum size of a feature vector among $FV_i^{Hosts}$, $FV_i^{n_i}$ and $FV_i^{a_{i-1} \setminus l_i}$. To generate $Action_i$ and $Penalty_i$, the CSM takes $O(n^2)$ time for $n$ hosts and tasks, based on Equations (4) and (3). As the feature vectors have a higher cardinality than the number of hosts or tasks, $O(ab)$ dominates $O(n^2)$. Therefore, discarding the forward pass and back-propagation (as they are performed on Graphics Processing Units - GPUs [44]), for $N$ scheduling intervals the total time complexity is $O(abN)$.
5 PERFORMANCE EVALUATION

In this section, we describe the experimental setup, evaluation metrics and dataset, and give a detailed analysis of the results, comparing our model with several baseline algorithms.

1. The BitBrain dataset can be downloaded from: http://gwa.ewi.tudelft.nl/datasets/gwa-t-12-bitbrains
2. Microsoft Azure pricing calculator for South-East Australia: https://azure.microsoft.com/en-au/pricing/calculator/
TABLE 2
Configuration of Hosts in the Experiment Set Up

edge node. As per the targeted environment convention, we choose resource-constrained machines at the edge (Intel i3 and Intel i5) and a powerful rack server as the cloud nodes (Intel Xeon). The power consumption averaged over the different SPEC benchmarks [46] for the respective machines is shown in Table 2. However, the power consumption values shown in Table 2 are average values over this specific benchmark suite. The power consumption of hosts also depends on RAM, disk and bandwidth consumption characteristics, which are provided to the model by the underlying CloudSim simulator. In the execution environment, we consider the host capacities (CPU, RAM, network bandwidth, etc.) and the current usage to form the feature vector $FV_i^{Hosts}$ for the $i$th scheduling interval. For the experiments, we keep the testing simulation duration at 1 day, which equals a total of 288 scheduling intervals.

5.2 Evaluation Metrics

To evaluate the efficacy of the proposed A3C-R2N2 based scheduler, we consider the following metrics. Motivated by prior works [4], [16], [35], energy is paramount in resource-constrained edge-cloud environments, and real-time tasks require low response times. Moreover, service level agreements are crucial in time-critical tasks, and a low execution cost is required for budget task execution:

1) Total Energy Consumption, given as $\sum_{h \in Hosts} \int_{t_i}^{t_{i+1}} P_h(t)\,dt$ for the complete simulation duration.
2) Average Response Time, given as $\frac{\sum_{t \in l_{i+1}} \text{Response Time}(t)}{|l_{i+1}|}$.
3) SLA Violations, given as $\frac{\sum_i SLAV_i \cdot |l_{i+1}|}{\sum_i |l_i|}$, where $SLAV_i$ is defined by Equation (9).
4) Total Cost, given as $\sum_i \sum_{h \in Hosts} \int_{t_i}^{t_{i+1}} C_h(t)\,dt$.

Other metrics of importance include: Average Task Completion Time, Total Number of Completed Tasks together with the fraction of tasks that were completed within the expected execution time (based on requested MIPS), Number of Task Migrations in each interval, and Total Migration Time per interval. The task completion time is defined as the sum of the average task scheduling time, the task execution time and the response time of the host on which the task ran in the last scheduling interval.
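Assuming simple per-interval logs of host energy, host cost, completed-task response times and per-interval SLA-violation fractions, these metrics could be computed along the following lines. The record layout is hypothetical and only illustrates the formulas above.

    # Hedged sketch of the four evaluation metrics over per-interval logs:
    #   energy[i][h], cost[i][h] : energy and cost of host h during interval i
    #   response[i]              : response times of tasks leaving at the end of interval i
    #   slav[i]                  : SLA-violation fraction of interval i (Equation (9))
    def total_energy(energy):
        return sum(sum(per_host.values()) for per_host in energy)

    def total_cost(cost):
        return sum(sum(per_host.values()) for per_host in cost)

    def avg_response_time(response):
        times = [t for interval in response for t in interval]
        return sum(times) / len(times) if times else 0.0

    def sla_violations(slav, response):
        leaving = [len(interval) for interval in response]
        total = sum(leaving)
        return sum(s * l for s, l in zip(slav, leaving)) / total if total else 0.0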
5.3 Baseline Algorithms

We evaluate the performance of our proposed algorithm against the following baseline algorithms; the reasons for choosing them are described in Section 6. Multiple heuristics have been proposed in [8] for dynamic scheduling. These are combinations of different sub-heuristics for different sub-problems, such as host overload detection and task/VM selection, and we have selected the best of these heuristics. All of these variants use the Best Fit Decreasing (BFD) heuristic to identify the target host. Furthermore, we also compare our results to two types of standard RL approaches that are widely used in the literature.

- LR-MMT: schedules workloads dynamically based on Local Regression (LR) and Minimum Migration Time (MMT) heuristics for overload detection and task selection, respectively (details in [8]).
- MAD-MC: schedules workloads dynamically based on Median Absolute Deviation (MAD) and Maximum Correlation Policy (MC) heuristics for overload detection and task selection, respectively (details in [8]).
- DDQN: standard Deep Q-Learning based RL approach; many works have used this technique in the literature, including [16], [25], [26]. We implement the optimized Double DQN technique.
- DRL (REINFORCE): policy gradient based REINFORCE method with a fully connected neural network [28].

It is important to note that we implement these algorithms adapted to our problem and compare the results. The RL models used for comparison with our proposed model use a state representation identical to the $State_i$ defined in Section 3.1 for a fair comparison. An action is a change from one state to another in the state space. As in [24], the DQN network is updated using the Bellman Equation [47] with the reward defined as $Loss_i^{PG}$. The REINFORCE method is implemented without asynchronous updates or a recurrent network.

5.4 Analysis of Results

In this subsection, we provide the experimental results using the experimental setup and the dataset described in Section 5.1. We also discuss and compare our results based on the evaluation metrics specified in Section 5.2. We first analyze the sensitivity of the hyper-parameters $(\alpha, \beta, \gamma, \delta, \epsilon)$ on model learning and how it affects the different metrics. We then analyze the variation of scheduling decisions for different hyper-parameter values and show how the combined optimization of different evaluation metrics provides better results. We also compare the fraction of scheduling time to total execution time when varying the number of layers of the R2N2 network. Based on the above analysis, we find the optimum R2N2 network and hyper-parameter values to compare with the baseline algorithms described in Section 5.3. All model learning is done for 10 days of simulation time and testing is done for 1 day of simulation time using a disjoint set of workloads from the dataset.

5.4.1 Sensitivity Analysis of Hyper-Parameters

We first provide experimental results in Fig. 9 for different hyper-parameter values and show how changing the loss
function to learn only one of the metrics of interest varies the learned network to give different values of the evaluation metrics; these experiments were carried out for a single day of simulation duration. To visualize the output probability map from the R2N2 network, we display it using a color map to depict the probabilities (0 to 1) of allocating tasks to hosts, as described in Section 4.2.

When $\alpha = 1$ (rest = 0), the R2N2 network solely tries to optimize the average energy consumption, and hence we call it the Energy Minimizing Network (EMN). The total energy consumed across the simulation duration is least for this network, as shown in Fig. 9a. As low energy devices (edge nodes) consume the least energy and also have the least cost, energy is highly correlated to cost, and hence the Cost Minimizing Network (CMN, $\delta = 1$) also has very low total energy consumption. As shown in Fig. 10, for the same $State_i$, the probability map, and hence the allocation, is similar for both networks. Similarly, we can also see in Fig. 9d that CMN has the least cost and the next least cost is achieved by EMN.

Fig. 10. Probability map for EMN and CMN showing similarity and positive correlation.

The graph in Fig. 9b shows that the Response Time Minimizing Network (RTMN, $\beta = 1$) has the least average response time and tries to place most of the tasks on edge nodes, as also shown in Fig. 11a. Moreover, this network does not differentiate among the edge nodes in terms of their CPU loads, because all edge nodes have the same response time, and hence it gives almost the same probability to every edge node for each task. The SLA Violation Minimizing Network (SLAVMN, $\epsilon = 1$) also has a low response time, as the number of SLA violations is directly related to the response time of tasks. However, SLA violations also depend on the completion time of tasks, and as the average task completion time of RTMN is very high, the SLA violations of this network are much higher than those of the other networks, as shown in Fig. 9c. The fraction of SLA violations is least for SLAVMN and the next least is for the Migration Time Minimizing Network (MTMN, $\gamma = 1$). The SLAVMN network also sends tasks to edge nodes like RTMN, but it also considers task execution time and CPU loads to distribute tasks more evenly, as shown in Fig. 11b.

Fig. 11. Probability map for RTMN and SLAVMN showing that the former does not distinguish among edge nodes but SLAVMN does.

When only the average migration time is being optimized, the average task completion time is minimum, as shown in Fig. 9e. However, the SLA violation is not minimum, as this network does not try to minimize the response time of tasks, as shown in Fig. 9b. Moreover, the number of completed tasks is highest for this network, as shown in Fig. 9f. Still, the fraction of tasks completed within the expected time is highest for SLAVMN. Figs. 9g and 9h show that the number of task migrations and the migration time are least for MTMN. Also, as compared in Fig. 12, the number of migrations for a sample of 30 initial tasks is 7 for EMN and 0 for MTMN.

Fig. 12. Probability maps showing that MTMN has fewer migrations than EMN.

Optimizing each of the evaluation metrics independently shows that the R2N2 based network can adapt and update
its parameters to learn the dependence among tasks and hosts so as to reduce the metric of interest, which may be energy, response time, etc. However, for the optimum network, we use a combination of all metrics. This combined optimization leads to a much lower value of the loss and a much better network. This is because optimizing along only one variable might reach a local optimum, and since the loss is a highly non-linear function of the hyper-parameter space, combined optimization leads to a much better network [48]. Based on empirical evaluation of each combination and block coordinate descent [49] for minimizing $Loss$, the optimum values of the hyper-parameters are given by Equation (14). Thus,

$$(\alpha, \beta, \gamma, \delta, \epsilon) = (0.4, 0.16, 0.174, 0.135, 0.19). \qquad (14)$$
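These weights apply to the adaptive loss of Section 4, which combines the normalized per-interval metrics (energy, response time, migration time, cost and SLA violations, matching the roles of $\alpha$, $\beta$, $\gamma$, $\delta$ and $\epsilon$ observed above). The exact metric definitions are not reproduced in this excerpt, so the sketch below only illustrates how such a weighted combination would be evaluated; the names are assumptions.

    # Illustrative weighted loss under the hyper-parameter values of Equation (14).
    WEIGHTS = {"energy": 0.4, "response_time": 0.16, "migration_time": 0.174,
               "cost": 0.135, "sla_violation": 0.19}

    def combined_loss(metrics, weights=WEIGHTS):
        """metrics: dict of normalized per-interval metric values keyed like WEIGHTS."""
        return sum(weights[name] * metrics[name] for name in weights)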
5.4.2 Sensitivity Analysis of the Number of Layers

Now that we have the optimum values of the hyper-parameters, we analyze the scheduling overhead with respect to the number of recurrent layers of the R2N2 network. The scheduling overhead is calculated as the ratio of the time taken for scheduling to the total execution duration in terms of simulation time. As shown in Fig. 13, the value of the loss function decreases with an increase in the number of layers of the neural network. This is expected because as the number of layers increases, so does the number of parameters, and thus the ability of the network to fit more complex functions becomes better. The scheduling overhead depends on the system on which the simulation is run; for the current experiments, the system used had an Intel i7-7700K CPU and an Nvidia GTX 1070 GPU (8 GB graphics RAM). As shown in the figure, there is an inflection point at 3 recurrent layers because an R2N2 network with 4 or more such layers could not fit in the GPU graphics RAM. Based on the available simulation infrastructure, for the comparison with baseline algorithms, we use the R2N2 network with 3 recurrent layers and the hyper-parameter values given by Equation (14).

Fig. 13. Loss and scheduling overhead with number of recurrent layers.
5.4.3 Scalability Analysis

We now show how the A3C-R2N2 model scales with the number of actor agent hosts in the setup. As discussed in Section 2, we have multiple edge-cloud nodes in the environment which run the policy learning described in Section 4.3. However, the number of such agents affects the time needed to train the Actor-Critic network. We define the time taken by $n$ agents to reduce the loss value to 2.5 as $Time_n$. The speedup corresponding to a system with $n$ actors is then calculated as $S_n = Time_1 / Time_n$. Moreover, the efficiency of a system with $n$ agents is defined as $E_n = S_n / n$ [50]. Fig. 14 shows how the speedup and efficiency of the model vary with the number of agent nodes. As shown, the speedup increases with $n$; however, efficiency reduces as $n$ increases in a piece-wise linear fashion. There is a sudden drop in efficiency when the number of agents is increased from 1. This is because of the communication delay between agents, which leads to slower model updates. The drop increases again after 20 hosts due to the addition of GPU-less agents beyond 20 hosts. Thus, having agents run only on CPUs significantly reduces the efficiency of the proposed architecture. For our experiments, we keep all active edge-cloud hosts (100 in our case) as actor agents in the A3C learning for faster convergence and worst-case overhead comparison. In such a case, the speedup is 34.3 and the efficiency is 0.37.

5.4.4 Evaluation With Baseline Algorithms

Having obtained the empirically best set of hyper-parameter values and number of layers, and having discussed the scalability aspects of the model, we now compare our policy gradient based reinforcement learning model with the baseline algorithms described in Section 5.3. The graphs in Fig. 16 provide results for 1 day of simulation time with a scheduling interval of 5 minutes on the Bitbrain dataset.

Fig. 16a shows that among the baseline algorithms, DDQN and REINFORCE have the least energy consumption, but the A3C-R2N2 model has even lower energy consumption, which is 14.4 and 15.8 percent lower than REINFORCE and DDQN respectively. The main reason behind this is that the A3C-R2N2 network is able to adapt to the task workload behavior quickly. This allows a resource hungry task to be scheduled to a powerful machine. Moreover, the presence of the Average Energy Consumption (AEC) metric of all the edge-cloud nodes within the loss function enforces the model to take energy efficient scheduling decisions. It results in the minimum number of active hosts, with the remaining hosts in stand-by mode to conserve energy (utilizing this feature of
Fig. 16. Comparison of deep learning model with prior heuristic-based works.
TABLE 3
Comparison of Related Works With Different Parameters

wherein we compare our approach with DDQN [53] and DRL (REINFORCE) [54]. All these baseline methods are adapted to be used in the proposed edge-cloud setup. However, in Edge-Cloud environments the infrastructure is shared among a diverse set of users requiring different QoS for their respective applications. In such a case, the scheduling algorithm must be adaptive and able to tune automatically to application requirements. Our proposed framework can be optimized to achieve better efficiency with respect to different QoS parameters, as shown in Sections 4 and 5. Moreover, the Edge-Cloud environment brings heterogeneous complexity and stochastic behavior of workloads, which need to be modeled within the scheduling problem. We model these parameters efficiently in our model.
7 CONCLUSIONS AND FUTURE WORK

Efficiently utilizing edge and cloud resources to provide better QoS and response times in stochastic environments with dynamic workloads is a complex problem. This problem is complicated further by the heterogeneity of multi-layer resources and the difference in response times of devices in Edge-Cloud datacenters. Integrated usage of cloud and edge is a non-trivial problem, as resources and network have completely different characteristics when users or edge nodes are mobile. Prior work not only fails to consider these differences between edge and cloud devices but also ignores the effect of stochastic workloads and dynamic environments. This work aims to provide an end-to-end real-time task scheduler for integrated edge and cloud computing environments. We propose a novel A3C-R2N2 based scheduler that can consider all important parameters of tasks and hosts to make scheduling decisions that provide better performance. Furthermore, A3C allows the scheduler to quickly adapt to dynamically changing environments using asynchronous updates, and R2N2 is able to quickly learn network weights while also exploiting temporal task and workload behaviours. Extensive simulation experiments using iFogSim and CloudSim on the real-world Bitbrain dataset show that our approach can reduce energy consumption by 14.4 percent, response time by 7.74 percent, SLA violations by 31.9 percent and cost by 4.64 percent. Moreover, our model has a negligible scheduling overhead of 0.002 percent compared to the existing baselines, which makes it a better alternative for dynamic task scheduling in stochastic environments.
by 4.64 percent. Moreover, our model has a negligible sched- pp. 1397–1420, 2012.
uling overhead of 0.002 percent compared to the existing [9] X.-Q. Pham, N. D. Man, N. D. T. Tri, N. Q. Thai, and E.-N. Huh,
baseline which makes it a better alternative for dynamic task “A cost-and performance-effective approach for task scheduling
based on collaboration between cloud and fog computing,” Int. J.
scheduling in stochastic environments. Distrib. Sensor Netw., vol. 13, no. 11, pp. 1–16, 2017.
As part of future work, we plan to implement this model [10] A. Brogi and S. Forti, “QoS-aware deployment of IoT applications
in real edge-cloud environments. Implementation in real through the fog,” IEEE Internet Things J., vol. 4, no. 5, pp. 1185–1192,
Oct. 2017.
environments would require constant profiling CPU, RAM [11] T. Choudhari, M. Moh, and T.-S. Moh, “Prioritized task
and disk requirements of new tasks. This can be done using scheduling in fog computing,” in Proc. ACMSE Conf., 2018,
exponential averaging of requirement values in the current pp. 22:1–22:8.
[12] X.-Q. Pham and E.-N. Huh, "Towards task scheduling in a cloud-fog computing system," in Proc. 18th Asia-Pacific Netw. Operations Manage. Symp., 2016, pp. 1–4.
[13] D. Jeff, "ML for system, system for ML," keynote talk in Workshop on ML for Systems, NIPS, 2018. [Online]. Available: http://mlforsystems.org/
[14] S. Yi, C. Li, and Q. Li, "A survey of fog computing: Concepts, applications and issues," in Proc. Workshop Mobile Big Data, 2015, pp. 37–42.
[15] G. Fox et al., "Learning everywhere: Pervasive machine learning for effective high-performance computation," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, 2019, pp. 422–429.
[16] D. Basu, X. Wang, Y. Hong, H. Chen, and S. Bressan, "Learn-as-you-go with Megh: Efficient live migration of virtual machines," IEEE Trans. Parallel Distrib. Syst., vol. 30, no. 8, pp. 1786–1801, Aug. 2019.
[17] H. Li, K. Ota, and M. Dong, "Learning IoT in edge: Deep learning for the Internet of Things with edge computing," IEEE Netw., vol. 32, no. 1, pp. 96–101, Jan./Feb. 2018.
[18] M. Xu, S. Alamro, T. Lan, and S. Subramaniam, "LASER: A deep learning approach for speculative execution and replication of deadline-critical jobs in cloud," in Proc. 26th Int. Conf. Comput. Commun. Netw., 2017, pp. 1–8.
[19] Q. Zhang, M. Lin, L. T. Yang, Z. Chen, S. U. Khan, and P. Li, "A double deep Q-learning model for energy-efficient edge scheduling," IEEE Trans. Services Comput., vol. 12, no. 5, pp. 739–749, Sep./Oct. 2019.
[20] R. S. Sutton et al., Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1998.
[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[22] M. Bowling, "Convergence problems of general-sum multiagent reinforcement learning," in Proc. 17th Int. Conf. Mach. Learn., 2000, pp. 89–94.
[23] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.
[24] M. Cheng, J. Li, and S. Nazarian, "DRL-cloud: Deep reinforcement learning-based resource provisioning and task scheduling for cloud service providers," in Proc. 23rd Asia South Pacific Des. Autom. Conf., 2018, pp. 129–134.
[25] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, "Resource management with deep reinforcement learning," in Proc. 15th ACM Workshop Hot Topics Netw., 2016, pp. 50–56.
[26] Q. Zhang, M. Lin, L. T. Yang, Z. Chen, and P. Li, "Energy-efficient scheduling for real-time systems based on deep Q-learning model," IEEE Trans. Sustain. Comput., vol. 4, no. 1, pp. 132–141, Jan.–Mar. 2019.
[27] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[28] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, "Resource management with deep reinforcement learning," in Proc. 15th ACM Workshop Hot Topics Netw., 2016, pp. 50–56.
[29] B. Yue, J. Fu, and J. Liang, "Residual recurrent neural networks for learning sequential representations," Information, vol. 9, no. 3, 2018, Art. no. 56.
[30] H. Gupta, A. Vahid Dastjerdi, S. K. Ghosh, and R. Buyya, "iFogSim: A toolkit for modeling and simulation of resource management techniques in the internet of things, edge and fog computing environments," Softw., Pract. Experience, vol. 47, no. 9, pp. 1275–1296, 2017.
[31] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. De Rose, and R. Buyya, "CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms," Softw., Pract. Experience, vol. 41, no. 1, pp. 23–50, 2011.
[32] Q. Qi and Z. Ma, "Vehicular edge computing via deep reinforcement learning," 2018, arXiv:1901.04290.
[33] D. Pathak, P. Krahenbuhl, and T. Darrell, "Constrained convolutional neural networks for weakly supervised segmentation," in Proc. Int. Conf. Comput. Vis., 2015, pp. 1796–1804.
[34] L. Roselli et al., "Review of the present technologies concurrently contributing to the implementation of the internet of things (IoT) paradigm: RFID, green electronics, WPT and energy harvesting," in Proc. Top. Conf. Wireless Sensors Sensor Netw., 2015, pp. 1–3.
[35] S. Tuli et al., "HealthFog: An ensemble deep learning based smart healthcare system for automatic diagnosis of heart diseases in integrated IoT and fog computing environments," Future Gener. Comput. Syst., vol. 104, pp. 187–200, 2020.
[36] S. Sarkar and S. Misra, "Theoretical modelling of fog computing: A green computing paradigm to support IoT applications," IET Netw., vol. 5, no. 2, pp. 23–29, 2016.
[37] Z. Abbas and W. Yoon, "A survey on energy conserving mechanisms for the Internet of Things: Wireless networking aspects," Sensors, vol. 15, no. 10, pp. 24818–24847, 2015.
[38] P. Kamalinejad, C. Mahapatra, Z. Sheng, S. Mirabbasi, V. C. Leung, and Y. L. Guan, "Wireless energy harvesting for the Internet of Things," IEEE Commun. Mag., vol. 53, no. 6, pp. 102–108, Jun. 2015.
[39] A. M. Rahmani et al., "Exploiting smart e-Health gateways at the edge of healthcare Internet-of-Things: A fog computing approach," Future Gener. Comput. Syst., vol. 78, pp. 641–658, 2018.
[40] J. Achiam, D. Held, A. Tamar, and P. Abbeel, "Constrained policy optimization," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 22–31.
[41] R. Doshi, K.-W. Hung, L. Liang, and K.-H. Chiu, "Deep learning neural networks optimization using hardware cost penalty," in Proc. IEEE Int. Symp. Circuits Syst., 2016, pp. 1954–1957.
[42] R. Dey and F. M. Salem, "Gate-variants of gated recurrent unit (GRU) neural networks," in Proc. IEEE 60th Int. Midwest Symp. Circuits Syst., 2017, pp. 1597–1600.
[43] A. Paszke et al., "Automatic differentiation in PyTorch," NIPS-W, 2017.
[44] B. Li et al., "Large scale recurrent neural network on GPU," in Proc. Int. Joint Conf. Neural Netw., 2014, pp. 4062–4069.
[45] S. Shen, V. van Beek, and A. Iosup, "Statistical characterization of business-critical workloads hosted in cloud datacenters," in Proc. 15th IEEE/ACM Int. Symp. Cluster Cloud Grid Comput., 2015, pp. 465–474.
[46] SPEC, "Standard performance evaluation corporation," 2018. [Online]. Available: https://www.spec.org/benchmarks.html
[47] C. J. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3/4, pp. 279–292, 1992.
[48] K. Miettinen, Nonlinear Multiobjective Optimization, vol. 12. Berlin, Germany: Springer, 2012.
[49] S. J. Wright, "Coordinate descent algorithms," Math. Program., vol. 151, no. 1, pp. 3–34, 2015.
[50] D. L. Eager, J. Zahorjan, and E. D. Lazowska, "Speedup versus efficiency in parallel systems," IEEE Trans. Comput., vol. 38, no. 3, pp. 408–423, Mar. 1989.
[51] D.-M. Bui, Y. Yoon, E.-N. Huh, S. Jun, and S. Lee, "Energy efficiency for cloud computing system based on predictive optimization," J. Parallel Distrib. Comput., vol. 102, pp. 103–114, 2017.
[52] L. Huang, S. Bi, and Y. J. Zhang, "Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks," IEEE Trans. Mobile Comput., early access, Jul. 24, 2019, doi: 10.1109/TMC.2019.2928811.
[53] F. Li and B. Hu, "DeepJS: Job scheduling based on deep reinforcement learning in cloud data center," in Proc. 4th Int. Conf. Big Data Comput., 2019, pp. 48–53.
[54] G. Rjoub, J. Bentahar, O. A. Wahab, and A. S. Bataineh, "Deep and reinforcement learning for automated task scheduling in large-scale cloud computing systems," Concurrency Comput., Pract. Experience, e5919, 2020. [Online]. Available: https://doi.org/10.1002/cpe.5919
[55] Z. Xiong, Y. Zhang, D. Niyato, R. Deng, P. Wang, and L.-C. Wang, "Deep reinforcement learning for mobile 5G and beyond: Fundamentals, applications, and challenges," IEEE Veh. Technol. Mag., vol. 14, no. 2, pp. 44–52, Jun. 2019.
[56] L. Espeholt et al., "IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures," in Proc. Int. Conf. Mach. Learn., 2018, pp. 1407–1416.