(2019) (23) - Performability Evaluation and Optimization of Workflow Applications in Cloud Environments
https://doi.org/10.1007/s10723-019-09476-0
Received: 13 April 2018 / Accepted: 6 January 2019 / Published online: 17 January 2019
© Springer Nature B.V. 2019
Abstract Given the characteristics of dynamic provisioning and the illusion of unlimited resources, clouds are becoming a popular alternative for running scientific workflows. In a cloud system for processing workflow applications, the system's performance is heavily influenced by two factors: the scheduling strategy and the failure of components. Failures in a cloud system can simultaneously affect several users and reduce the number of available computing resources. A bad scheduling strategy can increase the expected makespan and the idle time of physical machines. In this paper, we propose an optimization method for the scheduling of scientific workflows on cloud systems. The method comprises a meta-heuristic algorithm coupled to a performability model that provides the fitness values of explored solutions. To represent the combined effect of scheduling and component failures, we adopted discrete event simulation for the performability model. Experimental results show the effectiveness of the hybrid simulation-optimization approach for optimizing the number of allocated virtual machines and the scheduling of tasks regarding performability.

Keywords Scientific workflows · Performability · Stochastic Petri nets · Optimization

D. Oliveira · N. Rosa · P. Maciel
Federal University of Pernambuco, Informatics Center, Recife, Brazil
e-mail: [email protected]

Nelson Rosa
e-mail: [email protected]

Paulo Maciel
e-mail: [email protected]

A. Brinkmann
Data Processing Center (ZDV), Johannes Gutenberg University, Mainz, Germany
e-mail: [email protected]

1 Introduction

In the past decades, computers became an invaluable asset for scientists in many fields of human knowledge. Simulation models are useful when experiments in the real world are too difficult or costly to execute, or when the phenomenon of interest is impossible to reproduce (for instance, in studies about the origin of the universe). Such models are often computationally intensive and require an execution environment composed of many processing units. Many computational scientific applications can be expressed as workflows, i.e., a set of subtasks with data and control-flow dependencies between them. In such applications, the scheduling of tasks on the processing units plays a vital role in the system's performance, but finding an optimal schedule is an NP-Hard problem [8].

Nowadays, cloud computing has been attracting attention as a platform for running scientific applications [25, 27, 55]. The pay-per-use model eliminates the need for upfront investment in a cluster/supercomputer. Moreover, cloud users do not need
to worry about managing the underlying hardware infrastructure. While this model makes things more convenient for the user, this task becomes a severe issue for cloud providers, who need to guarantee the reliability and performance levels specified by a Service Level Agreement (SLA).

The complexity of cloud infrastructures (i.e., the large number of hardware and software components and their interdependence relationships) raises the need for performance evaluation methods that consider the failure of components. A failure in a single physical server can bring down several virtual machines of different users. Likewise, the unavailability of the cloud manager (the head node of a cloud infrastructure) can provoke the unavailability of the whole system. A classic performance study (one that disregards reliability aspects) may not give accurate results, since failures of physical and virtual machines can increase waiting times and decrease throughput [43]. Assessing the performance degradation caused by failures in a cloud environment via measurement-based evaluation is often prohibitive in practice. Even using fault injectors to provoke failures in the system's components, the associated costs and time constraints for such experiments are high, especially when testing multiple configurations in a sensitivity analysis (i.e., a systematic study of the impact of system parameters on the system's performance/reliability [24]). For that reason, state-space based models (e.g., Markov chains [7], Stochastic Petri Nets (SPN) [38], Stochastic Automata Networks [42]) are the most employed technique for performability evaluation of cloud and grid systems [19, 43, 45, 57]. Besides the occurrence of failures, inefficient scheduling strategies can harm the overall system performance by increasing the job makespan and reducing the utilization of processing units.

Evaluating the effects of both scheduling and hardware/virtual machine failures on the performability of workflow applications in cloud environments is a challenging task for state-space based models due to the high number of states to be considered. Moreover, the exponential distribution may not be a good fit for computation times in workflows. Given this intrinsic limitation of state-space models, many existing research efforts towards the modeling of cloud applications employ discrete event simulators. CloudSim [13] is the most widely adopted simulation framework in the cloud computing literature. Thanks to its adaptable architecture, many extensions were proposed to address aspects not originally implemented by CloudSim. Some of the extensions feature auto-scaling [11], federated clouds [30], fault tolerance mechanisms [1, 16, 65], and workflow applications [10, 14, 35]. Nevertheless, covering multiple real-world characteristics of workflow applications in the same cohesive model is still a gap in the literature. Such characteristics include non-determinism, multi-tenancy, and the combined effect of hardware and virtual machine failures.

In this work, we propose an optimization method for the scheduling of scientific workflows running in a cloud environment. The throughput of workflow jobs is the problem's objective function. Each request processed by the cloud system is defined by a graph of subtasks with precedence constraints between them. A single job demands the provisioning of a certain number of virtual machines for running the subtasks in parallel. The model used to compute the objective function can measure the impact of inefficient scheduling and failures of components on the system's throughput. Given the presence of stochastic components in the proposed model, our method applies discrete-event simulation for evaluating the objective function. We developed a Stochastic Petri net generator algorithm for creating performance and reliability models of cloud workflow applications. Our experimental results demonstrate the effectiveness of the hybrid simulation/optimization approach for optimizing the considered objective function. We also present a sensitivity analysis study of the effects of hardware and virtual machine failures on system throughput.

This work is structured as follows. Section 2 covers the theoretical background and Section 3 presents related work. Section 4 describes the proposed optimization method and Section 5 presents the performability model used for the objective function. In Section 6, we show the evaluation results for the proposed method. Finally, Section 7 makes some final considerations and points out further research directions.
2.1 Simulation/Optimization Hybrid Heuristics

Combinatorial optimization problems (COP) arise in many areas of human activity and computer science as well. A single-objective COP can be described in terms of [6]:

– a set of variables X = {x1, x2, ..., xn}, where xi belongs to a domain Di;
– a set of constraints c1, c2, ..., cm;
– an objective function f : D1 × D2 × ... × Dn → IR.

The optimization procedure should give a solution s from the state space S = D1 × D2 × ... × Dn that either minimizes or maximizes the objective function f(s) and satisfies all constraints.

The characteristics of the objective function and constraints define the approach used to solve the problem. If the objective function and constraints have linear relationships, the underlying problem can be solved by the efficient Simplex algorithm [39]. However, since many essential COP problems are NP-Hard, a conventional approach to solving them is using approximate algorithms [4]. They can find good enough - but not necessarily optimal - solutions from a potentially large solution space. Metaheuristics are approximate algorithms not tied to any particular problem/domain and can be adapted for solving many different combinatorial problems [53].

COP problems with stochastic components in either the objective function or the constraints can adopt simulation models to represent the random behavior. Hybrid Simulation-Optimization (Sim-Opt) methods deal with the issues involved in using simulation models in conjunction with optimization algorithms. Swisher et al. [51] define Sim-Opt methods as "a structured approach to determine optimal settings for input parameters (i.e., the best system design), where optimality is measured by a (steady-state or transient) function of output variables associated with a simulation model".

A simulation-optimization problem can be viewed as the selection of the best design among a (potentially large) set of possible designs with respect to some output response variable given by a simulation model. A random distribution defines the response variable for each element of this set. The optimization procedure selects the design that corresponds to the highest or lowest expected value of this distribution by using a sample for each design. Swisher et al. [51] provide a survey of different Sim-Opt approaches and present guidelines to help the selection of the most suited approach given the particular characteristics of the problem (e.g., all designs have the same variance, the variance is known/unknown, and so on).

Using simulation instead of deterministic models makes it possible to avoid the simplifications needed by deterministic models. This expressiveness, however, has a drawback: the time to solve a simulation model can be significantly prolonged. This fact is even more problematic in simulations used in conjunction with meta-heuristic algorithms, which need to compute the objective function of a large number of solutions. One alternative is to use a surrogate model [44], a simplification of a more complex simulation model that can be evaluated more quickly. A typical approach is to use an Artificial Neural Network (ANN) as a surrogate model [21], along with a simulation model to train the network. The training phase is computationally intensive, but once completed, it generates results very quickly.

2.2 Scheduling of Scientific Workflows on Cloud Systems

A scientific workflow consists of a set of computing- and IO-intensive tasks with precedence constraints between them. Scientific workflows can be represented by a directed acyclic graph (DAG). A DAG G is defined by a tuple {T, E}, where T = {T1, T2, ..., Tn} is the set of tasks and E = {e1, e2, ..., em} is the set of precedence constraints. Each tuple ei = (Ta, Tb) denotes that task Tb starts to execute after Ta finishes and sends some input data to Tb. The node weights are the computing times, and the edge weights are the communication times, i.e., the time for sending the results needed by a dependent task running on a different processing node.

A scheduler should map tasks/jobs to a set of processors according to some predefined goals such as utilization of resources, makespan (the total length of a schedule), throughput, meeting of deadlines, etc. A particular scheduling of tasks for some DAG can be defined by a mapping of ordered task lists to the processing units. A scheduling algorithm can either require the number of processing units as input or try to find an optimal/near-optimal number of processing units in conjunction with the mapping of tasks. The latter case is harder since it leads to an increased
search space. Additionally, increasing the number of processors can shorten the makespan of an individual job, but it may increase the idle time of processors due to the precedence constraints between tasks [48].

Since many practical scheduling problems are either NP-Hard or NP-Complete, a lot of effort is necessary to apply and adapt meta-heuristic algorithms to schedule tasks in cloud computing environments. The following list summarizes the major contributions made by the literature concerning different aspects of cloud workflow scheduling:

– Devising heuristic algorithms able to provide near-optimal solutions under certain constraints [32];
– Adapting nature-inspired and evolutionary algorithms to this problem, such as genetic algorithms [22, 60, 64], honey bee colony [5, 31], ant colony [15, 52], and fish swarm [61];
– Dealing with heterogeneous systems (processors with different computing power) [41];
– Dealing with conflicting aspects of a multi-cost objective function (e.g., energy versus makespan) [37].

3 Related Work

3.1 Performability Modeling of Cloud and Grid Environments

Performability is the study of system performance when subjected to the effect of failures on its subcomponents [36]. The performance of a system is said to be degradable if failure events may affect it negatively. For instance, a mesh network of routers can tolerate a certain number of failures, but the overall performance will be affected as some routers may be subject to overheads. Similarly, failures of worker nodes in cloud and grid environments can diminish the number of available processing resources, thereby increasing queueing times and decreasing the throughput of jobs.

Due to the large number of components of cloud and grid environments, we can expect a significant failure rate even if the mean time to failure of individual components is high. Thus, neglecting the impact of failures in performance studies of such systems can lead to misleading results. State-space based modeling (Stochastic Petri Nets and Markov Chains) is the most adopted method for joint performance/availability evaluation of cloud and grid systems. Dealing with state-space explosion is a recurrent problem handled by every work in this category.

Ramakrishnan and Reed [46] propose a qualitative framework for the performability evaluation of scientific workflows running on grid systems. The framework encompasses a Markov Reward Chain model [7] and simulations calibrated with data from real grid applications. Xia et al. [58] describe a queuing network model for evaluating the estimated service time and request rejection probability of an Infrastructure-as-a-Service cloud. This model represents features such as request handling, job creation, job execution, job rejection due to insufficient queue capacity, and failure and repair events of physical machines configured in hot/warm standby mode. Ever et al. [18] propose a set of equations obtained from queuing theory for evaluating the performability of clouds with large numbers of servers. Since the underlying state-space model does not need to be generated, this approach can represent a large number of servers and simultaneous requests.

A strategy to avoid state-space explosion is to adopt small models rather than one big monolithic model. For combining the results of the submodels, iteration methods can be used [34]. Ghosh [19] uses interacting homogeneous-time Markov chains to perform end-to-end performability analysis of cloud services. The proposed model is used to evaluate two essential metrics: service availability and response time. Raei et al. [45] developed Stochastic Reward Net models representing a public cloud and a cloudlet providing virtual machines for mobile applications. To avoid state-space explosion when modeling both performance and availability aspects of the considered system, the authors divided the public cloud and cloudlet parts into two separate models and used the fixed-point iteration method to obtain a joint result.

The performability models cited in this subsection are based on Markov Chains [7, 46], Stochastic Petri nets (with Markov chain generation) [19, 34, 45], and queuing theory [18, 58]. By contrast, our work adopts a discrete simulation approach based on Stochastic Petri net components and automatic generation of models. The advantage of our model over the works mentioned above is the ability to model DAGs as job requests and the relationship between VM and hardware failures. Incorporating these features into
state-space based models would lead to a state-space explosion problem. Also, using the exponential distribution to represent job times can introduce distortions when modeling the makespan of a stochastic DAG (as we demonstrate in Section 6.2.2).

3.2 Simulation of Workflow Execution on Cloud Environments

Simulation is a commonly used approach for evaluating the performance of load-balancing algorithms, allocation policies, and scheduling strategies in cloud systems considering dynamic workload patterns. CloudSim [13] is the most adopted cloud simulation software in the literature. The CloudSim simulator allows the representation of data-center infrastructures, VM allocation policies, user-level workloads, and coordination between multiple cloud environments through a cloud broker service. After being released as open source software, CloudSim was extended in many different ways by the research community. Fault tolerance capabilities were introduced in [65]. FederatedCloudSim [30] extended CloudSim to represent SLA policies in federated clouds. FailureSim [16] introduced failure prediction of cloud nodes based on ANNs. Performance and usage levels (bandwidth, number of tasks running, the quantity of available millions of instructions per second per node) are used as predictors for training the network. Alwabel et al. proposed DesktopCloudSim [1], a CloudSim extension with a layer of failure injection for the physical nodes. Like our work, DesktopCloudSim allows the investigation of the effect of failure events on system throughput.

CloudSim does not offer, by default, classes for representing workflows modeled as DAGs. Given the importance of this application category, some extensions to CloudSim were proposed for representing workflows. WorkflowSim [14] is a CloudSim extension that includes support for workflow representation and management. It also provides task aggregation capabilities and a fault generator at job/task level. It can generate recoverable transient failures that can be handled by task re-execution, and permanent job failures that cannot be recovered. DynamicCloudSim [10] is a CloudSim-based simulator that includes workflow execution considering VM inhomogeneity and failure of tasks at runtime. Malawski et al. [35] developed a cloud workflow simulator for evaluating task scheduling and resource provisioning algorithms for optimizing the execution of workflow ensembles under deadline constraints in IaaS clouds. ElasticCloudSim [11] is a CloudSim extension for evaluating workflow applications which supports auto-scaling capabilities and considers non-deterministic (stochastic) workflows.

Our work differs from WorkflowSim and DynamicCloudSim by modeling hardware and virtual machine failures instead of representing transient/permanent failures of tasks. FailureSim is able to model hardware failures, and DesktopCloudSim can represent both hardware and VM failures. However, DesktopCloudSim and FailureSim do not target workflow applications. We opted not to create another CloudSim extension, as the employed SPN-based simulator presents some advantages. The proposed SPN models can be used separately for obtaining metrics other than performability (e.g., availability, reliability, and expected makespan). Using this simulation environment also allows us to reuse existing SPN models from the literature for representing reliability and performance aspects of our system.

3.3 Cloud Workflow Optimization

The execution of workflow applications on clouds brings the need for new modeling strategies, scheduling algorithms, and optimization metrics. The reason for this need is the particular aspects of cloud systems when contrasted to traditional grid/cluster environments. Kliazovich et al. [29] demonstrated how existing workflow models fail to address the communication patterns typically found in cloud workflow applications. They proposed CA-DAG (Communication-Aware DAG), a workflow model which represents communication processes as vertices instead of edges. Arabnejad and Barbosa [3] developed a Heterogeneous Budget Constrained Scheduling (HBCS) algorithm for minimizing the makespan and rental cost of cloud workflow applications. The HBCS algorithm is able to reduce the execution time by up to 30% while maintaining the same budget level.

Many works in the scheduling literature consider deterministic computation and communication times. However, using a deterministic objective function does not match the non-deterministic nature of real-world applications [2]. In this sense, Zheng et al. [62] proposed a Monte Carlo based scheduling
method for cloud/grid workflows which considers non-deterministic computing and communication times. The method is not dependent on a particular heuristic algorithm, and HEFT is adopted in the evaluation. In [63], a randomized version of HEFT was proposed. The algorithm consists in running a deterministic HEFT for random predictions of the stochastic DAG, generating a list of potential candidates for the best scheduling. The scheduling from the list with the smallest expected makespan is selected. Cai et al. [12] present a dynamic algorithm for minimizing the rental cost (of VMs in a cloud) of bag-of-tasks workflows with non-deterministic times. The Cloud Workflow Scheduling Algorithm (CWSA) [47] aims to optimize the scheduling of workflows in a multi-tenant cloud environment. This algorithm considers non-deterministic task computation times.

The presence of failures in data centers can pose a threat to workflow applications with strict deadlines. In [56], an original fault-tolerant scheduling algorithm named FESTAL was proposed. It employs a primary/backup redundancy model and VM migration to achieve high availability and load balancing in a cloud workflow application. Vinay et al. [54] present a new heuristic for cloud scheduling named CHEFT (Cluster-based Heterogeneous Earliest Finish Time). It uses the idle time of the processors for resubmitting failed tasks as a means to achieve fault tolerance. FASTER [66] is another algorithm that employs the primary/backup redundancy model for providing a fault-tolerant scheduling mechanism for cloud applications. Performability was first considered as an objective function in [17]. The authors developed a performability model of a grid resource based on a stochastic reward net model and the universal generating function. The proposed model is connected to a genetic algorithm which aims to optimize the scheduling of a DAG onto a set of grid resources.

Our work aims to contribute to the research line opened by Entezari et al. [17] - workflow scheduling optimization from a performability viewpoint. To the best of our knowledge, no existing method simultaneously covers performability as the objective function, multi-tenancy, non-deterministic computing/communication times, and failures of hosts and VMs. Table 1 shows a comparison of our work to the state of the art.

4 Problem Definition and Proposed Optimization Method

This work aims to solve the multiprocessor scheduling problem of scientific workflows running in a cloud environment, using a performability metric as the objective function. Given a workflow described by a DAG G = {T, E}, our objective is to find a scheduling S of the tasks in T on m virtual machines that maximizes the throughput of jobs. The number m is not fixed and must be determined by the optimization method.

The flow diagram of Fig. 1 presents a high-level overview of the proposed optimization method. The workflow DAG has a list of tasks and their dependencies, the processing time of each task, and the communication time between dependent tasks running on different processors. The computing and communication times of a DAG can either be deterministic or follow a specific random distribution (normal, exponential, Erlang, and so on). The cloud infrastructure parameters define the number of physical servers, the maximum number of virtual machines that each host can provide, and the failure/repair/switchover rates of the physical/virtual machines. The simulation/optimization parameters configure the simulation engine (e.g., the number of replications for an individual simulation) and the optimization algorithm (e.g., population size, number of elite chromosomes, number of generations, and so on).

The optimization algorithm explores the solution space for the input DAG until the stopping condition, defined by the control parameters, is reached. Then, a near-optimal scheduling solution is provided by the algorithm. The user can perform further analysis of the obtained solution and evaluate additional performance/reliability metrics, such as the average waiting/response time, the discard rate of tasks/jobs, and the probability of completing a job. The method can be used interactively, i.e., the user can modify the cloud/control parameters and repeat the process, obtaining new scheduling and performability metrics.

The remainder of this section explains each part that composes the proposed method.

4.1 Genetic Algorithm with Stochastic Fitness Function

The activity diagram of Fig. 2 describes the optimization algorithm adopted in this paper. It is a genetic optimization procedure that uses a simulation model for computing the fitness value of explored chromosomes. The chromosome representation consists of a pair of vectors representing the ordering of tasks and
the mapping of tasks to the processors, as illustrated in Fig. 3. The partially-mapped crossover (PMX) operator [20] was adopted to generate the offspring of a population. Two chromosomes are randomly selected for creating a pair of children (parents with higher fitness values are more likely to be selected). The process to generate the children is defined as follows. First, it creates a copy of the parents. A paired subinterval is randomly selected and switched among the children. Then, a mapping function is applied to convert the repeated alleles (i.e., the units of information that compose the chromosome) outside the random subinterval. Figure 4 shows an example of the crossover operator. The mutation operator modifies a chromosome with a random operation by swapping
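The PMX steps described above (copy the parents, switch a paired subinterval, then remap repeated alleles outside it) can be sketched as follows. This is a minimal illustration of the standard operator [20], not the authors' implementation; names are ours:

```python
import random

def pmx(parent_a, parent_b, lo=None, hi=None):
    """Partially-mapped crossover on two permutations (task orderings)."""
    n = len(parent_a)
    if lo is None or hi is None:
        lo, hi = sorted(random.sample(range(n + 1), 2))

    def child_of(donor, receiver):
        # 1) copy the receiver, 2) switch in the donor's subinterval
        child = list(receiver)
        child[lo:hi] = donor[lo:hi]
        # 3) remap repeated alleles outside the subinterval
        mapping = {donor[i]: receiver[i] for i in range(lo, hi)}
        segment = set(donor[lo:hi])
        for i in list(range(lo)) + list(range(hi, n)):
            allele = receiver[i]
            while allele in segment:
                allele = mapping[allele]
            child[i] = allele
        return child

    return child_of(parent_a, parent_b), child_of(parent_b, parent_a)
```

Both children remain valid permutations, which keeps every chromosome a feasible task ordering.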
establishes the scheduling of the subtasks. This strategy determines the number of virtual machines allocated from the cloud provider and the tasks executed on each of them. The scheduling can be generated by some heuristic (First Come-First Served, Shortest Job First, Heterogeneous Earliest Finish Time, etc.) or meta-heuristic algorithm (taboo search, genetic algorithm, ant colony, etc.).

Figure 8 depicts the discrete event simulation model as an open queue accepting a job submission influx with rate equal to λ. Each job will be promptly executed in case there are available resources in the cloud. Otherwise, it will be enqueued or discarded (if the queue is full). A job will also be discarded if the cloud manager is unavailable, since in this case it is not possible to allocate the cloud resources for the job execution. Another possibility for the failure of a job submission is when some virtual machine failure prevents the execution of one or more subtasks. A virtual machine can fail due to software (operating system or hypervisor) or hardware faults.

The numbers of completed and failed jobs are recorded by the simulation model. The annual throughput is obtained by dividing the total number of completed jobs by the simulation time in years. The job failure ratio is defined by (1). The annual throughput and the job failure ratio are the metrics adopted in the analysis of Section 6.

Job failure ratio = Discarded jobs / (Completed jobs + Discarded jobs)    (1)
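In code, the two metrics reduce to simple counter arithmetic on the simulation output. The helper below is illustrative; the function and variable names are ours, not part of the authors' tooling:

```python
def annual_metrics(completed_jobs, discarded_jobs, sim_time_years):
    """Annual throughput (completed jobs per simulated year) and the
    job failure ratio of Eq. (1), computed from the counters recorded
    by the simulation model."""
    throughput = completed_jobs / sim_time_years
    failure_ratio = discarded_jobs / (completed_jobs + discarded_jobs)
    return throughput, failure_ratio
```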
Fig. 12 Model generated with structural parameters set to number of servers = 2, vms per server = 2
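To give a feel for how the structural parameters of Fig. 12 could drive automatic model generation, the sketch below enumerates the places of a server/VM net for given parameter values. It is a hypothetical fragment with names of our choosing; the authors' SPN generator also emits transitions and arcs, which are omitted here:

```python
def generate_places(num_servers, vms_per_server):
    """Enumerate up/down places for each server and its hosted VMs."""
    places = []
    for s in range(num_servers):
        places += [f"server{s}_up", f"server{s}_down"]
        for v in range(vms_per_server):
            places += [f"server{s}_vm{v}_up", f"server{s}_vm{v}_down"]
    return places
```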
idle component is assumed to be lower than the failure rate of a component serving requests [23]. The activate spare transition represents the switchover event, i.e., the process of configuring the standby server to handle the incoming workload. Due to the inhibitor arc, this transition is enabled only when a failure occurs on the primary server. After the primary server is repaired, the standby server is sent back to the idle state.

5.3 Simulation Environment

In this subsection, we go into details about the top-level simulation model and the simulation engine. Figure 14 displays an overview of the top-level simulation model. As a discrete event simulator, it has a global clock for the simulation time and an ordered list of events which is processed and updated as the simulator runs. The simulation engine also maintains a list of running Petri nets. Petri nets can be configured at the beginning of the simulation, and new Petri nets can be created or destroyed during the simulation run time. The simulation routines are software modules (implemented in the Java programming language) that are invoked according to specific simulation events. These routines can modify the simulation state, schedule/cancel simulation events, and start/destroy Petri nets. The Petri nets generate firing events and timed
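The engine described above (a global clock plus an ordered event list, processed and updated as the run proceeds) can be reduced to the following sketch. The paper's engine is written in Java and adds the Petri-net layer, which is omitted here; this Python fragment only illustrates the event loop:

```python
import heapq

class Engine:
    """Minimal discrete event simulator: global clock + ordered event list."""

    def __init__(self):
        self.clock = 0.0
        self._events = []  # heap of (time, sequence number, routine)
        self._seq = 0

    def schedule(self, delay, routine):
        heapq.heappush(self._events, (self.clock + delay, self._seq, routine))
        self._seq += 1

    def run(self, until):
        # Pop events in time order; a routine may schedule further events.
        while self._events and self._events[0][0] <= until:
            self.clock, _, routine = heapq.heappop(self._events)
            routine(self)
        self.clock = until
```

A simulation routine is any callable taking the engine as argument; it can inspect the state and schedule further events, mirroring the role of the simulation routines above.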
Fig. 16 Average and maximum fitness value (number of processed jobs per year) of each generation

Fig. 17 Sensitivity analysis - one factor at a time (95% confidence interval)
presents the results of the sensitivity analysis, considering the best scheduling found by brute force. For the collected metrics, we indicate the 95% confidence intervals alongside the average values. For each point on the plots, we obtained 500 samples from the simulation model. Figure 17a and b show the impact of the hardware and virtual machine MTTFs on the number of processed jobs per year, and Fig. 17c and d show the sensitivity analysis of the failure ratio of jobs. The analysis reveals that the system throughput and the job failure ratio are sensitive to VM failures. It can also be noticed that, as the hardware/virtual machine MTTFs increase, the differences between adjacent points in the plots become less pronounced and the confidence intervals overlap.

6.2 Optimization of a LIGO Workflow Application

In the second case study, we used two scientific workflows: the LIGO Inspiral Analysis workflow [9] (Fig. 18a) and a randomly generated DAG (Fig. 18b). The LIGO workflow was created with the Pegasus Workflow Generator available in [26]. This workflow generator creates synthetic workflows based on traces collected from real-world scientific workflows. The random DAG was created by an ad-hoc algorithm. Computing and communication times for the random DAG were generated using a Uniform distribution over the intervals [1 min, 20 min] and [1 min, 10 min], respectively.

Table 4 shows the updated parameters for the second case study.

Table 4 Model/configuration parameters - second case study

Parameter                                    Value
Number of workers                            50
VMs per worker                               4
Number of replications for the simulation    10
Generations                                  25
Population size                              40
Number of elite chromosomes                  3

Unfortunately, increasing the cloud scale too much leads to a huge computational effort to solve the simulation model. The reason for this limitation is the presence of stiffness in performability models. For capturing failure events in the performance model, it is necessary to employ a long simulation run (longer than a year). The high number of events to be processed in a single simulation run leads to a long simulation runtime. SRIP (Single Replication in Parallel) techniques can be used to simulate a larger cloud infrastructure, i.e., a cloud having hundreds or thousands of physical servers.

Figure 19a and b summarize the results of the genetic algorithm. They are displayed as boxplots for each generation produced by the optimization algorithm. Since we are using elitism, the best solution (represented by the top horizontal bar in each box plot) is kept until better solutions are obtained. In contrast to the previous case study, we noticed a more accentuated non-monotonic growth in the average fitness value (i.e., the average fitness value for the ith generation being smaller than the value for the (i − 1)th generation). However, the algorithm can increase both the average and maximum fitness value in the long term.

The presented case studies confirm the ability of the proposed simulation-based optimization method to solve the workflow scheduling problem from a performability viewpoint. Our method enables the optimization process to treat aspects that would be impossible to capture with a deterministic function, namely:

– Modeling non-deterministic and non-exponential computation/communication times;
– Capturing the failure relationships between servers and virtual machines;
– Modeling the provisioning of cloud resources to multiple users concurrently;
– Representing the influence of the cloud controller on the overall system's performance.

Using such a complex non-deterministic objective function (a discrete simulation model) did not cause the optimization algorithm to misbehave. The results for the LIGO and random workflows show the effectiveness of the generic operators (mutation and crossover) in avoiding getting stuck at a local maximum. New elite chromosomes were found multiple times in both scenarios.

6.2.1 Performance and Reliability Analysis

For evaluating the impact of failures on the second case study, we performed a sensitivity analysis on the effect of hardware and VM failures in the adopted
workflows. In this study, we consider the best scheduling obtained with the optimization method. The results are shown in Fig. 20. We confirm the same pattern visualized in the previous section for the small DAG: the impact of hardware and VM failures diminishes as the reliability of these components reaches a certain level.

Figure 21 shows the impact of cloud manager failures on the system throughput. It also allows us to evaluate the effectiveness of the warm-standby redundancy mechanism when contrasted with a single-node cloud manager (without redundancy). Figure 21 indicates that, for a small MTTF of a server node, there is a substantial increase in the number of processed jobs per year when using a redundant cloud manager. For a large MTTF, however, the difference between the mean numbers of processed jobs is minimal, and the confidence intervals overlap. These results indicate a negligible impact of the cloud manager on system throughput when this subsystem is highly available.

6.2.2 Random Makespan Kernel Density Estimation (LIGO Workflow)

Considering non-deterministic communication and computation times for a workflow scheduling algorithm means that the makespan will be defined by a random variable instead of being a fixed value. We performed a kernel density estimation for the random makespan of the best scheduling found for the LIGO workflow.
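A kernel density estimate can be computed directly from makespan samples collected over simulation replications. The sketch below uses a Gaussian kernel with Silverman's rule-of-thumb bandwidth; the synthetic samples and all names are illustrative assumptions, not the data behind our figures:

```java
import java.util.Random;

// Gaussian kernel density estimation over sampled makespans.
public class MakespanKde {
    // Evaluate the KDE at point x for the given samples and bandwidth h.
    static double density(double[] samples, double h, double x) {
        double sum = 0.0;
        for (double s : samples) {
            double u = (x - s) / h;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
        }
        return sum / (samples.length * h);
    }

    // Silverman's rule of thumb: h = 1.06 * sd * n^(-1/5).
    static double bandwidth(double[] samples) {
        double mean = 0.0, var = 0.0;
        for (double s : samples) mean += s;
        mean /= samples.length;
        for (double s : samples) var += (s - mean) * (s - mean);
        double sd = Math.sqrt(var / (samples.length - 1));
        return 1.06 * sd * Math.pow(samples.length, -0.2);
    }

    public static void main(String[] args) {
        // Synthetic makespans (minutes) standing in for simulation replications.
        Random rng = new Random(7);
        double[] makespans = new double[500];
        for (int i = 0; i < makespans.length; i++)
            makespans[i] = 120.0 + 15.0 * rng.nextGaussian();
        double h = bandwidth(makespans);
        for (double x = 80.0; x <= 160.0; x += 20.0)
            System.out.printf("f(%.0f) = %.5f%n", x, density(makespans, h, x));
    }
}
```

Evaluating the estimated density on a grid of candidate makespans yields the kind of density curve a scheduler can use to reason about, for example, the probability of exceeding a deadline rather than a single point estimate.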
7 Conclusions
References
4. Bianchi, L., Dorigo, M., Gambardella, L.M., Gutjahr, W.J.: A survey on metaheuristics for stochastic combinatorial optimization. Nat. Comput. 8(2), 239–287 (2009)
5. Bitam, S.: Bees life algorithm for job scheduling in cloud computing. In: Proceedings of the Third International Conference on Communications and Information Technology, pp. 186–191 (2012)
6. Blum, C., Roli, A.: Metaheuristics in combinatorial optimization: overview and conceptual comparison. ACM Comput. Surv. (CSUR) 35(3), 268–308 (2003)
7. Bolch, G., Greiner, S., de Meer, H., Trivedi, K.S.: Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. Wiley, Hoboken (2006)
8. Book, R.V. et al.: Michael R. Garey and David S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. Bulletin (New Series) of the American Mathematical Society 3(2), 898–904 (1980)
9. Brown, D.A., Brady, P.R., Dietz, A., Cao, J., Johnson, B., McNabb, J.: A case study on the use of workflow technologies for scientific analysis: gravitational wave data analysis. In: Workflows for E-Science, pp. 39–59. Springer (2007)
10. Bux, M., Leser, U.: DynamicCloudSim: simulating heterogeneity in computational clouds. Futur. Gener. Comput. Syst. 46, 85–99 (2015)
11. Cai, Z., Li, Q., Li, X.: ElasticSim: a toolkit for simulating workflows with cloud resource runtime auto-scaling and stochastic task execution times. J. Grid Comput. 15(2), 257–272 (2017)
12. Cai, Z., Li, X., Ruiz, R., Li, Q.: A delay-based dynamic scheduling algorithm for bag-of-task workflows with stochastic task execution times in clouds. Futur. Gener. Comput. Syst. 71, 57–72 (2017)
13. Calheiros, R.N., Ranjan, R., Beloglazov, A., De Rose, C.A., Buyya, R.: CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw. Pract. Exp. 41(1), 23–50 (2011)
14. Chen, W., Deelman, E.: WorkflowSim: a toolkit for simulating scientific workflows in distributed environments. In: 2012 IEEE 8th International Conference on E-Science (E-Science), pp. 1–8. IEEE (2012)
15. Chen, W.N., Zhang, J.: Ant colony optimization for software project scheduling and staffing with an event-based scheduler. IEEE Trans. Softw. Eng. 39(1), 1–17 (2013)
16. Davis, N.A., Rezgui, A., Soliman, H., Manzanares, S., Coates, M.: FailureSim: a system for predicting hardware failures in cloud data centers using neural networks. In: 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), pp. 544–551. IEEE (2017)
17. Entezari-Maleki, R., Trivedi, K.S., Sousa, L., Movaghar, A.: Performability-based workflow scheduling in grids. The Computer Journal (2018)
18. Ever, E.: Performability analysis of cloud computing centers with large numbers of servers. J. Supercomput. 73(5), 2130–2156 (2017)
19. Ghosh, R., Trivedi, K.S., Naik, V.K., Kim, D.S.: End-to-end performability analysis for infrastructure-as-a-service cloud: an interacting stochastic models approach. In: 2010 IEEE 16th Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 125–132. IEEE (2010)
20. Goldberg, D.E., Lingle, R., et al.: Alleles, loci, and the traveling salesman problem. In: Proceedings of an International Conference on Genetic Algorithms and their Applications, vol. 154, pp. 154–159. Lawrence Erlbaum, Hillsdale (1985)
21. Gorissen, D., Couckuyt, I., Demeester, P., Dhaene, T., Crombecq, K.: A surrogate modeling and adaptive sampling toolbox for computer based design. J. Mach. Learn. Res. 11, 2051–2055 (2010)
22. Gu, J., Hu, J., Zhao, T., Sun, G.: A new resource scheduling strategy based on genetic algorithm in cloud computing environment. J. Comput. 7(1), 42–52 (2012)
23. Guimarães, A.P., Maciel, P.R., Matias, R.: An analytical modeling framework to evaluate converged networks through business-oriented metrics. Reliab. Eng. Syst. Saf. 118, 81–92 (2013)
24. Hamby, D.: A review of techniques for parameter sensitivity analysis of environmental models. Environ. Monit. Assess. 32(2), 135–154 (1994)
25. Hoffa, C., Mehta, G., Freeman, T., Deelman, E., Keahey, K., Berriman, B., Good, J.: On the use of cloud computing for scientific workflows. In: 2008 IEEE Fourth International Conference on eScience (eScience '08), pp. 640–645. IEEE (2008)
26. Juve, G., Bharathi, S.: Pegasus synthetic workflow generator. https://confluence.pegasus.isi.edu/display/pegasus/WorkflowGenerator (2014)
27. Juve, G., Deelman, E., Vahi, K., Mehta, G., Berriman, B., Berman, B.P., Maechling, P.: Scientific workflow applications on Amazon EC2. In: 2009 5th IEEE International Conference on E-Science Workshops, pp. 59–66. IEEE (2009)
28. Kim, D.S., Machida, F., Trivedi, K.S.: Availability modeling and analysis of a virtualized system. In: 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC '09), pp. 365–371. IEEE (2009)
29. Kliazovich, D., Pecero, J.E., Tchernykh, A., Bouvry, P., Khan, S.U., Zomaya, A.Y.: CA-DAG: modeling communication-aware applications for scheduling in cloud computing. J. Grid Comput. 14(1), 23–39 (2016)
30. Kohne, A., Spohr, M., Nagel, L., Spinczyk, O.: FederatedCloudSim: an SLA-aware federated cloud simulation framework. In: Proceedings of the 2nd International Workshop on CrossCloud Systems, p. 3. ACM (2014)
31. LD, D.B., Krishna, P.V.: Honey bee behavior inspired load balancing of tasks in cloud computing environments. Appl. Soft Comput. 13(5), 2292–2303 (2013)
32. Lin, W., Wu, W., Wang, J.Z.: A heuristic task scheduling algorithm for heterogeneous virtual clusters. Sci. Program. 2016, Article ID 7040276 (2016)
33. Maciel, P., Matos, R., Silva, B., Figueiredo, J., Oliveira, D., Fé, I., Maciel, R., Dantas, J.: Mercury: performance and dependability evaluation of systems with exponential, expolynomial, and general distributions. In: 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 50–57. IEEE (2017)
34. Mainkar, V., Trivedi, K.S.: Sufficient conditions for existence of a fixed point in stochastic reward net-based iterative models. IEEE Trans. Softw. Eng. 22(9), 640–653 (1996)
35. Malawski, M., Juve, G., Deelman, E., Nabrzyski, J.: Algorithms for cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds. Futur. Gener. Comput. Syst. 48, 1–18 (2015)
36. Meyer, J.F.: On evaluating the performability of degradable computing systems. IEEE Trans. Comput. C-29(8), 720–731 (1980)
37. Mezmaz, M., Melab, N., Kessaci, Y., Lee, Y.C., Talbi, E.G., Zomaya, A.Y., Tuyttens, D.: A parallel bi-objective hybrid metaheuristic for energy-aware scheduling for cloud computing systems. J. Parallel Distrib. Comput. 71(11), 1497–1508 (2011)
38. Molloy, M.K.: Performance analysis using stochastic Petri nets. IEEE Trans. Comput. 31(9), 913–917 (1982)
39. Nelder, J.A., Mead, R.: A simplex method for function minimization. Comput. J. 7(4), 308–313 (1965)
40. Oliveira, D., Matos, R., Dantas, J., Ferreira, J., Silva, B., Callou, G., Maciel, P., Brinkmann, A.: Advanced stochastic Petri net modeling with the Mercury scripting language. In: ValueTools 2017, 11th EAI International Conference on Performance Evaluation Methodologies and Tools, Venice, Italy. Elsevier (2017)
41. Panda, S.K., Jana, P.K.: Efficient task scheduling algorithms for heterogeneous multi-cloud environment. J. Supercomput. 71(4), 1505–1533 (2015)
42. Plateau, B., Atif, K.: Stochastic automata network of modeling parallel systems. IEEE Trans. Softw. Eng. 17(10), 1093–1108 (1991)
43. Qiu, X., Sun, P., Guo, X., Xiang, Y.: Performability analysis of a cloud system. In: 2015 IEEE 34th International Performance Computing and Communications Conference (IPCCC), pp. 1–6. IEEE (2015)
44. Queipo, N.V., Haftka, R.T., Shyy, W., Goel, T., Vaidyanathan, R., Tucker, P.K.: Surrogate-based analysis and optimization. Prog. Aerosp. Sci. 41(1), 1–28 (2005)
45. Raei, H., Yazdani, N.: Performability analysis of cloudlet in mobile cloud computing. Inform. Sci. 388, 99–117 (2017)
46. Ramakrishnan, L., Reed, D.A.: Performability modeling for scheduling and fault tolerance strategies for scientific workflows. In: Proceedings of the 17th International Symposium on High Performance Distributed Computing, pp. 23–34. ACM (2008)
47. Rimal, B.P., Maier, M.: Workflow scheduling in multi-tenant cloud computing environments. IEEE Trans. Parallel Distrib. Syst. 28(1), 290–304 (2017)
48. Rodriguez, M.A., Buyya, R.: A taxonomy and survey on scheduling algorithms for scientific workflows in IaaS cloud computing environments. Concurr. Comput. Pract. Exp. 29(8), e4041 (2017)
49. Sousa, E., Lins, F., Tavares, E., Cunha, P., Maciel, P.: A modeling approach for cloud infrastructure planning considering dependability and cost requirements. IEEE Trans. Syst. Man Cybern. Syst. Hum. 45(4), 549–558 (2015)
50. Sousa, E., Lins, F., Tavares, E., Maciel, P.: Cloud infrastructure planning considering different redundancy mechanisms. Computing 99(9), 841–864 (2017)
51. Swisher, J.R., Hyden, P.D., Jacobson, S.H., Schruben, L.W.: A survey of simulation optimization techniques and procedures. In: Proceedings of the 2000 Winter Simulation Conference, vol. 1, pp. 119–128. IEEE (2000)
52. Tawfeek, M.A., El-Sisi, A., Keshk, A.E., Torkey, F.A.: Cloud task scheduling based on ant colony optimization. In: 2013 8th International Conference on Computer Engineering & Systems (ICCES), pp. 64–69. IEEE (2013)
53. Tsai, C.W., Rodrigues, J.J.: Metaheuristic scheduling for cloud: a survey. IEEE Syst. J. 8(1), 279–291 (2014)
54. Vinay, K., Kumar, S.D.: Fault-tolerant scheduling for scientific workflows in cloud environments. In: 2017 IEEE 7th International Advance Computing Conference (IACC), pp. 150–155. IEEE (2017)
55. Vöckler, J.S., Juve, G., Deelman, E., Rynge, M., Berriman, B.: Experiences using cloud computing for a scientific workflow application. In: Proceedings of the 2nd International Workshop on Scientific Cloud Computing, pp. 15–24. ACM (2011)
56. Wang, J., Bao, W., Zhu, X., Yang, L.T., Xiang, Y.: FESTAL: fault-tolerant elastic scheduling algorithm for real-time tasks in virtualized clouds. IEEE Trans. Comput. 64(9), 2545–2558 (2015)
57. Wang, T., Chang, X., Liu, B.: Performability analysis for IaaS cloud data center. In: 2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 91–94. IEEE (2016)
58. Xia, Y., Zhou, M., Luo, X., Zhu, Q., Li, J., Huang, Y.: Stochastic modeling and quality evaluation of infrastructure-as-a-service clouds. IEEE Trans. Autom. Sci. Eng. 12(1), 162–170 (2015)
59. Xu, Y., Li, K., He, L., Zhang, L., Li, K.: A hybrid chemical reaction optimization scheme for task scheduling on heterogeneous computing systems. IEEE Trans. Parallel Distrib. Syst. 26(12), 3208–3222 (2015)
60. Zhao, C., Zhang, S., Liu, Q., Xie, J., Hu, J.: Independent tasks scheduling based on genetic algorithm in cloud computing. In: 2009 5th International Conference on Wireless Communications, Networking and Mobile Computing (WiCom '09), pp. 1–4. IEEE (2009)
61. Zhao, H.W., Tian, L.W.: Resource schedule algorithm based on artificial fish swarm in cloud computing environment. In: Applied Mechanics and Materials, vol. 635, pp. 1614–1617. Trans Tech Publications (2014)
62. Zheng, W., Sakellariou, R.: Stochastic DAG scheduling using a Monte Carlo approach. J. Parallel Distrib. Comput. 73(12), 1673–1689 (2013)
63. Zheng, W., Wang, C., Zhang, D.: A randomization approach for stochastic workflow scheduling in clouds. Sci. Program. 2016, Article ID 9136107 (2016)
64. Zheng, Z., Wang, R., Zhong, H., Zhang, X.: An approach for cloud resource scheduling based on parallel genetic algorithm. In: 2011 3rd International Conference on Computer Research and Development (ICCRD), vol. 2, pp. 444–447. IEEE (2011)
65. Zhou, A., Wang, S., Sun, Q., Zou, H., Yang, F.: FTCloudSim: a simulation tool for cloud service reliability enhancement mechanisms. In: Proceedings Demo & Poster Track of ACM/IFIP/USENIX International Middleware Conference, p. 2. ACM (2013)
66. Zhu, X., Wang, J., Guo, H., Zhu, D., Yang, L.T., Liu, L.: Fault-tolerant scheduling for real-time scientific workflows with elastic resource provisioning in virtualized clouds. IEEE Trans. Parallel Distrib. Syst. 27(12), 3501–3517 (2016)