Abstract—By leveraging virtual machine (VM) technology, we optimize cloud system performance through refined resource
allocation when processing user requests composed of composite services. Our contribution is three-fold. (1) We devise a VM resource
allocation scheme that minimizes the processing overhead of task execution. (2) We comprehensively investigate the best-suited task
scheduling policy under different design parameters. (3) We also explore the best-suited resource sharing scheme with adjusted divisible
resource fractions on running tasks in terms of the proportional-share model (PSM), which can be split into an absolute mode (called
AAPSM) and a relative mode (RAPSM). We implement a prototype system over a cluster environment deployed with 56 real VM
instances, and summarize valuable experience from our evaluation. When the system runs in short supply, lightest workload first (LWF)
is mostly recommended because it minimizes the overall response extension ratio (RER) for both sequential-mode tasks and
parallel-mode tasks. In a competitive situation with over-commitment of resources, the best solution combines LWF with both AAPSM
and RAPSM. It outperforms other solutions in the competitive situation, by 16+% w.r.t. the worst-case response time and by 7.4+%
w.r.t. the fairness.
Index Terms—Cloud resource allocation, task scheduling, virtual machine, minimization of overhead
input parameters, and the output will be cached in the VM, waiting for the notification of the data transmission for its succeeding subtask.

We adopt XEN's credit scheduler [19] to perform the resource isolation among VMs on the same physical machine. With XEN [20], we can dynamically isolate some key resources (like CPU rate and network bandwidth) to suit the specific usage demands of different tasks. There are two key concepts in the credit scheduler, capacity and weight. Capacity specifies the upper limit on the CPU rate consumable by a particular VM, and weight means a VM's proportional-share credit. On a relatively free physical host, the CPU rate of a running VM is determined by its capacity. If there are over-many VMs running on a physical machine, the real CPU rates allocated to them are proportional to their weights. Both capacity and weight can be tuned at runtime.
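To make the capacity/weight semantics concrete, here is a minimal Python sketch (our own illustration, not code from the paper's prototype): on a free host each VM runs at its capacity, while under contention the rates become proportional to the weights.

```python
def effective_cpu_rates(host_capacity, vms):
    """Sketch of the credit-scheduler semantics described above.

    host_capacity: total CPU rate of the host (e.g., 800 = 8 cores x 100).
    vms: list of dicts like {"name": ..., "capacity": ..., "weight": ...}.
    Returns {name: allocated CPU rate}.
    """
    demand = sum(vm["capacity"] for vm in vms)
    if demand <= host_capacity:
        # Relatively free host: each VM runs at its capacity (upper limit).
        return {vm["name"]: vm["capacity"] for vm in vms}
    # Contended host: rates proportional to weights, still capped by capacity.
    total_weight = sum(vm["weight"] for vm in vms)
    return {
        vm["name"]: min(vm["capacity"],
                        host_capacity * vm["weight"] / total_weight)
        for vm in vms
    }

# Example: two VMs competing for an 8-core host (100 = one core).
print(effective_cpu_rates(800, [
    {"name": "vm1", "capacity": 800, "weight": 256},
    {"name": "vm2", "capacity": 800, "weight": 512},
]))  # vm2 receives twice vm1's CPU rate
```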
3 PROBLEM FORMULATION

Assuming there are n tasks to be processed by the system, they are denoted as t_i, where i = 1, 2, ..., n. Each task is made up of multiple subtasks connected in series or in parallel. We denote the subtasks of the task t_i as t_{i(1)}, t_{i(2)}, ..., t_{i(m_i)}, where m_i refers to the number of subtasks in t_i. Such a formulation is generic enough that any user request (or task) can be constructed from multiple nested composite services (or subtasks).

Task execution time is represented in different ways based on the different intra-structures of subtask connection. For a sequential-mode task, its total execution time (or execution length) can be denoted as T(t_i) = \sum_{j=1}^{m_i} \frac{l_{i(j)}}{r_{i(j)}}, where l_{i(j)} and r_{i(j)} are referred to as the workload of subtask t_{i(j)} and the compute resource allocated to it, respectively. The workload here is evaluated by the number of instructions or the amount of data to read/write from/to disk, and the compute resource here means the workload processing rate, like CPU rate or disk I/O bandwidth. As for a parallel-mode task (e.g., an embarrassingly parallel application), its total execution length is equal to the longest execution time of its subtasks (or makespan). We will use execution time, execution length, response length, and wall-clock time interchangeably in the following text.

Each subtask t_{i(j)} will call a particular service API, which is associated with a service price (denoted as p_{i(j)}). The service prices ($/unit) are determined by the corresponding service makers in our model, since they are the ones who pay monthly resource leases to infrastructure-as-a-service (IaaS) providers (e.g., Amazon EC2 [21]). The total payment in executing a task t_i on top of the service layer is equal to \sum_{j=1}^{m_i} r_{i(j)} p_{i(j)}. Each task is associated with a budget (denoted as B(t_i)) by its user in order to control its total payment. Hence, the problem of optimizing task t_i's execution can be formulated as Formulas (1) and (2) (a convex-optimization problem):

$$\min T(t_i) = \begin{cases} \sum_{j=1}^{m_i} \frac{l_{i(j)}}{r_{i(j)}}, & t_i \text{ is in sequential mode} \\ \max_{j=1,\ldots,m_i} \frac{l_{i(j)}}{r_{i(j)}}, & t_i \text{ is in parallel mode} \end{cases} \qquad (1)$$

$$\text{s.t.}\quad \sum_{j=1}^{m_i} r_{i(j)}\,p_{i(j)} \le B(t_i). \qquad (2)$$

There are two metrics to evaluate the system performance. One is the RER of each task (defined in Formula (3)):

$$\mathrm{RER}(t_i) = \frac{t_i\text{'s real response time}}{t_i\text{'s theoretically optimal length}}. \qquad (3)$$

The RER is used to evaluate the execution performance of a particular task. The lower the RER value, the higher the execution efficiency with which the corresponding task is processed in reality. A sequential-mode task's theoretically optimal length (TOL) is the sum of the theoretical execution times of its subtasks based on the optimal resource allocation solution to the above problem (Formulas (1) and (2)), while a parallel-mode task's TOL is equal to the largest theoretical subtask execution time. The response time here indicates the whole wall-clock time from a task's submission to its final completion. In general, the response time of a task includes the subtask's waiting time, the overhead before subtask execution (e.g., on resource allocation or data transmission), the subtask's productive time, and the processing overhead after execution. We try our best to minimize the cost of each part.

The other metric is the fairness index of RER among all tasks (defined in Formula (4)), which is used to evaluate the fairness of the treatment in the system. Its value ranges in [0, 1], and the bigger its value, the higher the fairness of the treatment. Based on Formula (3), the fairness is also related to the different types of execution overheads. How to effectively coordinate the overheads among different tasks is a very challenging issue. This is mainly due to the largely different task structures (i.e., the subtasks' workloads and their connection ways), task budgets, and varied resource availability over time:

$$\mathrm{fairness}(t_i) = \frac{\big(\sum_{i=1}^{n}\mathrm{RER}(t_i)\big)^2}{n\sum_{i=1}^{n}\mathrm{RER}^2(t_i)}. \qquad (4)$$

Our final objective is to minimize the RER of each individual task (or minimize the maximum RER) and maximize the overall fairness, especially in a competitive situation with over-many submitted tasks.
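As an illustration of the definitions above, the following sketch (our own Python, not the prototype's code; the numeric measurements are hypothetical) evaluates T(t_i), the RER, and the fairness index for a set of tasks.

```python
def execution_length(workloads, rates, parallel):
    """T(t_i) per Formula (1): sum of l/r in sequential mode, max in parallel mode."""
    per_subtask = [l / r for l, r in zip(workloads, rates)]
    return max(per_subtask) if parallel else sum(per_subtask)

def rer(real_response_time, tol):
    """Response extension ratio per Formula (3)."""
    return real_response_time / tol

def fairness(rers):
    """Fairness index of RER per Formula (4); its value ranges in [0, 1]."""
    n = len(rers)
    return sum(rers) ** 2 / (n * sum(x * x for x in rers))

# Hypothetical measurements: (real response time, TOL) pairs for three tasks.
rers = [rer(t, tol) for t, tol in [(42.0, 30.0), (315.0, 300.0), (12.0, 8.0)]]
print(rers, fairness(rers))
```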
4 OPTIMIZATION OF SYSTEM PERFORMANCE

In order to optimize the entire QoS of each task, we need to minimize the time cost at each step in the course of its execution. We study the best-fit solution with respect to the three following facets: resource allocation, task scheduling, and minimization of overheads.

4.1 Optimized Resource Allocation with VMs

We first derive an optimal resource vector for each task (including parallel-mode and sequential-mode tasks), subject to task structure and budget, in both the non-competitive situation and the competitive situation. In the non-competitive situation, there are always available and adequate resources for task processing. As for an over-committed situation (or competitive situation), the overall resources are over-committed such that the requested resource amounts exceed the de-facto resource amounts in the system. In this situation, we designed an adjustable resource allocation method for maintaining high performance and fairness.
4.1.1 Optimal Resource Allocation in the Non-Competitive Situation

In a non-competitive situation (with unlimited available resource amounts), the resource fraction allocated to a task is mainly restricted by its user-set budget. Based on the target function (Formula (1)) and the constraint (Formula (2)), we analyze the two types of tasks (sequential-mode and parallel-mode) respectively.

Optimization of Sequential-Mode Tasks:

Theorem 1. If task t_i is constructed in sequential mode, t_i's optimal resource vector r^*(t_i) for minimizing T(t_i) subject to the constraint (2) is shown as Equation (5), where j = 1, 2, ..., m_i:

$$r^*_{i(j)} = \frac{\sqrt{l_{i(j)}/p_{i(j)}}}{\sum_{k=1}^{m_i}\sqrt{l_{i(k)}\,p_{i(k)}}}\,B(t_i). \qquad (5)$$

Proof. Since \frac{\partial^2 T(t_i)}{\partial r_{i(j)}^2} = 2\frac{l_{i(j)}}{r_{i(j)}^3} > 0, T(t_i) is convex with a minimum extreme point. By combining the constraint (2), we can get the Lagrangian function as Formula (6), where \lambda refers to the Lagrangian multiplier:

$$F(r_i) = \sum_{j=1}^{m_i}\frac{l_{i(j)}}{r_{i(j)}} + \lambda\Big(B(t_i) - \sum_{j=1}^{m_i} r_{i(j)}\,p_{i(j)}\Big). \qquad (6)$$

We derive Equation (7) via the Lagrangian multiplier method:

$$r_{i(1)} : r_{i(2)} : \cdots : r_{i(m_i)} = \sqrt{\frac{l_{i(1)}}{p_{i(1)}}} : \sqrt{\frac{l_{i(2)}}{p_{i(2)}}} : \cdots : \sqrt{\frac{l_{i(m_i)}}{p_{i(m_i)}}}. \qquad (7)$$

Optimization of Parallel-Mode Tasks:

Theorem 2. If task t_i is constructed in parallel mode, t_i's optimal resource vector r^*(t_i) for minimizing T(t_i) subject to the constraint (2) is shown as Equation (8), where j = 1, 2, ..., m_i:

$$r^*_{i(j)} = \frac{l_{i(j)}}{\sum_{j=1}^{m_i} p_{i(j)}\,l_{i(j)}}\,B(t_i). \qquad (8)$$

Proof. We just need to prove that the optimal situation occurs if and only if all subtask execution lengths are equal to each other. That is, the entire execution length of a parallel-mode task will be minimized if and only if Equation (9) holds:

$$\frac{l_{i(1)}}{r_{i(1)}} = \frac{l_{i(2)}}{r_{i(2)}} = \cdots = \frac{l_{i(m_i)}}{r_{i(m_i)}}. \qquad (9)$$

In this situation, we can easily derive Equation (8) by using up the user-preset budget B(t_i), i.e., letting \sum_{j=1}^{m_i} r_{i(j)}\,p_{i(j)} = B(t_i) hold.

In the following, we use proof by contradiction to show that Equation (9) is a necessary condition of the optimal situation. Let us suppose an optimal situation with minimized task wall-clock length occurs while Equation (9) does not hold. Without loss of generality, we denote by t_{i(k)} the subtask that has the longest execution time (i.e., l_{i(k)}/r_{i(k)}), that is, T(t_i) = l_{i(k)}/r_{i(k)}. Since Equation (9) does not hold, there must exist another subtask t_{i(j)} such that l_{i(j)}/r_{i(j)} < l_{i(k)}/r_{i(k)}. Obviously, we are able to add a small increment \Delta_k to r_{i(k)} and decrease r_{i(j)} by \Delta_j correspondingly, such that the total payment is unchanged and the two subtasks' wall-clock lengths become the same. That is, Equations (10) and (11) hold simultaneously
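A small numerical sketch of Theorems 1 and 2 (our own Python; the budget and per-subtask numbers are made up for illustration) computes the closed-form optimal vectors of Equations (5) and (8) and checks that the budget constraint (2) is met with equality.

```python
import math

def optimal_rates_sequential(workloads, prices, budget):
    """Equation (5): r*_{i(j)} proportional to sqrt(l/p), scaled to the budget."""
    denom = sum(math.sqrt(l * p) for l, p in zip(workloads, prices))
    return [math.sqrt(l / p) * budget / denom for l, p in zip(workloads, prices)]

def optimal_rates_parallel(workloads, prices, budget):
    """Equation (8): equalizes l/r across subtasks (Equation (9)) within budget."""
    denom = sum(p * l for l, p in zip(workloads, prices))
    return [l * budget / denom for l in workloads]

l, p, B = [400.0, 100.0, 900.0], [1.0, 2.0, 0.5], 120.0
for rates in (optimal_rates_sequential(l, p, B), optimal_rates_parallel(l, p, B)):
    # Budget check (Formula (2)): the whole budget is used up in both cases.
    assert abs(sum(r * q for r, q in zip(rates, p)) - B) < 1e-9
print(optimal_rates_parallel(l, p, B))  # all l/r equal, per Equation (9)
```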
proportional to its weight. (The corresponding weight-setting command is "xm sched-credit -d VM -w weight".) Suppose on a physical host (denoted as h_i), n_i scheduled subtasks are running on n_i stand-alone VMs separately (denoted v_j, where j = 1, 2, ..., n_i). We denote the host h_i's total compute capacity as c_i (e.g., eight cores), and the weights of the n_i subtasks as w(v_1), w(v_2), ..., w(v_{n_i}). Then, the real resource fraction (denoted by r(v_j)) allocated to the VM v_j can be calculated by Formula (12):

$$r(v_j) = \frac{w(v_j)}{\sum_{k=1}^{n_i} w(v_k)}\,c_i. \qquad (12)$$

Now, the key question becomes how to determine the weight value of each running subtask (or VM) on a physical machine, to adapt to the competitive situation. We devise a novel model, namely the adjusted proportional-share model (APSM), which further tunes the credits based on the task's workload (or execution length). The design of APSM is based on the definition of RER: a large value of RER tends to appear with a short task. This is mainly due to the fact that the overheads (such as data transmission cost and VMM operation cost) in the whole wall-clock time are often relatively constant regardless of the total task workload. That is, based on RER's definition, a short task's RER is more sensitive to the execution overheads than a long one's. Hence, we make short tasks tend to get more resource fractions than their theoretically optimal vector (i.e., r^*_{i(j)}). There are two alternative ways to realize this effect.

Absolute mode. For this mode, we use a threshold (denoted as \tau) to split running tasks into two categories, short tasks (workload \le \tau) and long tasks (workload > \tau). Three values of \tau are investigated in our experiments: 500, 1,000, or 2,000, which correspond to 5, 10, or 20 seconds when running a task on a single core. We assign as much resource as possible to short tasks, while keeping the long tasks' resource fractions unchanged. Task length is evaluated in terms of its workload to process. In practice, it can be estimated based on workload characterization over history or a workload prediction method like [18]. In our design based on the absolute mode, short tasks' credits will be set to 800 (i.e., eight cores), implying the full computational power. For example, if there is only one short running task on a host, it will be assigned the full resources (eight cores) for its computation. If there are more running tasks, they will be allocated according to PSM, while short tasks will probably be assigned more resource fractions.

Relative mode. Our intuitive idea is to adopt a proportional-share model on most of the middle-size tasks such that the resource fractions they receive are proportional to their theoretically optimal resource amounts (r^*_{i(j)}). Meanwhile, we enhance the credits of the subtasks whose corresponding tasks are relatively short and decrease the credits of the ones with long tasks. That is, we give some extra credits to short tasks to enhance their resource consumption priority. Suppose d subtasks (belonging to different tasks) are running on a physical machine, denoted as t_{1(x_1)}, t_{2(x_2)}, ..., t_{d(x_d)}, where x_i = 1, 2, ..., or m_i. Then, w(t_{i(j)}) will be determined by either Formula (13) or Formula (14), based on different proportional-share credits (either the task's workload or the task's TOL). Hence, the relative-mode-based APSM (abbreviated as RAPSM) has two different types, workload-based APSM (abbreviated as RAPSM(W)) and TOL-based APSM (abbreviated as RAPSM(T)):

$$w(t_{i(j)}) = \begin{cases} \eta\,r^*_{i(j)}, & l_i \le \alpha \\ r^*_{i(j)}, & \alpha < l_i \le \beta \\ \frac{1}{\eta}\,r^*_{i(j)}, & l_i > \beta \end{cases} \qquad (13)$$

$$w(t_{i(j)}) = \begin{cases} \eta\,r^*_{i(j)}, & T(t_i) \le \alpha \\ r^*_{i(j)}, & \alpha < T(t_i) \le \beta \\ \frac{1}{\eta}\,r^*_{i(j)}, & T(t_i) > \beta \end{cases} \qquad (14)$$

The weight values in our design (Formula (13)) are determined by four parts: the extension coefficient (\eta), the theoretically optimal resource fraction (r^*_{i(j)}), the threshold value \alpha to determine short tasks, and the threshold value \beta to determine long tasks. Obviously, the value of \eta is supposed to be always greater than 1. In reality, tuning \eta's value adjusts the extension degree for short/long tasks. Changing the values of \alpha and \beta tunes the number of short/long tasks. That is, by adjusting these values dynamically, we could optimize the overall system performance to adapt to different contention states. Specific values suggested in practice will be discussed with our experimental results.

In practice, one could use either of the above two modes or both of them, to adjust the resource allocation to adapt to the competitive situation.
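The following Python sketch (ours, not the prototype's code; the parameter values are drawn from Table 2 for illustration) assigns RAPSM weights per Formulas (13)/(14) and then applies the proportional share of Formula (12).

```python
def rapsm_weight(r_opt, size, alpha, beta, eta):
    """Formulas (13)/(14): boost short tasks, damp long ones.

    size is the task's workload (RAPSM(W)) or its TOL T(t_i) (RAPSM(T));
    r_opt is the theoretically optimal fraction r*_{i(j)}; eta > 1.
    """
    if size <= alpha:          # short task: extra credits
        return eta * r_opt
    if size <= beta:           # middle-size task: plain proportional share
        return r_opt
    return r_opt / eta         # long task: reduced credits

def shares(host_capacity, weights):
    """Formula (12): real fractions are proportional to the weights."""
    total = sum(weights)
    return [host_capacity * w / total for w in weights]

# Three co-located subtasks whose tasks have TOLs of 8 s, 50 s, and 400 s
# (RAPSM(T) with alpha = 10 s, beta = 300 s, eta = 1.5), equal optimal fractions.
w = [rapsm_weight(100.0, t, alpha=10.0, beta=300.0, eta=1.5)
     for t in (8.0, 50.0, 400.0)]
print(shares(800.0, w))  # the short task gets the largest share
```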
4.2 Best-Suited Task Scheduling Policy

In a competitive situation where over-many tasks are submitted to the system, it is necessary to queue some tasks that temporarily cannot find qualified resources. The queue will be checked as soon as some new resources are released or new tasks are submitted. When multiple hosts are available for a task (e.g., there are still available CPU rates non-allocated on the host), the most powerful one with the largest availability will be selected as the execution host. A key question is how to select the waiting tasks based on their demands, such that the overall execution performance and the fairness can both be optimized.

Based on the two-fold objective that aims to minimize the RER and maximize the fairness meanwhile, we investigate the best-fit scheduling policy for both sequential-mode tasks and parallel-mode tasks. We propose that (1) the best-fit queuing policy for sequential-mode tasks is the lightest-workload-first policy, which assigns the highest scheduling priority to the shortest task, i.e., the one with the least workload amount to process; and (2) the best-fit policy for parallel-mode tasks is adopting LWF and longest-subtask-first (LSTF) together.
In addition, we also evaluate many other queuing policies for comparison, including FCFS, SOLF, SPF, SSTF, and so on. We describe all the task-selection policies below; a sketch of the corresponding priority rules follows the list.

FCFS. FCFS schedules the subtasks based on their arrival order. The first-arrived one in the queue will be scheduled as long as there are available resources to use. This is the most basic policy, and the easiest to implement. However, it does not take into account the variation of task features, such as task structure and task workload, so the performance and fairness will be significantly restricted.

Lightest-Workload-First. LWF schedules the subtasks based on the predicted workload of their corresponding tasks (a.k.a., jobs). A task's workload is defined as the execution length estimated based on a standard process rate (such as the single-core CPU rate). In the waiting queue, the subtask whose corresponding task has the lighter workload will be scheduled with a higher priority. In our Cloud system, which aims to minimize the RER and maximize the fairness meanwhile, LWF obviously possesses a prominent advantage. Note that various tasks' TOLs differ due to their different budget constraints and workloads, while tasks' execution overheads tend to be constant because of the usually stable memory size consumed over time. In addition, the tasks with lighter workloads tend to have smaller TOLs, based on the definition of T(t_i). Hence, according to the definition of RER, the tasks with lighter workloads (i.e., shorter jobs) are supposed to be more sensitive to their execution overheads, which means that they should be associated with higher priorities.

SOLF. SOLF is designed based on the following intuition: in order to minimize the RER of a task, we can only minimize the task's real execution length, because its theoretically optimal length (TOL) is a fixed constant determined by its intrinsic structure and budget. Since tasks' TOLs differ due to their heterogeneous structures, workloads, and budgets, the execution overheads will impact their RERs to different extents. Suppose there were two tasks whose TOLs are 30 and 300 seconds respectively, and whose execution overheads are both 10 seconds. Even though the sums of their subtask execution lengths were exactly the optimal values (30 and 300 seconds), their RERs would be largely different: (30+10)/30 versus (300+10)/300. In other words, the tasks with shorter TOLs are supposed to be scheduled with higher priorities, for minimizing the discrepancy among tasks' RERs.

SPF. SPF is designed for sequential-mode tasks, based on a task's real execution progress compared to its overall workload or TOL. The tasks with the slowest progress will have the highest scheduling priorities. The execution progress can be defined based on either the workload processed or the wall-clock time passed. These are called workload progress (WP) and time progress (TP) respectively, and they are defined in Formulas (15) and (16). In the two formulas, d refers to the number of completed subtasks, l_i = \sum_{j=1}^{m_i} l_{i(j)}, and TOL(t_i) = \sum_{j=1}^{m_i} \frac{l_{i(j)}}{r^*_{i(j)}}. SPF means that the smaller the value of t_i's WP(t_i) or TP(t_i), the higher t_i's priority. For example, if t_i is a newly submitted task, its workload processed must be 0 (or d = 0), so WP(t_i) would be equal to 0, indicating that t_i has the slowest progress:

$$\mathrm{WP}(t_i) = \frac{\sum_{j=1}^{d} l_{i(j)}}{l_i}, \qquad (15)$$

$$\mathrm{TP}(t_i) = \frac{\text{wall-clock time since } t_i\text{'s submission}}{\mathrm{TOL}(t_i)}. \qquad (16)$$

Based on the two different definitions, SPF can be split into two types, namely slowest-workload-progress-first (SWPF) and slowest-time-progress-first (STPF). We evaluated both of them in our experiment.

SSTF. SSTF selects the shortest subtask waiting in the queue. The shortest subtask is defined as the subtask (in the waiting queue) which has the minimal workload amount estimated based on single-core computation. When a subtask is completed, there must be some new resources released for other tasks, which means that a new waiting subtask will then be scheduled if the queue is non-empty. Obviously, SSTF will result in the shortest waiting time for all the subtasks/tasks on average. In fact, since we select the "best" resource in the task scheduling, the eventual scheduling effect of SSTF will make the short subtasks be executed as soon as possible. Hence, this policy is exactly the same as the min-min policy [15], which has been effective in Grid workflow scheduling. However, our experiments validate that SSTF is not the best-suited scheduling policy in our Cloud system.

LWF + LSTF. We can also combine different individual policies to generate a new scheduling policy. In our system, LWF + LSTF is devised for parallel-mode tasks, whose total execution length is determined by the longest subtask execution length (i.e., makespan); thus the subtasks with heavier workloads in the same task will have a higher priority to be scheduled. On the other hand, in order to minimize the overall waiting time, all tasks will be scheduled based on lightest workload first. LWF + LSTF means that the subtasks whose task has the lightest workload will have the highest priority, and the subtasks belonging to the same task will be scheduled based on longest subtask first. In addition, we also implement LWF + SSTF for comparison.
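Here is a minimal sketch of the queuing disciplines above (our own Python; the task/subtask field names are assumptions, not the prototype's schema), expressed as sort keys over waiting subtasks, where a lower key means scheduled earlier.

```python
import time

def priority_key(entry, policy, now=None):
    """entry = {"task": {...}, "subtask": {...}}; illustrative fields only."""
    task, sub = entry["task"], entry["subtask"]
    if policy == "FCFS":
        return sub["arrival_time"]
    if policy == "LWF":       # lightest total task workload first
        return task["workload"]
    if policy == "SSTF":      # shortest waiting subtask first (min-min-like)
        return sub["workload"]
    if policy == "SOLF":      # shortest theoretically optimal length first
        return task["tol"]
    if policy == "SWPF":      # slowest workload progress first, Formula (15)
        return task["done_workload"] / task["workload"]
    if policy == "STPF":      # slowest time progress first, Formula (16)
        return ((now or time.time()) - task["submit_time"]) / task["tol"]
    if policy == "LWF+LSTF":  # order tasks by LWF, subtasks by longest first
        return (task["workload"], -sub["workload"])
    raise ValueError(policy)

def next_subtask(queue, policy):
    """Pick the waiting subtask to schedule under the given policy."""
    return min(queue, key=lambda e: priority_key(e, policy))
```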
4.3 Minimization of Processing Overheads

In our system, in addition to the waiting time and execution time of subtasks, there are three more overheads which also need to be counted in the whole response time: the VM resource isolation cost, the data transmission cost between subtasks, and the VM's default restoring cost. Our cost-minimization strategy is to perform the data transmission and the VMM operations concurrently, based on the characterization of their costs. We also assign an extra amount of resources to super-short tasks (e.g., the tasks with TOL \le 2 seconds) in order to mitigate the impact of the overhead on their executions.
TABLE 1
Workloads (Single-Core Execution Length) of 10 Matrix Computations (Seconds)

Matrix Scale | M-M-Multi. | QR-Decom. | Matrix-Power   | M-V-Multi. | Frob.-Norm | Rank  | Solve | Solve-Tran. | V-V-Multi. | Two-Norm
500          | 0.7        | 2.6       | 2.1 (m = 10)   | 0.001      | 0.010      | 1.6   | 0.175 | 0.94        | 0.014      | 1.7
1,000        | 11         | 12.7      | 55 (m = 20)    | 0.003      | 0.011      | 8.9   | 1.25  | 7.25        | 0.021      | 9.55
1,500        | 38         | 35.7      | 193.3 (m = 20) | 0.005      | 0.03       | 29.9  | 4.43  | 24.6        | 0.047      | 29.4
2,000        | 99.3       | 78.8      | 396 (m = 10)   | 0.006      | 0.043      | 67.8  | 10.2  | 57.2        | 0.097      | 68.2
2,500        | 201        | 99.5      | 1,015 (m = 20) | 0.017      | 0.111      | 132.6 | 18.7  | 109         | 0.141      | 136.6
Specifically, we run them directly on VMs without any credit-tuning operation. Otherwise, the credit-tuning effect may work on another subtask instead of the current one, due to the inevitable delay (about 0.3 seconds) of the credit-tuning command. Details can be found in our corresponding conference paper [24].
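As an illustration of overlapping these costs, here is a minimal sketch (our own Python with hypothetical helper names: transfer_output and tune_credits are stand-ins, not the prototype's API) that runs the inter-subtask data transmission concurrently with the VMM credit-tuning operation.

```python
import threading

def prepare_next_subtask(transfer_output, tune_credits):
    """Overlap data transmission with the VMM operation, as described above.

    Both arguments are callables standing in for the real operations
    (e.g., shipping a subtask's cached output to the next VM, and issuing
    the credit-tuning command, which itself has ~0.3 s latency).
    """
    t = threading.Thread(target=transfer_output)
    t.start()
    tune_credits()   # runs while the transfer is in flight
    t.join()         # the next subtask starts only after both complete
```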
5 PERFORMANCE EVALUATION

5.1 Experimental Setting

We implement a cloud composite service prototype that can help solve complex matrix problems, each of which is allowed to contain a series of nested or parallel matrix computations. As an example of a nested matrix computation, a user may submit a request like Solve((A_{m \times n} A_{n \times m})^k, B_{m \times m}), which can be split into three steps (or subtasks): (1) matrix-product (a.k.a., matrix-matrix multiply): C_{m \times m} = A_{m \times n} \times A_{n \times m}; (2) matrix-power: D_{m \times m} = C_{m \times m}^k; (3) calculating the least-squares solution of DX = B: Solve(D_{m \times m}, B_{m \times m}).
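The decomposition can be pictured as a small task graph; the sketch below (our own Python, with illustrative names only) expresses the example request as a sequential-mode task of three subtasks.

```python
# Illustrative decomposition of Solve((A x A')^k, B) into a sequential-mode
# task; each subtask names the kind of service it would call in the prototype.
task = {
    "mode": "sequential",
    "subtasks": [
        {"service": "matrix-product", "inputs": ("A", "A_t"), "output": "C"},
        {"service": "matrix-power",   "inputs": ("C", "k"),   "output": "D"},
        {"service": "solve",          "inputs": ("D", "B"),   "output": "X"},
    ],
}
# Each subtask's output is cached in its VM until the succeeding subtask
# is notified and the data transmission is performed.
for s in task["subtasks"]:
    print(s["service"], "->", s["output"])
```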
In our experiment, we make use of ParallelColt [25] to perform the math computations, each consisting of a set of matrix computations. ParallelColt [25] is a library that can efficiently calculate complex matrix computations, such as matrix-matrix multiply and matrix decomposition, in parallel (with multiple threads) based on the symmetric multiprocessor (SMP) model.

There are in total 10 different matrix computations (such as matrix-product, matrix-decomposition, etc.), as shown in Table 1. We carefully characterize the single-core execution length (or workload) of each of them, and find that each matrix computation has its own execution type. For example, matrix-product and matrix-power are typical computation-intensive services, while the rank and two-norm computations become memory-intensive or I/O-bound ones when the matrix scale is large. Hence, each sequential-mode task that is made up of multiple different matrix computations in series can be considered a complex application whose execution type varies over time.

In each test, we randomly generate a number of user requests, each of which is composed of 5~15 sub-tasks (or matrix computation services). Such a simulation is non-trivial since each emulated matrix has to be compatible with each matrix computation (e.g., the two matrices in a matrix-product must be in the forms A_{m \times n} and B_{n \times p} respectively). Among the 10 matrix-computation services, three services are implemented as multi-threaded programs, including matrix-matrix multiply, QR-decomposition, and matrix-power, hence their computation can get an approximately linear speedup when allocated multiple processors. The other seven matrix operation services are implemented using a single thread, thus they cannot get a speedup when being allocated more than one processor. Hence, we set the capacity of any subtask performing a single-threaded service to the single-core rate, or less when its theoretically optimal resource to allocate is less than one core.

In our experiment, we are assigned eight physical nodes from the most powerful cluster in Hong Kong (namely Gideon-II [23]), and each node owns two quad-core Xeon E5540 CPUs (i.e., eight processors per node) and 16 GB of memory. There are 56 VM images (CentOS 5.2) maintained by the Network File System (NFS), so 56 VMs (seven VMs per node) will be generated at the bootstrap. XEN 4.0 [20] serves as the hypervisor on each node and dynamically allocates various CPU rates to the VMs at runtime using the credit scheduler.

We will evaluate different queuing policies and resource allocation schemes under different competitive situations with different numbers (4-24) of tasks submitted simultaneously. Table 2 lists the candidate key parameters we investigated in our evaluation. Note that the measurement unit of \alpha and \beta for RAPSM(T) is the second, while the measurement unit for RAPSM(W) is seconds \times 100, because a single core's processing ability is represented as 100 according to XEN's credit scheduler [19], [20].

TABLE 2
Candidate Key Parameters

Parameter | Value
threshold of short task length \tau (seconds) | 5, 10, 20
\eta | 1.25, 1.5, 1.75, 2
\alpha w.r.t. RAPSM(T) (seconds) | 5, 10, 20
\beta w.r.t. RAPSM(T) (seconds) | 100, 200, 300
\alpha w.r.t. RAPSM(W) (seconds \times 100) | 500, 1,000, 2,000
\beta w.r.t. RAPSM(W) (seconds \times 100) | 10,000, 20,000, 30,000

5.2 Experimental Results

5.2.1 Demonstration of Resource Contention Degrees

We first characterize the various contention degrees with different numbers of tasks submitted. The contention degree is evaluated via two metrics, the allocate-request ratio (abbreviated as ARR) and the queue length (abbreviated as QL); a short sketch of both follows this paragraph. The system's ARR at a time point is defined as the ratio of the total allocated resource amount to the total amount requested by subtasks at that moment. QL at a time point is defined as the total number of subtasks in the waiting list at that moment. There are four test cases, each of which uses a different number of tasks (4, 8, 16, and 24) submitted. The four test cases correspond to different contention degrees.
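For concreteness, a tiny sketch of the two contention metrics (our own Python; the snapshot structure is an assumption):

```python
def arr(allocated, requested):
    """Allocate-request ratio: total allocated over total requested resources."""
    return sum(allocated) / sum(requested)

def ql(waiting_list):
    """Queue length: number of subtasks in the waiting list at this moment."""
    return len(waiting_list)

# Snapshot: three running subtasks (allocated vs requested CPU rate, 100 = 1 core).
print(arr([100, 200, 150], [200, 400, 300]))  # 0.5 -> resources over-committed
```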
TABLE 4
Statistics of RER in a Competitive Situation with Parallel-Mode Tasks

prominently for a few solutions like STLF+RAPSM(T) and STLF+RAPSM(W). However, they would not impact the RER clearly in most cases. From Figs. 6b, 6d, 6f, and 6h, it is observed that with different parameters, the RERs under both LWF and SSTF are within [1.85, 2.31].

In our experiments, the most interesting and valuable finding is that AAPSM with different short-task length thresholds (\tau) leads to quite different results, which we show in Tables 5, 6, and 7. These three tables present the three key indicators, average RER, maximum
TABLE 5
Mean RER under Various Solutions with Different \tau and \eta

strategy | \tau | \eta = 1.25 | \eta = 1.5 | \eta = 1.75 | \eta = 2
FCFS | 5  | 4.050 | 4.142 | 4.131 | 4.054
FCFS | 10 | 4.122 | 3.952 | 3.924 | 3.845
FCFS | 20 | 4.121 | 4.196 | 4.139 | 4.296
LWF  | 5  | 2.071 | 2.090 | 2.169 | 2.138
LWF  | 10 | 2.268 | 2.133 | 2.179 | 2.152
LWF  | 20 | 2.194 | 1.935 | 2.194 | 2.218
SOLF | 5  | 3.316 | 3.321 | 3.552 | 3.102
SOLF | 10 | 3.241 | 3.382 | 2.989 | 2.783
SOLF | 20 | 3.375 | 3.324 | 3.305 | 3.039
SSTF | 5  | 2.111 | 2.072 | 2.275 | 2.147
SSTF | 10 | 2.202 | 2.171 | 1.980 | 2.172
SSTF | 20 | 2.322 | 1.968 | 2.092 | 2.205
STPF | 5  | 3.265 | 3.011 | 3.271 | 3.119
STPF | 10 | 3.296 | 3.024 | 3.152 | 3.132
STPF | 20 | 3.200 | 3.326 | 3.318 | 3.244
SWPF | 5  | 6.169 | 6.371 | 6.339 | 6.322
SWPF | 10 | 6.271 | 6.353 | 6.446 | 6.659
SWPF | 20 | 6.784 | 6.763 | 6.730 | 6.635

TABLE 6
Max. RER under Various Solutions with Different \tau and \eta

strategy | \tau | \eta = 1.25 | \eta = 1.5 | \eta = 1.75 | \eta = 2
FCFS | 5  | 23.392 | 25.733 | 24.742 | 24.470
FCFS | 10 | 24.184 | 22.397 | 22.633 | 23.192
FCFS | 20 | 24.204 | 25.258 | 24.391 | 25.313
LWF  | 5  | 9.164  | 7.826  | 8.543  | 8.323
LWF  | 10 | 10.323 | 7.681  | 9.136  | 9.104
LWF  | 20 | 8.106  | 5.539  | 6.962  | 8.812
SOLF | 5  | 13.294 | 15.885 | 15.743 | 13.255
SOLF | 10 | 12.735 | 18.719 | 13.939 | 10.325
SOLF | 20 | 17.070 | 15.642 | 13.379 | 13.394
SSTF | 5  | 8.091  | 7.931  | 9.276  | 9.803
SSTF | 10 | 10.245 | 8.376  | 7.224  | 11.199
SSTF | 20 | 11.418 | 6.432  | 6.706  | 9.650
STPF | 5  | 16.456 | 15.545 | 17.642 | 15.232
STPF | 10 | 16.925 | 14.797 | 15.892 | 14.710
STPF | 20 | 13.978 | 15.596 | 16.250 | 14.853
SWPF | 5  | 58.467 | 61.647 | 60.266 | 59.044
SWPF | 10 | 59.351 | 60.199 | 61.241 | 64.286
SWPF | 20 | 65.142 | 64.887 | 64.769 | 63.326
TABLE 7
Fairness of RER under Various Solutions with Different \tau and \eta

Fig. 8. Average RER with different parameters for the non-competitive situation.

instance, with RAPSM(T), {\beta = 100 seconds, \alpha = 20 seconds} exhibits good results in Fig. 8c (\eta = 1.75), but bad results in Fig. 8 (\eta = 2.0). Using RAPSM(W) with {\beta = 10,000, \alpha = 2,000}, about 93 percent of tasks' RERs are below 1 when setting \eta to 1.75, while the corresponding ratio is only 86 percent when setting \eta to 2.0.

6 RELATED WORK

Although the job scheduling problem [26] in Grid computing [27] has been extensively studied for years, most of the proposed methods (such as [28], [29]) are not suited to our cloud composite service processing environment. Grid jobs often have long execution lengths, while Cloud tasks are often short, based on [13]. Hence, a task's response time is more easily degraded by scheduling/execution overheads (such as waiting time and data transmission cost) in the Cloud environment than in the Grid environment. That is, the overheads in the Cloud environment should be treated more carefully.

Recently, many new scheduling methods have been proposed for different Cloud systems. Zaharia et al. [30] designed a task scheduling method to improve the performance of Hadoop [31] in a heterogeneous environment (such as a pool of VMs each being customized with different abilities). Unlike the FCFS policy and speculative execution model originally used in Hadoop, they designed a so-called longest approximate time to end (LATE) policy, which assigns higher priorities to the jobs with longer remaining execution lengths. Their intuition is to maximize the opportunity for a speculative copy to overtake the original and reduce the job's response time. Isard et al. [32] proposed a fair scheduling policy (namely Quincy) for a high-performance compute system with virtual machines, in order to maximize the scheduling fairness and minimize the data transmission cost meanwhile. Compared to these works, our Cloud system works with a strict payment model, under which the optimal resource allocation for each task can be computed based on convex optimization theory. Mao et al. [33], [34] proposed a solution combining dynamic scheduling and the EDF strategy, to minimize user payment and meet application deadlines meanwhile. However, they overlook the competitive situation by assuming the resource pool is always adequate and users have unlimited budgets. Many other methods, like Genetic algorithms [35] and the Simulated Annealing algorithm [36], often overlooked the execution overheads in VM operation and data transmission, and performed the evaluation through simulation.

In addition to the scheduling model, many Cloud management researchers focus on the optimization of resource assignment. Unlike Grid systems, whose compute nodes are exclusively consumed by jobs, the resource allocation in Cloud systems can be refined by leveraging VM resource isolation technology. Stillwell et al. [37] exploited how to optimize the resource allocation for service hosting on a heterogeneous distributed platform. Their research is formalized as a mixed integer linear program (MILP) problem and treated as a rational LP problem instead, also with fundamental theoretical analysis based on estimation errors. In comparison to their work, we intensively exploit the best-suited scheduling policy and resource allocation scheme for the competitive situation. We also take into account the user payment requirement, and evaluate our solution on a real-VM-deployment environment, which needs to tackle more practical technical issues like the minimization of various execution overheads. Meng et al. [38] analyzed VM pairs' compatibility in terms of the forecasted workload and estimated VM sizes. SnowFlock [39] is another interesting technology that allows any VM to be quickly cloned (similar to a UNIX process fork) such that the resource allocation can be automatically refined at runtime. Kuribayashi [40] also proposed a resource allocation method for Cloud computing environments, especially based on divisible resources. BlobCR [41] aims to optimize the performance of HPC applications on IaaS clouds at the system level, by improving the robustness of running virtual machines using virtual disk image snapshots. In comparison, our work focuses on the theoretical optimization of performance when the system runs in short supply and the corresponding implementation issues at the application level.
7 CONCLUSION AND FUTURE WORK

In this paper, we designed and implemented a loosely-coupled Cloud system with web services deployed on multiple VMs, aiming to improve the QoS of each user request and maximize the fairness of treatment at runtime. Our contribution is three-fold: (1) we studied the best-suited task scheduling policy with VMs; (2) we explored an optimal resource allocation scheme and an adjusted strategy to suit the competitive situation; (3) the processing overhead is minimized in our design. Based on our experiments, we summarize the following lessons.

- We confirm that the best policy for scheduling sequential-mode tasks in the competitive situations is either lightest-workload-first or SSTF. Each of them improves the performance by about 86 percent compared to FCFS. As for the parallel-mode tasks, the best-fit policy is combining LWF and longest subtask first, and the average RER is lower than that of other solutions by 3.8-51.6 percent.

- For a competitive situation, the best solution is combining lightest-workload-first with AAPSM and RAPSM (in absolute terms, LWF+AAPSM+RAPSM with the short task length threshold and the extension coefficient being set to 20 seconds and 1.5 respectively). It outperforms other solutions in the competitive situation by 16+% w.r.t. the worst-case response time. The fairness under this solution is about 0.709, which is higher than that of the second-best solution (SSTF+AAPSM+RAPSM) by 7.4+%.

- For a non-competitive situation, {\beta = 200 seconds, \alpha = 5 seconds} serves as the best assignment of the parameters, regardless of the setting of the extension coefficient (\eta).

In the future, we plan to further exploit an adaptive solution that can dynamically optimize the performance in both competitive and non-competitive situations. We also plan to improve the fault tolerance and resilience of our cloud system.

ACKNOWLEDGMENTS

This work was supported by the ANR project Clouds@home (ANR-09-JCJC-0056-01), by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357, and in part by HKU 716712E.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the clouds: A Berkeley view of cloud computing," EECS, Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2009-28, Feb. 2009.
[2] L. M. Vaquero, L. Rodero-Merino, J. Caceres, and M. Lindner, "A break in the clouds: Towards a cloud definition," SIGCOMM Comput. Commun. Rev., vol. 39, no. 1, pp. 50-55, 2009.
[3] Google App Engine. (2008). [Online]. Available: http://code.google.com/appengine/
[4] J. E. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes. San Mateo, CA, USA: Morgan Kaufmann, 2005.
[5] D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat, "Enforcing performance isolation across virtual machines in Xen," in Proc. 7th ACM/IFIP/USENIX Int. Conf. Middleware, 2006, pp. 342-362.
[6] J. N. Matthews, W. Hu, M. Hapuarachchi, T. Deshane, D. Dimatos, G. Hamilton, M. McCabe, and J. Owens, "Quantifying the performance isolation properties of virtualization systems," in Proc. ACM Workshop Exp. Comput. Sci., 2007, pp. 1-9.
[7] S. Chinni and R. Hiremane, "Virtual machine device queues," Virt. Technol. White Paper, 2008, pp. 1-22. [Online]. Available: http://www.intel.com/content/www/us/en/virtualization/vmdq-technology-paper.html
[8] T. Cucinotta, D. Giani, D. Faggioli, and F. Checconi, "Providing performance guarantees to virtual machines using real-time scheduling," in Proc. 5th ACM Workshop Virtualization High-Perform. Cloud Comput., 2010, pp. 657-664.
[9] R. Nathuji, A. Kansal, and A. Ghaffarkhah, "Q-clouds: Managing performance interference effects for QoS-aware clouds," in Proc. ACM Eur. Conf. Comput. Syst., 2010, pp. 237-250.
[10] R. Ghosh and V. K. Naik, "Biting off safely more than you can chew: Predictive analytics for resource over-commit in IaaS cloud," in Proc. IEEE 5th Int. Conf. Cloud Comput., 2012, pp. 25-32.
[11] Google cluster-usage traces. (2011). [Online]. Available: http://code.google.com/p/googleclusterdata
[12] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch, "Towards understanding heterogeneous clouds at scale: Google trace analysis," Intel Sci. Technol. Center Cloud Comput., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. ISTC-CC-TR-12-101, Apr. 2012.
[13] S. Di, D. Kondo, and W. Cirne, "Characterization and comparison of cloud versus grid workloads," in Proc. IEEE Int. Conf. Cluster Comput., 2012, pp. 230-238.
[14] M. Rahman, S. Venugopal, and R. Buyya, "A dynamic critical path algorithm for scheduling scientific workflow applications on global grids," in Proc. 3rd IEEE Int. Conf. e-Sci. Grid Comput., 2007, pp. 35-42.
[15] M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund, "Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems," in Proc. 8th Heterogeneous Comput. Workshop, 1999, p. 30.
[16] EDF Scheduling. (2008). [Online]. Available: http://en.wikipedia.org/wiki/earliest_deadline_first_scheduling
[17] S. Di, Y. Robert, F. Vivien, D. Kondo, C.-L. Wang, and F. Cappello, "Optimization of cloud task processing with checkpoint-restart mechanism," in Proc. IEEE/ACM Int. Conf. High Perform. Comput., Netw., Storage Anal., 2013, pp. 64:1-64:12.
[18] L. Huang, J. Jia, B. Yu, B. G. Chun, P. Maniatis, and M. Naik, "Predicting execution time of computer programs using sparse polynomial regression," in Proc. 24th Int. Conf. Neural Inf. Process. Syst., 2010, pp. 1-9.
[19] Xen credit scheduler. (2003). [Online]. Available: http://wiki.xensource.com/xenwiki/creditscheduler
[20] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proc. 19th ACM Symp. Operating Syst. Principles, 2003, pp. 164-177.
[21] Amazon Elastic Compute Cloud. (2006). [Online]. Available: http://aws.amazon.com/ec2/
[22] M. Feldman, K. Lai, and L. Zhang, "The proportional-share allocation market for computational resources," IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 8, pp. 1075-1088, Aug. 2009.
[23] Gideon-II Cluster. (2010). [Online]. Available: http://i.cs.hku.hk/~clwang/Gideon-II
[24] S. Di, D. Kondo, and C. L. Wang, "Optimization and stabilization of composite service processing in a cloud system," in Proc. IEEE/ACM 21st Int. Symp. Quality Serv., 2013, pp. 1-10.
[25] P. Wendykier and J. G. Nagy, "Parallel Colt: A high-performance Java library for scientific computing and image processing," ACM Trans. Math. Softw., vol. 37, pp. 31:1-31:22, Sep. 2010.
[26] C. Jiang, C. Wang, X. Liu, and Y. Zhao, "A survey of job scheduling in grids," in Proc. Joint 9th Asia-Pacific Web and 8th Int. Conf. Web-Age Inf. Manage. Conf. Adv. Data Web Manage., 2007, pp. 419-427.
[27] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure (The Morgan Kaufmann Series in Computer Architecture and Design). San Mateo, CA, USA: Morgan Kaufmann, Nov. 2003.
[28] E. Imamagic, B. Radic, and D. Dobrenic, "An approach to grid scheduling by using Condor-G matchmaking mechanism," in Proc. 28th Int. Conf. Inf. Technol. Interfaces, 2006, pp. 625-632.
[29] Y. Gao, H. Rong, and J. Z. Huang, "Adaptive grid job scheduling with genetic algorithms," Future Generation Comput. Syst., vol. 21, pp. 151-161, Jan. 2005.
[30] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proc. 8th USENIX Conf. Operating Syst. Des. Implementation, 2008, pp. 29-42.
[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proc. IEEE 26th Symp. Mass Storage Syst. Technol., 2010, pp. 1-10.
[32] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: Fair scheduling for distributed computing clusters," in Proc. ACM SIGOPS 22nd Symp. Operating Syst. Principles, 2009, pp. 261-276.
[33] M. Mao, J. Li, and M. Humphrey, "Cloud auto-scaling with deadline and budget constraints," in Proc. 11th IEEE/ACM Int. Conf. Grid Comput., 2010, pp. 41-48.
[34] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," in Proc. IEEE/ACM Int. Conf. High Perform. Comput., Netw., Storage Anal., 2011, pp. 49:1-49:12.
[35] S. Kaur and A. Verma, "An efficient approach to genetic algorithm for task scheduling in cloud computing," Int. J. Inf. Technol. Comput. Sci., vol. 10, pp. 74-79, 2012.
[36] S. Zhan and H. Huo, "Improved PSO-based task scheduling algorithm in cloud computing," J. Inf. Comput. Sci., vol. 9, no. 13, pp. 3821-3829, 2012.
[37] M. Stillwell, F. Vivien, and H. Casanova, "Virtual machine resource allocation for service hosting on heterogeneous distributed platforms," in Proc. IEEE Int. Parallel Distrib. Process. Symp., Shanghai, China, 2012, pp. 786-797.
[38] X. Meng et al., "Efficient resource provisioning in compute clouds via VM multiplexing," in Proc. 7th Int. Conf. Autonomic Comput., 2010, pp. 11-20.
[39] H. A. L. Cavilla, J. A. Whitney, A. M. Scannell, P. Patchin, S. M. Rumble, E. de Lara, M. Brudno, and M. Satyanarayanan, "SnowFlock: Rapid virtual machine cloning for cloud computing," in Proc. 4th ACM Eur. Conf. Comput. Syst., 2009, pp. 1-12.
[40] S.-i. Kuribayashi, "Optimal joint multiple resource allocation method for cloud computing environments," Int. J. Res. Rev. Comput. Sci., vol. 2, pp. 1-8, 2011.
[41] B. Nicolae and F. Cappello, "BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots," in Proc. IEEE/ACM Int. Conf. High Perform. Comput., Netw., Storage Anal., 2011, pp. 34:1-34:12.

Sheng Di received his Master's (M.Phil.) degree from Huazhong University of Science and Technology in 2007 and his Ph.D. degree from The University of Hong Kong in November 2011, both in Computer Science. Dr. Di is currently a postdoctoral researcher at Argonne National Laboratory, Lemont, USA. His research interests involve the optimization of distributed resource allocation in large-scale cloud platforms, the characterization and prediction of workload at Cloud data centers, and fault tolerance on Cloud/HPC.

Derrick Kondo received the bachelor's degree from Stanford University in 1999, and the master's and PhD degrees from the University of California at San Diego in 2005, all in computer science. He is currently a tenured research scientist at INRIA-Grenoble, Montbonnot-Saint-Martin, France. His current research interests include reliability, fault-tolerance, statistical analysis, and job and resource management. He received the Young Researcher Award (similar to the US National Science Foundation (NSF)'s CAREER Award) in 2009, the Amazon Research Award in 2010, and the Google Research Award in 2011. He is a member of the IEEE.

Cho-Li Wang is currently a Professor in the Department of Computer Science at The University of Hong Kong. He graduated with a B.S. degree in Computer Science and Information Engineering from National Taiwan University in 1985 and a Ph.D. degree in Computer Engineering from the University of Southern California in 1995. Prof. Wang's research is broadly in the areas of parallel architecture, software systems for Cluster computing, and virtualization techniques for Cloud computing. His recent research projects involve the development of parallel software systems for multicore/GPU computing and multi-kernel operating systems for future manycore processors. Prof. Wang has published more than 150 papers in various peer-reviewed journals and conference proceedings. He is/was on the editorial boards of several scholarly journals, including IEEE Transactions on Cloud Computing, IEEE Transactions on Computers, and the Journal of Information Science and Engineering. He also serves as a coordinator (China) of the IEEE Technical Committee on Parallel Processing (TCPP).
" For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.