
Batch Queue Resource Scheduling for Workflow Applications

Yang Zhang, Charles Koelbel and Keith Cooper

Computer Science Department


Rice University
Houston, TX 77005
Email: {yzhang8,chk,keith}@rice.edu

Abstract

Workflow computations have become a major programming paradigm for scientific applications. However, acquiring enough computational resources to execute a workflow poses a challenge in a batch queue controlled resource due to the space-sharing nature of the resource management policy. This paper introduces a scheduling technique that aggregates a workflow application into several subcomponents. It then uses the batch queue to acquire resources for each subcomponent, overlapping the resource provisioning overhead (wait time) of one with the execution of others. We implemented a prototype of this technique and tested it using job submission logs from five high performance computing centers. The results show that our approach can eliminate as much as 70% of the wait time over more traditional techniques that request resources for individual workflow nodes or that acquire all the resources for the whole workflow at once.

I. Introduction

Workflow applications — high-level analyses structured as a set of inter-dependent tasks — have become an essential way for many scientists to compose and execute their computations on high performance computing resources. Workflow applications are widely used in scientific fields as diverse as astronomy [?], biology [?], [?], oceanography [?], and earthquake science [?].

At the same time, clusters (parallel computers with high-speed interconnects and shared file systems) have become the most common high-performance computing platform. Consequently, workflow applications are often executed on clusters. The workflow execution systems can get access to a cluster either locally, through collaborative organizations such as TeraGrid [?], or through national supercomputing centers like TACC [?]. In any case, these clusters are shared and usually managed by a local resource management system that has its own resource sharing methodology and policy. Among them, commercial or open source batch queue scheduling software [?], [?], [?] is the most popular resource management system. Section ?? gives more details on the background of both workflow applications and batch schedulers.

The main goals of a site using batch queues are usually to achieve high throughput and maximize system utilization. Consequently, many production resources have long queue wait times due to the high utilization levels. In addition, although it is not unusual for a single cluster to have several thousand processors, a single user usually can get only a small portion of the total available resources (without special arrangements). This creates performance problems for large scale workflow applications because each sub-task in the workflow could experience long delays in the job queue before it runs. The queue wait time overhead is sometimes much more than the workflow application's runtime [?]. Alternately, one could submit an entire workflow as a single batch queue job. However, this might cause an even longer wait for more resources to become available at once.

Our work seeks to reduce the workflow turnaround time by intelligently using batch queues without relying on reservations. We accomplish this by aggregating workflow tasks together and submitting them as a single job into the queue. Section ?? describes our method in greater detail. This approach can greatly reduce the number of jobs a workflow execution system submits to the batch queue. It also makes smaller placeholder requests than the virtual reservation approach. By overlapping some tasks' wait times with others' executions, we further shorten the batch queue wait times for the workflow applications. As we will see in Section ??, our scheduling reduces the queue wait time overhead while not disturbing normal batch queue operation. We conclude our presentation with a discussion of related work in Section ?? and our conclusions and future work in Section ??.

II. Background

A. Batch Queues

Batch queues have become the most popular resource management method on computational clusters. A batch queue system is normally a combination of a parallel-aware resource management system (which determines "where" a job runs) and a policy based job scheduling engine (which determines "when" a job runs). We are mostly interested in the job scheduler component, treating the individual processors as homogeneous. To illustrate how this scheduler works, we describe the widely-used open-source Maui batch queue scheduler [?], [?]. The experiments in Section ?? are based on simulations of this scheduler.

The Maui scheduler, like many batch queue schedulers, is essentially a policy based reservation system. The key idea is to calculate a priority for each job in the queue based on aspects of the job and the policy of the queue system. The priority of each batch queue job is determined by job properties, such as the requested resource requirements (number of processors and total time), its owner's credentials, and the time it has waited in the queue. These properties are combined in a formula with weights configured by the system administrator. For example, to favor large jobs, a site would choose a high (and positive) weight for the resource requirements.

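To make the weighted-priority idea concrete, the following Python sketch combines a few illustrative job properties into a single score. The attribute set and the weight values are our own assumptions for illustration; they are not Maui's actual configuration parameters.

from dataclasses import dataclass

@dataclass
class Job:
    procs: int           # requested processors
    req_hours: float     # requested wall-clock time
    queued_hours: float  # time already spent waiting in the queue
    fair_share: float    # owner's credential / fair-share factor, 0..1

# Illustrative site policy: a positive resource weight favors large jobs.
WEIGHTS = {"resources": 10.0, "wait": 1.0, "fair_share": 100.0}

def priority(job: Job) -> float:
    """Weighted combination of job properties, in the spirit of Maui's
    priority formula (attribute names and weights are assumptions)."""
    resources = job.procs * job.req_hours
    return (WEIGHTS["resources"] * resources
            + WEIGHTS["wait"] * job.queued_hours
            + WEIGHTS["fair_share"] * job.fair_share)

# Example: a 64-processor, 12-hour request that has waited 5 hours.
print(priority(Job(procs=64, req_hours=12.0, queued_hours=5.0, fair_share=0.5)))
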
After all jobs' priorities are calculated, the Maui scheduler starts all the highest-priority jobs that it can run immediately. It then makes a reservation in the future for the next highest priority job, according to the already running jobs' estimated finish times, to ensure it will start to run as soon as possible. Given that reservation, a backfill mechanism attempts to find jobs that can start immediately and finish before the reservation time. Once a job begins execution, it runs to completion or until it exhausts its requested resources.

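The backfill step can be pictured as the short check below: a queued job may jump ahead only if it fits in the currently free processors and its requested walltime guarantees it finishes before the reservation held for the top-priority waiting job. This is a simplified sketch, not the scheduler's actual implementation.

from typing import List, Tuple

def backfill(queue: List[Tuple[str, int, float]],  # (job id, procs, requested hours)
             free_procs: int,
             now: float,
             reservation_start: float) -> List[str]:
    """Return ids of queued jobs that can start now without delaying the
    reservation of the highest-priority waiting job (illustrative only)."""
    started = []
    for job_id, procs, req_hours in queue:
        if procs <= free_procs and now + req_hours <= reservation_start:
            started.append(job_id)
            free_procs -= procs
    return started

# Example: 32 free processors until a reservation 4 hours from now.
print(backfill([("a", 16, 3.0), ("b", 64, 1.0), ("c", 8, 6.0)], 32, 0.0, 4.0))
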
Maui, like some other schedulers [?], [?], [?], can provide advance reservation services at a user level. This allows the user to request a specific number of resources for a given period of time, effectively gaining a set of dedicated resources and eliminating the queue wait time. However, advance reservation is not available at all sites, usually involves system administrator assistance, and always requires notice beforehand. Furthermore, Snell et al. [?] showed that advance reservation can decrease the system utilization and has the potential to introduce deadlocks. We therefore avoid advance reservations in our work.

One advanced feature of Maui that we do use is the start time estimation functionality. A user can invoke the showstart command to get the estimated start time of a job in the queue or of a new job (specified with a number of processors and a duration) to be submitted. This can be done by computing the job's priority, building (or querying) the queue's future schedule, and determining when the job would run. Note that, because new high-priority jobs could be submitted before the queried job runs, the estimate may not be exact. However, it is a useful piece of information to use in scheduling.

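A rough analogue of this estimate can be computed from the running jobs' requested end times: assume processors are released exactly when the running jobs' requested walltimes expire, and find the earliest point at which enough processors are free for the queried job. The sketch below is such a best-effort estimator; it ignores priorities and future submissions, which is precisely why estimates of this kind are not exact.

def estimate_start(total_procs: int,
                   running: list,      # list of (procs, requested_end_time)
                   queued_procs: int) -> float:
    """Earliest time at which `queued_procs` processors could be free,
    assuming each running job ends at its requested end time
    (an illustrative, best-effort estimate)."""
    free = total_procs - sum(p for p, _ in running)
    if free >= queued_procs:
        return 0.0  # the job could start now
    # Release processors in order of the running jobs' requested end times.
    for procs, end_time in sorted(running, key=lambda j: j[1]):
        free += procs
        if free >= queued_procs:
            return end_time
    return float("inf")  # the request exceeds the whole machine

# Example: a 128-processor cluster with two running jobs.
print(estimate_start(128, [(96, 2.0), (16, 5.0)], queued_procs=64))
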
B. Workflow Application Execution

A workflow consists of a set of tasks that produce and consume data. The data transfer creates a dependence between the tasks. In scientific applications, there can be hundreds of such tasks, which can range from setting up simulation conditions to performing large computations to visualizing the results. In this paper, we represent such workflows as directed acyclic graphs (DAGs), where nodes represent the tasks and edges represent the dependences. Section ?? has a few examples.

Executing a workflow is conceptually simple. Whenever a node is ready to execute (i.e., all its predecessors have completed), it can be scheduled for execution. However, doing this naively in a batch queue environment could potentially create long waits for every task to begin. Nevertheless, this is common practice. There are two general ways [?] (other than advance reservations) to reduce this batch queue overhead. One way is to aggregate the workflow tasks into larger groups [?]. The other is to use virtual reservation technology [?], [?], [?], [?]. This provisioning technique enables users to create a personal dedicated resource pool in a space-shared computing environment. Although there are various implementations, the key idea is to submit a big placeholder job into the space-shared resource site. When the placeholder job gets to run, it usually installs and runs a user-level resource manager on its assigned computing nodes. The user-level resource manager (in our case, the workflow execution system) can then schedule jobs onto those computing nodes without going through the site's resource manager again. Our work draws inspiration from the virtual reservation implementations, but attempts to choose a more propitious size for the placeholder job.

III. Workflow Application Clustering

Fig. 1. Workflow Application Cluster
Fig. 2. Workflow Application Cluster by Level

Traditional DAG clustering algorithms aggregate the workflow tasks into larger units (to reduce the potential communications). Figure ?? shows an example. The left side of the figure is the original DAG that represents a workflow application. The right side of the figure is a clustered version of the same DAG, in which we aggregate all the tasks in the same level into one aggregation. Our goal is to choose a clustering algorithm that will reduce the total batch queue wait time. The main idea behind our approach is that we can aggregate the workflow by level and submit a placeholder job for the later levels before their predecessor finishes. In this way, we can overlap the running time of the predecessor level with the wait time of the successor levels.

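Level numbers of this kind can be computed with a simple longest-path labeling over the DAG. The sketch below shows one way to do it; the adjacency-list representation is a hypothetical stand-in for the actual data structures used by the system.

from collections import defaultdict

def levelize(succs: dict) -> dict:
    """Assign each task the level 1 + max(level of its predecessors),
    with entry tasks at level 0.  `succs` maps task -> list of successors."""
    preds = defaultdict(list)
    tasks = set(succs)
    for task, children in succs.items():
        for child in children:
            preds[child].append(task)
            tasks.add(child)
    level = {}
    def depth(t):
        if t not in level:
            level[t] = 0 if not preds[t] else 1 + max(depth(p) for p in preds[t])
        return level[t]
    for t in tasks:
        depth(t)
    return level

# A four-level DAG in the spirit of Figure 1 (the exact edges are assumed here).
dag = {1: [2, 3, 4, 5, 6], 2: [7], 3: [7], 4: [8], 5: [9], 6: [9],
       7: [10], 8: [10], 9: [10], 10: []}
print(levelize(dag))  # task 1 at level 0, tasks 2-6 at level 1, 7-9 at 2, 10 at 3
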
Figure ?? illustrates this idea. The left shows the possible result of grouping the workflow DAG in Figure ?? into two aggregations and submitting them in turn. The yellow rectangles represent the wait times of the two placeholder jobs in the queue. A placeholder job, represented by a rectangle that contains one or more levels of tasks, is submitted into the queue as soon as its predecessor placeholder job starts. It asks for enough resources for the tasks it holds to run in full parallelism. The wait time seen by the users for the clustering on the left is the dark yellow area marked "real wait time". We can see from the figure that it is less than the queue wait time for the second task because of the overlap with task 1's execution. Ideally, if the first placeholder job gets to run immediately and the later jobs' wait times do not exceed their predecessors' run times, the queue wait time for the entire workflow application is eliminated, as shown on the right side of Figure ??. However, this perfect overlap cannot be guaranteed. Furthermore, if the wait time for a placeholder job is less than its predecessor's run time (as is the case for task 10), it must pad its requested time to honor its dependences. In turn, this will affect the wait time of the placeholder job. Balancing these effects requires heuristic scheduling.

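The exposed ("real") wait time in this picture can be illustrated with a small calculation: each placeholder after the first is submitted when its predecessor starts, so only the part of its queue wait that exceeds the predecessor's run time is visible to the user. The model below is deliberately simplified (it ignores the request padding discussed above), and the numbers are made up for illustration.

def exposed_wait(wait_times, run_times):
    """User-visible wait when placeholder i+1 is submitted at the moment
    placeholder i starts.  wait_times[i] and run_times[i] are the queue wait
    and execution time of placeholder i (simplified illustrative model)."""
    total = wait_times[0]  # the first placeholder's wait is fully exposed
    for i in range(1, len(wait_times)):
        total += max(0.0, wait_times[i] - run_times[i - 1])
    return total

# Two placeholders: the second waits 3 hours, but its predecessor runs for
# 2 of those hours, so only 1 hour of its wait is exposed.
print(exposed_wait([1.0, 3.0], [2.0, 4.0]))  # 1.0 + max(0, 3.0 - 2.0) = 2.0
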
Our algorithm consists of two interrelated parts: an application manager shown in Figure ??, and a "peeling" procedure shown in Figure ??. The application manager is responsible for launching the workflow application and monitoring its progress. In general, it chooses partial DAGs and submits placeholder jobs to the batch queue system. Individual workflow tasks execute in the placeholder jobs when those jobs come to the front of the queue, with the application manager enforcing their dependences. The peeling procedure selects the partial DAGs to minimize the exposed waiting time. We now consider the parts in turn.

Figure ?? shows the application manager. After selecting and submitting the initial partial DAG (lines 1-5), the manager becomes an event-driven system. The primary events that it responds to are:
• A placeholder job starts to run (lines 7-16). The manager first starts all the workflow tasks associated with that job whose predecessors have finished. Then it invokes the peeling procedure to form the next placeholder job and submit it to the queue.
• A placeholder job finishes running (lines 17-25). Normally, no processing is needed. However, if the placeholder is terminated before all its tasks complete (i.e., because some predecessors were delayed in the batch queue), the manager must clean up. It cancels any placeholders that have not started, since some of their predecessors may be delayed. It also calls the peeling procedure to reschedule the unfinished DAG nodes (both interrupted tasks and those not yet run) and submits the new placeholder job into the queue.
• A DAG task finishes (lines 26-32). The manager starts all the successor tasks whose placeholder job is already running. One subtlety in the application manager is that the successors of a DAG node may be in the same placeholder or a different one. In the latter case, the manager must handle the possibility that a placeholder starts without any runnable tasks (lines 28-30). If all of a placeholder's tasks are finished, the manager finishes the job to free the batch queue resource.

Algorithm: runDAG(DAG dag, int sub_time)
1  task[] partial_dag ← levelize(dag);
2  int count ← 0;
3  Placeholder job ← peelLevel(partial_dag, sub_time, 0);
4  job.name ← count;
5  submit job;
6  while (dag is not finished)
7    listen to batch queue and task events;
8    if (placeholder job_n starts to run at time t)
9      for all (task in job_n.getTasks())
10       if (all task predecessors have finished)
11         start task;
12     ear_finTime ← job_n.runTime;
13     partial_dag ← levelize(dag.unmappedTasks());
14     job ← peelLevel(partial_dag, t, ear_finTime);
15     job.name ← ++count;
16     submit job;
17   else if (placeholder job_n finishes running at time t)
18     if (job_n has unfinished tasks)
19       partial_dag ← levelize(job_n.unfinishedTasks());
20       for all (pending placeholder job job_m)
21         cancel job_m;
22         add job_m.tasks() to partial_dag;
23       Placeholder jobResub ← peelLevel(partial_dag, t, 0);
24       map all tasks in the partial_dag to jobResub;
25       submit jobResub;
26   else if (task dagTask finishes running at time t)
27     delete dagTask from its placeholder job;
28     for all (dagTask's successor task chd_task)
29       if (chd_task's associated placeholder job is running)
30         start chd_task;
31     if (dagTask's placeholder job has no more tasks to run)
32       stop dagTask's placeholder job;

Fig. 3. The DAG Application Manager

Algorithm: peelLevel(levelized DAG, int sub_time, int ear_time)
1  int runTime_all, waitTime_all;
2  int peel_runTime[2], peel_waitTime[2];
3  runTime_all ← est_runTime(DAG);
4  waitTime_all ← est_waitTime(runTime_all, DAG.width, sub_time);
5  peel_runTime[0] ← runTime_all;
6  peel_waitTime[0] ← waitTime_all;
7  int level = groupLevel(DAG, sub_time, ear_time,
8                         peel_runTime, peel_waitTime);
9  if (level == DAG.height)
10   if (runTime_all * 2 < waitTime_all)
11     return the whole remaining DAG in a batch queue job;
12   else
13     return submit the remaining DAG in individual mode;
14 else
15   group levels into a partial_dag;
16   map each dag job to the batch queue job;
17   return the partial_dag in a placeholder job;

Fig. 4. The DAG Peeling Procedure

We choose to submit a new placeholder job only after its predecessor begins running. There are several reasons for this design. In our experience with real queues, we discovered that multiple outstanding jobs in the queue interfered with each other. In turn, this often caused the wait time for already-submitted jobs to lengthen, which both added overhead and invalidated our existing schedules. Therefore, we did not have a good estimate of the later placeholders' start times. Although our current design misses the potential of overlapping two placeholder jobs' wait times with each other or with running jobs, we can calculate the earliest start time of all the remaining tasks. This is one key to the aggregation decision described in Figure ??.

Figure ?? shows the peeling procedure used by the application manager. We refer to this process as "peeling" because it successively peels levels of the DAG off of the unfinished work list. First (lines 1-6), the main peelLevel function estimates the wait time to submit the entire DAG as a single placeholder job. It then invokes the groupLevel function (lines 7-8 and Figure ??) to search for a better alternative. If groupLevel does not improve the wait time (lines 10-13), the peeling procedure chooses to submit the DAG either as a single placeholder job or as one job per task. The decision depends on whether the total wait time as a single job is twice the total run time of the DAG. The intuition for this is that individual submission can take advantage of free resources or the backfill window. When the one giant placeholder job's wait time is twice as long as the run time, the individual submission has a better chance to finish earlier. This is a heuristic parameter chosen empirically. Otherwise, we use the partial DAG returned by groupLevel. The earliest job start estimation we used is a best-effort approach like the showstart command in Maui. However, our experience shows it is a reliable indicator of the wait time; in one experiment the mean difference was within 5% of the real wait time.

Figure ?? shows the key groupLevel procedure. Although the logic is somewhat complex, in essence we perform a greedy search for an aggregation of the DAG that has enough granularity to hide later wait times and is wait-effective. We define the wait effectiveness of a job as the ratio between its wait time and its running time; a smaller ratio is better. The intuition behind this is that we want a job to either wait less or finish more tasks. However, we do not search for the globally best wait-effectiveness. This is because, once we group several layers of the DAG into a wait-effective aggregation, any later job's wait time can be overlapped with the run time of this aggregation. Continually adding levels onto the current aggregation negates this benefit.

Here is some more detailed explanation of our algorithm. After some initialization in lines 1-6, the main loop in lines 8-37 repeatedly moves one DAG level from the remaining work to the next placeholder job until the aggregation is less wait-effective than in the previous round. For each candidate job, lines 9-18 adjust the placeholder's requested time to allow the workflow tasks to complete.

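As a compact illustration of the two decisions just described, the sketch below compares the wait-effectiveness ratios of two candidate aggregations and applies the empirically chosen 2x rule (following the description in the text) for falling back to whole-DAG or individual submission. It is a simplified stand-in for the peelLevel/groupLevel logic, not a transcription of it.

def wait_effectiveness(wait_time: float, run_time: float) -> float:
    """Ratio of wait time to running time; smaller is better, since the job
    should either wait less or finish more work."""
    return wait_time / run_time

def fallback_mode(wait_all: float, run_all: float) -> str:
    """Fallback when grouping does not improve the wait time: submit the whole
    DAG as one job unless its wait would be at least twice its run time."""
    return "single placeholder job" if wait_all < 2 * run_all else "individual tasks"

# An aggregation that waits 2 hours and runs 5 beats one that waits 6 and runs 8.
print(wait_effectiveness(2, 5) < wait_effectiveness(6, 8))  # True
print(fallback_mode(wait_all=20, run_all=6))                # individual tasks
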
Algorithm: groupLevel(levelized DAG, int sub_time, int ear_sTime,
                      int peel_runTime[2], int peel_waitTime[2])
1  int real_runTime[2];
2  int runTime_all, waitTime_all, leeway;
3  runTime_all ← peel_runTime[0];
4  waitTime_all ← peel_waitTime[0];
5  real_runTime[0] ← peel_runTime[0];
6  partial_dag ← level one of DAG;
7  boolean giant ← true;
8  while partial_dag != DAG
9    peel_runTime[1] ← est_runTime(partial_dag);
10   real_runTime[1] ← peel_runTime[1];
11   do
12     peel_runTime[1] ← peel_runTime[1] + leeway/2;
13     peel_waitTime[1] ←
14       est_waitTime(peel_runTime[1], DAG.width, sub_time);
15     leeway ← ear_sTime + real_runTime[1] - peel_waitTime[1];
16   while leeway > 10 mins
17   if (leeway > 0)
18     peel_runTime[1] ← peel_runTime[1] + leeway;
19   int real_WaitTime ← peel_waitTime[1] - ear_sTime;
20   if (real_WaitTime < 0)
21     real_WaitTime ← peel_waitTime[1];
22   if (giant)
23     if (real_WaitTime > real_runTime[1])
24       add one level to partial_dag;
25       continue;
26     giant ← false;
27   if (peel_waitTime[1] - ear_sTime > 0)
28     if (peel_waitTime[1] / real_runTime[1]
29         > peel_waitTime[0] / real_runTime[0])
30       break;
31   if (peel_waitTime[1] / real_runTime[1]
32       > waitTime_all / runTime_all)
33     break;
34   peel_waitTime[0] ← peel_waitTime[1];
35   peel_runTime[0] ← peel_runTime[1];
36   real_runTime[0] ← real_runTime[1];
37   add one level to partial_dag;
38 if (giant)
39   return DAG.height;
40 else
41   return partial_dag.height - 1;

Fig. 5. The Peel Level Decision Procedure

Fig. 6. Workflow Application Level Decision

As the left side of Figure ?? shows, this is sometimes necessary because the (estimated) queue wait time is less than the time to complete the current job, creating what we term the leeway. A simple iteration adds the leeway to the job request until it is insignificant. (Of course, if the wait time is more than the time to execute the predecessors, then no adjustment is needed, as in the right side of Figure ??.) The loop then operates in one of two modes based on whether a good aggregation has been identified. If no aggregation has been selected, more levels are added until the real run time is significant enough to create overlap for the next aggregation (lines 19-25). Once this happens, the current candidate is marked as a viable aggregation. From then on, levels are added only while the wait-effectiveness of the aggregation continues to improve (lines 27-37).

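The leeway adjustment in lines 11-18 can be pictured as the fixed-point iteration below: the requested time is padded until the estimated wait no longer leaves a significant gap before the predecessors finish. The est_wait argument is a stand-in for the scheduler-provided estimate, the 10-minute threshold follows the pseudocode, and the loop ordering is slightly simplified.

def pad_for_leeway(run_time: float, ear_start: float, est_wait, threshold=10 / 60):
    """Pad a placeholder's requested time (hours) until its estimated wait
    covers the gap (`leeway`) before its predecessors finish.
    `est_wait(request)` is an assumed estimator, e.g. backed by showstart."""
    request = run_time
    while True:
        wait = est_wait(request)
        leeway = ear_start + run_time - wait
        if leeway <= threshold:
            break
        request += leeway / 2  # grow the request and re-estimate the wait
    if leeway > 0:
        request += leeway
    return request

# Toy estimator: bigger requests wait longer (half an hour per requested hour).
print(pad_for_leeway(run_time=4.0, ear_start=3.0, est_wait=lambda r: 0.5 * r))
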
IV. Experiments

A. Experimental Methodology

To test the performance of our algorithm, we developed a prototype batch queue system simulator that implements the core algorithms of the Maui batch queue scheduler described in [?]. The input of the system is a batch queue log obtained from a production high performance computing cluster and a batch queue policy configuration file. It simulates the batch queue execution step by step based on the input. We also implemented the job start time estimation function (the showstart command). The estimation is based on the batch queue policy and all the existing queued and running jobs' maximum requested times. It does not forecast any future job submissions. Therefore, it is a best-effort estimation within the knowledge of a batch queue scheduler.

We implemented the methods of Section ?? to submit placeholder jobs to this simulator. We also implemented the runtime algorithm in that section, using events generated by our simulator to drive the workflow management. We also implemented two other ways to execute a workflow application on batch queue based resources. The first is a straightforward way to submit each individual task to the batch queue when it is available to run, which we will refer to as the individual submission method. The second is to submit a giant placeholder job that requests enough resources for the entire DAG to finish, which we will refer to as the giant submission method. We compare our algorithm, which we will refer to as the hybrid submission method, to the individual and giant methods by simulating a DAG submission into the queue using the different methods with exactly the same experimental configuration.

Fig. 7. Workflow application DAGs: (a) EMAN, (b) BLAST, (c) Montage, (d) Gaussian Elimination, (e) Fast Fourier Transform

B. Experimental Setting

We generate DAG configurations for five high performance computing applications that represent typical parallel computing paradigms, as shown in Figure ??. EMAN [?] is a computational biology application that has two parallel phases connected with single execution steps. BLAST [?] is a bioinformatics application that has a sequence of parallel executions. Montage [?] is an astronomical application consisting of several inter-leaved layers of parallel executions. We also use two traditional high performance algorithms, Fast Fourier Transform (FFT) and Gaussian elimination. For each application, we generate 25 configurations for different data sizes. The total number of tasks in a workflow ranges from dozens to thousands, maximum parallelism ranges from 5 to 256, and total running time ranges from several hours to a week.

We gathered batch queue logs from four production high performance computing sites with different capacities and batch queue management systems. Figure ?? lists the five clusters we studied at those sites. From each log, we collected all the jobs that finished and their requested number of processors, requested running time, submission time, and user id (used only for the user fair share computation). We also obtained the start time and finish time of each job to compute the real job run time. Since most sites don't publish the details of their queuing policy and it can change from day to day, we generated three policies that favor large jobs (FL), small jobs (FS), or jobs that stay in the queue the longest (FCFS). These policies are modified from real site policies, which all have a cap value on the resource component of the priority. For example, the FL policy does not assign a higher priority to a large job beyond a certain size. Each policy has a queue wait time component which does not have a cap value, to avoid starvation. The FCFS policy has a particularly large weight on the wait time component.

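The capped resource component and the uncapped wait component can be expressed in a few lines; the weights and the cap value below are placeholders chosen for illustration, not the values used in our experiments.

def policy_priority(procs, req_hours, wait_hours,
                    w_resource=1.0, w_wait=1.0, resource_cap=1000.0):
    """Illustrative priority with a capped resource component (as in the FL/FS
    policies) and an uncapped wait component to avoid starvation."""
    resource_term = min(procs * req_hours, resource_cap)
    return w_resource * resource_term + w_wait * wait_hours

# Beyond the cap a larger job gains no extra priority, but waiting always helps.
print(policy_priority(512, 10, wait_hours=0))   # capped at 1000.0
print(policy_priority(64, 10, wait_hours=48))   # 640.0 + 48.0
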
Cluster    Institution                    Batch   Length
Lonestar   Texas Adv. Computing Center    LSF     12 Mon.
Ada        Rice University                Maui    12 Mon.
LeMieux    Pittsburgh SuperComp. Center   Custom  12 Mon.
RTC        Rice University                Maui    12 Mon.
Star       University of Arkansas         Moab    10 Mon.

Fig. 8. The Clusters

Figure ?? shows our experiment settings. Since the batch queue loads and the number of jobs in the queue fluctuate widely, the results of our algorithms depend highly on the time we simulate the submission. Therefore, we run each experimental configuration combination starting at 100 random times during the batch queue log's available time and report the mean results. In total, we ran over 700,000 experiments.

• Algorithms = {individual, giant, hybrid}
• Workflow Applications = {EMAN, Montage, BLAST, FFT, Gaussian}
• DAGs = {25 for each workflow application}
• Batch Queue Logs = {Lonestar, Ada, LeMieux, RTC, Star}
• Batch Queue Policies = {FL, FS, FCFS}

Fig. 9. The Experiment Settings

C. Result Analysis

Fig. 10. Overall Average Wait Time (seconds): (a) FCFS Policy Results, (b) FL Policy Results, (c) FS Policy Results

Figure ?? shows the average wait time of all workflow applications on five clusters. All but one of the differences between averages are statistically significant on a two-tailed paired t-test with the p-value set at 0.05. We can see that our hybrid scheduling and submission method consistently has the least average wait time among the three execution methods. The single exception is on cluster Ada with the queuing policy that favors large jobs, and that is the only statistical tie. In addition, our results indicate that although the batch queue policy determines each batch queue job's priority, it does not affect our experiment significantly. However, the average wait time from each cluster varies greatly. For example, the average application wait time on the Lonestar cluster is only a fraction of that on the other four clusters. Furthermore, while the individual submission method waits significantly more time on the Ada and LeMieux clusters than the giant method, it waits much less time than the giant method on the RTC and Star clusters.

Since we ran the same set of experiments on each cluster, we hypothesized that the differences in the outcomes were the result of each cluster's unique combination of configuration and usage pattern. Therefore, we further analyzed the characteristics of each cluster's batch queue jobs. We calculated the number of jobs submitted each day, the processors a job requests, the time a job runs, the CPU hours a job requests, and the actual load and the requested load of the system over the duration of each log file. The actual load is calculated by dividing the total CPU hours used by the cluster's maximum capacity, and the requested load is calculated by using the total CPU hours requested.

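In code, the two load figures amount to the ratios below; the tuple layout is hypothetical, but the arithmetic follows the definition just given.

def loads(jobs, cluster_procs, log_hours):
    """jobs: iterable of (procs, hours_run, hours_requested).
    Returns (actual load, requested load) as fractions of the cluster's
    total CPU hours over the log period."""
    capacity = cluster_procs * log_hours
    used = sum(p * run for p, run, _ in jobs)
    requested = sum(p * req for p, _, req in jobs)
    return used / capacity, requested / capacity

# A toy 100-processor cluster observed for 10 hours.
print(loads([(50, 4, 8), (20, 10, 12)], cluster_procs=100, log_hours=10))  # (0.4, 0.64)
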
Figure ?? presents each cluster's configuration and our calculations. The results clearly show that each cluster has its own unique usage pattern, and we can use this to explain the variance in our experiment results. For example, the Lonestar cluster has the largest computing capacity among the five clusters. This explains why the average wait time of workflow applications on Lonestar is much less than on the other clusters: it is much easier for the Lonestar cluster to fulfill the resource demand of the same workflow application than for the other clusters. The batch queue usage pattern can also affect the execution results in more subtle ways.

Figure ?? shows that the Ada cluster users tend to submit small jobs, both in terms of processors and CPU hours. However, Ada's actual load is not particularly light, and it has a large number of jobs submitted each day. This explains why the giant method is more effective on Ada than the individual method when the queue policy favors large jobs (see Figure ??). It is because the giant placeholder job would usually be the job with the highest priority in the queue and thus could start early. On the other hand, the individual job submission is less effective not only because the queue policy favors large jobs but also because, since most jobs in the queue are small, there are fewer opportunities to schedule an individual job by backfilling. However, Figure ?? does not show a very clear picture of why the giant method still performs relatively well when the policy favors small jobs (although the difference is much less). Figure ?? depicts more clearly the effect of the queue policy on the outcome for each method. We calculated the average of the relative wait time in Figure ?? by dividing each application's wait time by its running time before we computed the mean. In this way, we give each workflow's wait time an equal weight in the final result. Now, we can see that the giant method actually performs worse when the queue policy favors small jobs in terms of relative wait time. Nevertheless, our hybrid method performs the best in terms of relative wait time under all three queue policies, since it uses feedback from the batch queue scheduler.

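The relative wait time used in that comparison is just the per-workflow ratio of wait time to running time, averaged afterwards so that every workflow contributes equally regardless of its length. A minimal version:

def mean_relative_wait(waits, run_times):
    """Average of per-workflow wait/run ratios (each workflow weighted equally)."""
    ratios = [w / r for w, r in zip(waits, run_times)]
    return sum(ratios) / len(ratios)

# A short workflow that waited as long as it ran, and a long one that waited
# comparatively little.
print(mean_relative_wait([2.0, 5.0], [2.0, 50.0]))  # (1.0 + 0.1) / 2 = 0.55
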
We can also deduce from Figure ?? that the users of the Star cluster request long run times but not as many processors. In addition, we notice that the average requested load on Star is almost five times more than the actual load, the highest among all the clusters we tested. This means the Star users tend to request many more CPU hours than they actually use. This can partially explain why the individual submission method works well on Star, since the system reserves resources for the next highest priority job by basing its start time on the running jobs' requested times. When a job finishes early, it creates a backfill window, so Star would have many backfill opportunities based on its usage pattern. Small jobs, as generated by the individual method, are more likely to be able to use these backfill slots. However, this does not explain why the giant method works better under a queue policy that favors small jobs on the Star cluster.

Cluster    Cluster Size   Mean Jobs per Day   Mean Job Width   Mean Job Run Time   Mean Job Request Size   Actual Load   Requested Load
Lonestar   5000 cores     932                 26.18 cores      3.03 hours          274 hours               0.81          2.13
Ada        520 cores      1342                3.57 cores       3.57 hours          25 hours                0.81          2.76
LeMieux    2048 cores     251                 43.80 cores      3.30 hours          329 hours               0.91          1.68
RTC        270 cores      108                 2.43 cores       13.69 hours         112 hours               0.57          1.87
Star       1200 cores     108                 13.16 cores      16.93 hours         1050 hours              0.83          3.94

Fig. 11. Cluster Configuration and Batch Queue Job Characteristics

Fig. 12. The Effect of Queue Policy on Ada (average relative wait time under the FL, FS, and FCFS policies)
Fig. 13. The CPU Hour Usage (average resource usage, in CPU hours, per cluster)

We computed the average resource usage for our workflow applications on the clusters with the FS queue policy. The resource usage for a workflow application is the sum of the actual running times of all placeholder jobs submitted into the queue. The wait time is not included. Figure ?? shows that the giant submission method uses almost three times more resources than the individual method, while our hybrid submission method uses 10-20% less than the giant method. In both the hybrid and giant methods, the additional CPU usage is mainly due to resources allocated to the placeholder according to the level with the maximum parallelism but not used on the other levels. On the Star cluster, we can see that the average giant placeholder job uses less than 600 CPU hours, while Figure ?? shows that the average job on Star requests over 1000 CPU hours. This means the giant jobs are actually small compared to other jobs' requests (although, again referring to Figure ??, not to their actual run times). This explains why all the execution methods work better under the queue policy that favors small jobs on the Star cluster. At the same time, we can see that the idle processor overhead for both the giant and hybrid methods can be substantial. Despite the large job size and inaccurate job requests on the Star cluster, our hybrid method again has the lowest mean wait time.

Figure ?? also explains the giant method's ineffectiveness on the small RTC cluster. When virtual reservations in the giant method request more than 128 processors (which about 30% of the total workflows do), they take more than half the cluster. Even when the queue policy favors large jobs, such a job cannot run until all the already running jobs on RTC finish. Figure ?? presents the average wait time of the workflows that require less (small DAG) or more (large DAG) than 128 processors on the RTC cluster. It shows that the giant method indeed suffers the most when a single workflow application requires too much of the entire cluster. The same would be true for placeholders generated by the hybrid method, but the estimated wait times prevent our scheduler from generating such pathologies. As a result, our hybrid method outperforms both the giant and individual methods under any policy on RTC when the DAGs are small, and it virtually submits all the big DAGs in individual mode.

Figure ?? shows why the hybrid method performs the best on the LeMieux cluster. We can see that the LeMieux cluster's ratio of requested load to actual load is the lowest, which means that users do a good job in estimating their jobs' running times. That greatly improves the accuracy of the batch queue start time estimation and in turn reduces the opportunities for individual jobs to be backfilled.

Fig. 14. The Average Wait Time of Small DAGs on RTC Cluster
Fig. 15. The Average Wait Time for Different Applications

In short, the individual method has no leverage to schedule its small tasks. On the other end of the spectrum, the accurate wait time estimation helps the hybrid method avoid submitting large requests that would endure long waits, as the giant method is prone to do. As a result, we see a better advantage for the hybrid method on LeMieux than on any other cluster.

The type of workflow application can also affect the performance of the execution methods. Figure ?? shows the average wait time of the five workflows we tested, averaged across all the clusters under the FL policy. While the giant method is best for Gaussian elimination, it is worst for the other four applications. The difference lies in the application configuration, as shown in Figure ??. The Gaussian elimination workflow has the most levels relative to the number of tasks among our test cases. For example, EMAN and Montage both have a constant number of levels, and FFT grows logarithmically to a total of 20 levels in our tests, while the longest Gaussian DAG has over 100 levels. Since the tasks in the individual submission method have to wait for the previous level of tasks to finish before they can be submitted into the queue, there are more stalls for the Gaussian workflow than for the other applications. Another reason is that the maximum parallelism for a Gaussian placeholder is 55, while other applications have up to 256 in our experiment settings. As we saw in Figure ??, the giant method performs better than the individual method when a DAG's maximum parallelism is small relative to the cluster size. The giant method's results on the RTC cluster alone increase the average wait time for all the applications but the Gaussian workflow. Again, we see that our hybrid algorithm consistently has the least wait time for all the workflow applications we tested.

V. Related Work

Brevik et al. [?] provided an upper bound prediction of the queue wait time for an individual job. They used a binomial model and historical traces of job wait times in the queue to produce a prediction for a user-specified quantile at a given confidence level, without knowing the exact queuing policy of the resource. We use the estimate provided by the system itself, but in principle we could use any predictor.

There are several techniques for a user to reserve resources in a batch queue system without using the system's advance reservation function. Condor glide-in [?] is used to create Condor [?] pools in a remote resource. Nurmi et al. [?] implemented probability-based reservations for batch-scheduled resources. The basic idea is to use their wait time prediction [?] to choose when to submit a job so that it runs at a given time. Walker et al. [?] developed an infrastructure that submits and manages job proxies across several clusters. A user can create a virtual login session that would in turn submit the user's jobs through a proxy manager to a remote computing cluster. Kee et al. [?] developed a virtual grid system that allows a user to specify a number of resource reservations. Our work is inspired by these techniques to get a personal cluster from a batch queue controlled resource for each aggregation of tasks in the workflow application.

Limited research has been done on scheduling a workflow application on batch queue controlled resources. Nurmi et al. [?] took into account the queue wait time when each individual task in a workflow application is scheduled. Singh et al. [?] demonstrated the effectiveness of clustering a workflow application using the Montage [?] application. Our approach builds on top of their ideas by dynamically choosing the clustering for the workflow, whereas they use static mappings.

VI. Conclusions and Future Work

In this paper, we presented an algorithm that clusters a workflow application into aggregations and submits each one when the previous aggregation begins to run in the batch queue. The aggregation granularity is computed so that it minimizes the total wait time experienced by the workflow, by overlapping most of the wait time and running time between the aggregations. By using system-provided estimates of the current queue wait time, we were able to substantially improve turnaround time over the standard strategies of submitting many small jobs or a single large job. The results that we collected from running over half a million experiments using logs from five production HPC resources showed that our hybrid execution method consistently results in less overall wait time in the batch queue. We were able to accomplish this without modifying the site policies or software.

Not every batch queue resource management software provides earliest job start time estimation yet, so in the future we would like to integrate this feature into open source systems. Moreover, we believe that providing support for workflow DAGs directly in the batch queue software would be a valuable service to users, particularly when coupled with intelligent scheduling techniques such as those we have presented.

Acknowledgments

This material is based on work supported by the National Science Foundation under Cooperative Agreement No. CCR-0331645 (the VGrADS Project). This work was supported in part by the Shared University Grid at Rice funded by NSF under Grant EIA-0216467, and a partnership between Rice University, Sun Microsystems, and Sigma Solutions, Inc. We would also like to thank Roger Moye from Rice University; Jeff Pummill, Dr. Amy Apon, and Wesley Emeneker from the University of Arkansas; Rich Raymond and Chad Vizino from Pittsburgh Supercomputing Center; and Warren Smith from Texas Advanced Computing Center for providing the batch queue log data.

References

[1] Brett Bode, David M. Halstead, Ricky Kendall, Zhou Lei, and David Jackson. The portable batch scheduler and the Maui scheduler on Linux clusters. In ALS '00: Proceedings of the 4th Annual Linux Showcase & Conference, pages 27–27, Berkeley, CA, USA, 2000. USENIX Association.
[2] John Brevik, Daniel Nurmi, and Rich Wolski. Predicting bounds on queuing delay for batch-scheduled parallel machines. In PPoPP '06: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 110–118, New York, NY, USA, 2006. ACM.
[3] Cluster Resources, Inc. http://clusterresources.com/.
[4] Ewa Deelman, Scott Callaghan, Edward Field, Hunter Francoeur, Robert Graves, Nitin Gupta, Vipin Gupta, Thomas H. Jordan, Carl Kesselman, Philip Maechling, John Mehringer, Gaurang Mehta, David Okaya, Karan Vahi, and Li Zhao. Managing large-scale workflow execution from resource provisioning to provenance tracking: The CyberShake example. In E-SCIENCE '06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, page 14, Washington, DC, USA, 2006. IEEE Computer Society.
[5] James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, and Steven Tuecke. Condor-G: A computation management agent for multi-institutional grids. Cluster Computing, 5(3):237–246, 2002.
[6] Bill Howe, Peter Lawson, Renee Bellinger, Erik Anderson, Emanuele Santos, Juliana Freire, Carlos Scheidegger, António Baptista, and Cláudio Silva. End-to-end eScience: Integrating workflow, query, visualization, and provenance at an ocean observatory. In ESCIENCE '08: Proceedings of the 2008 Fourth IEEE International Conference on eScience, pages 127–134, Washington, DC, USA, 2008. IEEE Computer Society.
[7] David B. Jackson, Quinn Snell, and Mark J. Clement. Core algorithms of the Maui scheduler. In JSSPP '01: Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing, pages 87–102, London, UK, 2001. Springer-Verlag.
[8] Gideon Juve and Ewa Deelman. Resource provisioning options for large-scale scientific workflows. eScience, IEEE International Conference on, 0:608–613, 2008.
[9] Yang-Suk Kee, C. Kesselman, D. Nurmi, and R. Wolski. Enabling personal clusters on demand for batch resources using commodity software. In Parallel and Distributed Processing, 2008 (IPDPS 2008), IEEE International Symposium on, pages 1–7, April 2008.
[10] Arun Krishnan. GridBLAST: A Globus-based high-throughput implementation of BLAST in a grid computing framework: Research articles. Concurrency and Computation: Practice and Experience, 17(13):1607–1623, 2005.
[11] Load Sharing Facility (LSF). http://www.platform.com/.
[12] S. Ludtke, P. Baldwin, and W. Chiu. EMAN: Semiautomated software for high resolution single-particle reconstructions. Journal of Structural Biology, pages 82–97, 1999.
[13] W. Gentzsch (Sun Microsystems). Sun Grid Engine: Towards creating a compute power grid. In CCGRID '01: Proceedings of the 1st International Symposium on Cluster Computing and the Grid, page 35, Washington, DC, USA, 2001. IEEE Computer Society.
[14] Daniel Nurmi, Anirban Mandal, John Brevik, Chuck Koelbel, Rich Wolski, and Ken Kennedy. Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 119, New York, NY, USA, 2006. ACM.
[15] Daniel Charles Nurmi, Rich Wolski, and John Brevik. VARQ: Virtual advance reservations for queues. In HPDC '08: Proceedings of the 17th International Symposium on High Performance Distributed Computing, pages 75–86, New York, NY, USA, 2008. ACM.
[16] Open PBS. http://www.openpbs.org/.
[17] Condor Research Project. http://www.cs.wisc.edu/condor.
[18] G. Singh, E. Deelman, G. Bruce Berriman, et al. Montage: A Grid enabled image mosaic service for the National Virtual Observatory. Astronomical Data Analysis Software and Systems, 2003.
[19] Gurmeet Singh, Mei-Hui Su, Karan Vahi, Ewa Deelman, Bruce Berriman, John Good, Daniel S. Katz, and Gaurang Mehta. Workflow task clustering for best effort systems with Pegasus. In MG '08: Proceedings of the 15th ACM Mardi Gras Conference, pages 1–8, New York, NY, USA, 2008. ACM.
[20] Quinn Snell, Mark J. Clement, David B. Jackson, and Chad Gregory. The performance impact of advance reservation meta-scheduling. In IPDPS '00/JSSPP '00: Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, pages 137–153, London, UK, 2000. Springer-Verlag.
[21] TeraGrid. http://www.teragrid.org/about.
[22] Texas Advanced Computing Center. http://www.tacc.utexas.edu/.
[23] E. Walker, J. P. Gardner, V. Litvin, and E. L. Turner. Creating personal adaptive clusters for managing scientific jobs in a distributed computing environment. In Challenges of Large Applications in Distributed Environments, 2006 IEEE, pages 95–103, 2006.
