Batch Queue Resource Scheduling For Workflow Applications
communications). Figure ?? shows an example. The left side of the figure is the original DAG that represents a workflow application. The right side of the figure is a clustered version of the same DAG, in which we aggregate all the tasks at the same level into one aggregation. Our goal is to choose a clustering algorithm that reduces the total batch queue wait time. The main idea behind our approach is that we aggregate the workflow by level and submit a placeholder job for the later levels before their predecessors finish. In this way, we can overlap the running time of the predecessor level with the wait time of the successor levels.
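Level-based clustering itself is simple to compute. The following is a minimal Python sketch (our own illustration, not code from the system) of the levelize step assumed in Figure ??: it labels each task with the length of the longest path from the DAG's roots and then groups tasks by that label.

    import collections

    def levelize(succs):
        """Group the tasks of a DAG by level (longest path from any root).

        succs: dict mapping each task to the list of its successor tasks.
        Returns a dict mapping level number -> list of tasks at that level.
        """
        indeg = collections.defaultdict(int)
        for task, children in succs.items():
            indeg.setdefault(task, 0)
            for child in children:
                indeg[child] += 1

        level = {t: 0 for t, d in indeg.items() if d == 0}   # roots sit at level 0
        ready = collections.deque(level)
        while ready:
            t = ready.popleft()
            for child in succs.get(t, ()):
                level[child] = max(level.get(child, 0), level[t] + 1)
                indeg[child] -= 1
                if indeg[child] == 0:                        # all predecessors seen
                    ready.append(child)

        by_level = collections.defaultdict(list)
        for task, lvl in level.items():
            by_level[lvl].append(task)
        return dict(by_level)

    # Example: a diamond-shaped DAG 1 -> {2, 3} -> 4.
    print(levelize({1: [2, 3], 2: [4], 3: [4], 4: []}))      # {0: [1], 1: [2, 3], 2: [4]}

Each returned level is a candidate aggregation; the scheduling question addressed below is how many consecutive levels to pack into one placeholder job.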
Figure ?? illustrates this idea. The left shows a possible result of grouping the workflow DAG in figure ?? into two aggregations and submitting them in turn. The yellow rectangles represent the wait time of the two placeholder jobs in the queue. A placeholder job, represented by a rectangle that contains one or more levels of tasks, is submitted into the queue as soon as its predecessor placeholder job starts. It asks for enough resources for the tasks it holds to run in full parallelism. The wait time seen by the user for the clustering on the left is the dark yellow area marked "real wait time". We can see from the figure that it is less than the queue wait time for the second job because of the overlap with task 1's execution. Ideally, if the first placeholder job gets to run immediately and the later jobs' wait times do not exceed their predecessors' run times, the queue wait time for the entire workflow application is eliminated, as shown on the right side of figure ??. However, this perfect overlap cannot be guaranteed. Furthermore, if the wait time for a placeholder job is less than its predecessor's run time (as is the case for task 10), it must pad its requested time to honor its dependences. In turn, this padding affects the placeholder job's own wait time. Balancing these effects requires heuristic scheduling.
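The arithmetic behind the overlap is worth stating explicitly. The numbers below are purely illustrative (they are not the measured values behind the figure); the point is that the wait a user actually experiences for a placeholder is only the part of its queue wait that its predecessor's execution does not hide.

    def exposed_wait(queue_wait, predecessor_run):
        """Wait time still visible to the user after overlapping a
        placeholder's queue wait with its predecessor's run time."""
        return max(0.0, queue_wait - predecessor_run)

    # A 3-hour queue wait overlapped with a 2-hour predecessor run leaves
    # only 1 hour of exposed ("real") wait time.
    print(exposed_wait(3.0, 2.0))   # 1.0
    # If the wait is shorter than the predecessor's run time, nothing is
    # exposed, but the placeholder must pad its request to cover the gap.
    print(exposed_wait(1.5, 2.0))   # 0.0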
Our algorithm consists of two interrelated parts: an application manager, shown in Figure ??, and a "peeling" procedure, shown in Figure ??. The application manager is responsible for launching the workflow application and monitoring its progress. In general, it chooses partial DAGs and submits placeholder jobs to the batch queue system. Individual workflow tasks execute in the placeholder jobs when those jobs come to the front of the queue, with the application manager enforcing their dependences. The peeling procedure selects the partial DAGs to minimize the exposed waiting time. We now consider the two parts in turn.

Figure ?? shows the application manager. After selecting and submitting the initial partial DAG (lines 1-5), the manager becomes an event-driven system, sketched below. The primary events that it responds to are:
• A placeholder job starts to run (lines 7-16). The manager first starts all the workflow tasks associated with that job whose predecessors have finished. Then it invokes the peeling procedure to form the next placeholder job and submits it to the queue.
• A placeholder job finishes running (lines 17-25). Normally, no processing is needed. However, if the placeholder is terminated before all of its tasks complete (e.g., because some predecessors were delayed in the batch queue), the manager must clean up. It cancels any placeholders that have not started, since some of their predecessors may be delayed. It also calls the peeling procedure to reschedule the unfinished DAG nodes (both interrupted tasks and those not yet run) and submits the new placeholder job into the queue.
• A DAG task finishes (lines 26-32). The manager starts all the successor tasks whose placeholder job is already running. One subtlety in the application manager is that the successors of a DAG node may be in the same placeholder or in a different one. In the latter case, the manager must handle the possibility that a placeholder starts without any runnable tasks (lines 28-30). If all of a placeholder's tasks are finished, the manager finishes the job to free the batch queue resources.
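The Python skeleton below is a hedged sketch of that event loop. The queue object, the event kinds, and the peel_level callback are our own stand-ins for the interfaces implied by Figure ??; the structure mirrors the pseudocode but is not the original implementation.

    def run_dag(dag, queue, peel_level, now=0.0):
        """Event-driven application manager (sketch in the spirit of Fig. ??)."""
        first = peel_level(dag.unmapped_tasks(), submit_time=now, earliest_start=0.0)
        queue.submit(first)
        while not dag.finished():
            event = queue.next_event()                     # blocks until something happens
            if event.kind == "placeholder_started":
                # Start every task in this placeholder whose predecessors are done.
                for task in event.job.tasks():
                    if dag.predecessors_done(task):
                        event.job.start(task)
                # Submit the next placeholder now, so its queue wait overlaps
                # with this placeholder's run time.
                nxt = peel_level(dag.unmapped_tasks(), submit_time=event.time,
                                 earliest_start=event.job.run_time)
                if nxt is not None:
                    queue.submit(nxt)
            elif event.kind == "placeholder_finished":
                if event.job.unfinished_tasks():
                    # Ran out of requested time: cancel pending placeholders and
                    # reschedule everything that has not finished.
                    leftover = list(event.job.unfinished_tasks())
                    for pending in queue.pending_placeholders():
                        queue.cancel(pending)
                        leftover.extend(pending.tasks())
                    queue.submit(peel_level(leftover, submit_time=event.time,
                                            earliest_start=0.0))
            elif event.kind == "task_finished":
                for succ in dag.successors(event.task):
                    if succ.placeholder.running and dag.predecessors_done(succ):
                        succ.placeholder.start(succ)
                if not event.task.placeholder.remaining_tasks():
                    event.task.placeholder.finish()        # free the queue resources early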
We choose to submit a new placeholder job only after its predecessor begins running. There are several reasons for this design. In our experience with real queues, we discovered that multiple outstanding jobs in the queue interfered with each other. In turn, this often caused the wait time for already-submitted jobs to lengthen, which both added overhead and invalidated our existing schedules. Therefore, we did not have a good estimate of the later placeholders' start times. Although our current design misses the potential of overlapping two placeholder jobs' wait times with each other or with running jobs, we can calculate the earliest start time of all the remaining tasks. This is one key to the aggregation decision described in Figure ??.

Algorithm: runDAG(DAG dag, int sub_time)
1  task[] partial_dag ← levelize(dag);
2  int count ← 0;
3  Placeholder job ← peelLevel(partial_dag, sub_time, 0);
4  job.name ← count;
5  submit job;
6  while (dag is not finished)
7    listen to batch queue and task events;
8    if (placeholder job n starts to run at time t)
9      for all (task in job n.getTasks())
10       if (all task predecessors have finished)
11         start task;
12     ear_finTime ← job n.runTime;
13     partial_dag ← levelize(dag.unmappedTasks());
14     job ← peelLevel(partial_dag, t, ear_finTime);
15     job.name ← ++count;
16     submit job;
17   else if (placeholder job n finishes running at time t)
18     if (job n has unfinished tasks)
19       partial_dag ← levelize(job n.unfinishedTasks());
20       for all (pending placeholder job job m)
21         cancel job m;
22         add job m.tasks() to partial_dag;
23       Placeholder jobResub ← peelLevel(partial_dag, t, 0);
24       map all tasks in the partial_dag to jobResub;
25       submit jobResub;
26   else if (task dagTask finishes running at time t)
27     delete dagTask from its placeholder job;
28     for all (dagTask's successor tasks chd_task)
29       if (chd_task's associated placeholder job is running)
30         start chd_task;
31     if (dagTask's placeholder job has no more tasks to run)
32       stop dagTask's placeholder job;

Fig. 3. The DAG Application Manager
Algorithm: peelLevel(levelized DAG, int sub_time, int ear_time)
1  int runTime_all, waitTime_all;
2  int peel_runTime[2], peel_waitTime[2];
3  runTime_all ← est_runTime(DAG);
4  waitTime_all ← est_waitTime(runTime_all, DAG.width, sub_time);
5  peel_runTime[0] ← runTime_all;
6  peel_waitTime[0] ← waitTime_all;
7  int level ← groupLevel(DAG, sub_time, ear_time,
8                          peel_runTime, peel_waitTime);
9  if (level == DAG.height)
10   if (runTime_all * 2 < waitTime_all)
11     submit the remaining DAG in individual mode
12   else
13     return the whole remaining DAG in one batch queue job
14 else
15   group the levels into a partial_dag;
16   map each DAG task to the batch queue job;
17   return the partial_dag in a placeholder job;

Fig. 4. The DAG Peeling Procedure
Figure ?? shows the peeling procedure used by the application manager. We refer to this process as "peeling" because it successively peels levels of the DAG off of the unfinished work list. First (lines 1-6), the main peelLevel function estimates the wait time to submit the entire DAG as a single placeholder job. It then invokes the groupLevel function (lines 7-8 and Figure ??) to search for a better alternative. If groupLevel does not improve the wait time (lines 10-13), the peeling procedure chooses to submit the DAG either as a single placeholder job or as one job per task. The decision depends on whether the total wait time as a single job exceeds twice the total run time of the DAG. The intuition for this is that individual submission can take advantage of free resources and backfill windows: when the one giant placeholder job's wait time is twice as long as its run time, individual submission has a better chance of finishing earlier. This threshold is a heuristic parameter chosen empirically. Otherwise, we use the partial DAG returned by groupLevel. The earliest-job-start estimation we use is a best-effort approach like the showstart command in Maui. However, our experience shows that it is a reliable indicator of the wait time; in one experiment, the mean difference was within 5% of the real wait time.
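A hedged sketch of that fall-back decision, in Python for concreteness; est_run and est_wait stand in for the run time and queue wait estimates that the real procedure obtains (the width attribute and the callable signatures are our own assumptions):

    def submit_mode_for_remainder(remaining_dag, est_run, est_wait, sub_time):
        """Decide how to submit the rest of the DAG when peeling finds no
        aggregation that improves the wait time (cf. lines 9-13 of Fig. ??)."""
        run_all = est_run(remaining_dag)                         # run time as one placeholder
        wait_all = est_wait(run_all, remaining_dag.width, sub_time)
        # Empirical 2x rule: if the single placeholder would wait more than
        # twice as long as it runs, individual tasks usually finish earlier
        # by slipping into backfill windows and idle processors.
        return "individual" if wait_all > 2 * run_all else "giant"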
Figure ?? shows the key groupLevel procedure. Although the logic is somewhat complex, in essence we perform a greedy search for an aggregation of the DAG that has enough granularity to hide later wait times and is itself wait-effective. We define the wait-effectiveness of a job as the ratio between its wait time and its running time; a smaller ratio is better. The intuition behind this is that we want a job either to wait less or to finish more tasks. However, we do not search for the globally best wait-effectiveness. This is because, once we group several levels of the DAG into a wait-effective aggregation, any later job's wait time can be overlapped with the run time of this aggregation. Continually adding levels onto the current aggregation negates this benefit.

Here is a more detailed explanation of the algorithm. After some initialization in lines 1-6, the main loop in lines 8-37 repeatedly moves one DAG level from the remaining work to the next placeholder job, until the aggregation is less wait-effective than it was in the previous round. For each candidate job, lines 9-18 adjust the placeholder's requested time to allow the workflow tasks to complete.
Algorithm: groupLevel(levelized DAG, int sub_time, int ear_sTime,
                      int peel_runTime[2], int peel_waitTime[2])
1  int real_runTime[2];
2  int runTime_all, waitTime_all, leeway;
3  runTime_all ← peel_runTime[0];
4  waitTime_all ← peel_waitTime[0];
5  real_runTime[0] ← peel_runTime[0];
6  partial_dag ← level one of DAG;
11 do
12   peel_runTime[1] ← peel_runTime[1] + leeway/2;
13   peel_waitTime[1] ←
14     est_waitTime(peel_runTime[1], DAG.width, sub_time);
15   leeway ← ear_sTime + real_runTime[1] - peel_waitTime[1];
16 while (leeway > 10 mins)
17 if (leeway > 0)
18   peel_runTime[1] ← peel_runTime[1] + leeway;
19 int real_WaitTime ← peel_waitTime[1] - ear_sTime;
20 if (real_WaitTime < 0)
21   real_WaitTime ← peel_waitTime[1];
22 if (giant)
23   if (real_WaitTime > real_runTime[1])
24     add one level to partial_dag;
25     continue;
26   giant ← false;
27 if (peel_waitTime[1] - ear_sTime > 0)
28   if (peel_waitTime[1] / real_runTime[1]
29       > peel_waitTime[0] / real_runTime[0])
30     break;
31 if (peel_waitTime[1] / real_runTime[1]
32     > waitTime_all / runTime_all)
33   break;
34 peel_waitTime[0] ← peel_waitTime[1];
35 peel_runTime[0] ← peel_runTime[1];
36 real_runTime[0] ← real_runTime[1];
37 add one level to partial_dag;
38 if (giant)
39   return DAG.height;
40 else
41   return partial_dag.height - 1;

Fig. 5. The Peel Level Decision Procedure

Fig. 6. Workflow Application Level Decision (illustrating peel_waitTime[0], peel_waitTime[1], leeway, and real_WaitTime for an example DAG).
As the left side of figure ?? shows, this adjustment is sometimes necessary because the (estimated) queue wait time is less than the time needed to complete the current job, creating what we term the leeway. A simple iteration adds the leeway to the job request until it is insignificant. (Of course, if the wait time is more than the time needed to execute the predecessors, then no adjustment is needed, as in the right side of figure ??.) The loop then operates in one of two modes, based on whether a good aggregation has been identified. If no aggregation has been selected, more levels are added until the real run time is significant enough to create overlap for the next aggregation (lines 19-25). Once this happens, the current candidate is marked as a viable aggregation. From then on, levels are added only while the wait-effectiveness of the aggregation continues to improve (lines 27-37).
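A hedged Python sketch of the leeway adjustment (lines 11-18 of Figure ??); est_wait stands in for the batch queue's wait time estimate, requested starts at the aggregation's estimated run time, and the ten-minute cut-off is the same threshold used in the pseudocode:

    TEN_MINUTES = 10 * 60   # seconds

    def pad_request(requested, real_run, earliest_start, width, sub_time, est_wait):
        """Grow a placeholder's requested time until its estimated queue wait
        roughly covers the time its predecessor levels still need."""
        wait = est_wait(requested, width, sub_time)
        leeway = earliest_start + real_run - wait
        while leeway > TEN_MINUTES:
            requested += leeway / 2.0                       # pad half the gap ...
            wait = est_wait(requested, width, sub_time)     # ... and re-estimate the wait,
            leeway = earliest_start + real_run - wait       #     since padding changes it
        if leeway > 0:
            requested += leeway                             # absorb the small remainder
        return requested

The half-steps guard against overshooting, since a larger request can itself lengthen the estimated wait.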
IV. Experiments

A. Experimental Methodology

To test the performance of our algorithm, we developed a prototype batch queue system simulator that implements the core algorithms of the Maui batch queue scheduler described in [?]. The input to the system is a batch queue log obtained from a production high-performance computing cluster and a batch queue policy configuration file. It simulates the batch queue execution step by step based on this input. We also implemented the job start time estimation function (the showstart command). The estimation is based on the batch queue policy and the maximum requested times of all existing queued and running jobs. It does not forecast any future job submissions. Therefore, it is a best-effort estimation within the knowledge of the batch queue scheduler.
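To make the estimator concrete, the sketch below computes such a conservative start-time estimate under a plain FCFS policy. It is our own simplified stand-in for Maui's showstart, not the actual implementation, and the job tuples are assumptions for illustration.

    import heapq

    def estimate_start(total_procs, running, queued_ahead, want_procs, now=0.0):
        """Conservative earliest-start estimate for a job needing want_procs
        processors: every job holds its processors for its full requested time,
        higher-priority queued jobs start first, and no future submissions are
        modeled.

        running:      list of (procs, time_remaining) for executing jobs
        queued_ahead: list of (procs, requested_time) for jobs ahead in the queue
        """
        free = total_procs - sum(p for p, _ in running)
        clock = now
        ends = [(now + rem, p) for p, rem in running]   # (completion_time, procs)
        heapq.heapify(ends)

        def wait_for(needed):
            nonlocal free, clock
            while free < needed:
                t, released = heapq.heappop(ends)
                clock = max(clock, t)
                free += released
            return clock

        for procs, req in queued_ahead:                 # queue drains in priority order
            start = wait_for(procs)
            free -= procs
            heapq.heappush(ends, (start + req, procs))

        return wait_for(want_procs)

Because every job is charged its full requested time, jobs that finish early make the real wait shorter than this estimate, which is one source of the backfill opportunities discussed later.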
We implemented the methods of Section ?? to submit placeholder jobs to this simulator. We also implemented the runtime algorithm from that section, using events generated by the simulator to drive the workflow management. In addition, we implemented two other ways to execute a workflow application on batch-queue-controlled resources. The first is the straightforward approach of submitting each individual task to the batch queue as soon as it is available to run, which we refer to as the individual submission method. The second is to submit a giant placeholder job that requests enough resources for the entire DAG to finish, which we refer to as the giant submission method. We compare our algorithm, which we refer to as the hybrid submission method, to the individual and giant methods by simulating the submission of the same DAG into the queue with each method under exactly the same experimental configuration.

The five workflow applications used in our experiments: (a) EMAN, (b) BLAST, (c) Montage, (d) Gaussian Elimination, (e) Fast Fourier Transform.
Fig. 10. Overall Average Wait Time. Average wait time (seconds) of the Individual, Giant, and Hybrid methods on the Lonestar, Ada, LeMieux, RTC, and Star clusters: (a) FCFS Policy Results, (b) FL Policy Results, (c) FS Policy Results.

LeMieux clusters than the giant method, it waits much less time than the giant method on the RTC and Star clusters. Since we ran the same set of experiments on each cluster, we hypothesized that the differences in outcome were the result of each cluster's unique combination of configuration and usage pattern. Therefore, we further analyzed the characteristics of each cluster's batch queue jobs. We calculated the number of jobs submitted each day, the number of processors a job requests, the time a job runs, the CPU hours a job requests, and the actual and requested load of the system over the duration of each log file. The actual load is calculated by dividing the total CPU hours actually used by the total CPU hours available over the log period; the requested load is calculated by using the total CPU hours requested.
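Written out, the two metrics are simple ratios. The sketch below reflects our reading of the definitions above; the job attributes are hypothetical names used only for illustration.

    def actual_load(jobs, cluster_cores, log_hours):
        """CPU hours actually consumed divided by CPU hours available."""
        used = sum(j.cores * j.actual_run_hours for j in jobs)
        return used / (cluster_cores * log_hours)

    def requested_load(jobs, cluster_cores, log_hours):
        """Same ratio, but charging every job its full requested wall time."""
        asked = sum(j.cores * j.requested_hours for j in jobs)
        return asked / (cluster_cores * log_hours)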
Figure ?? presents each cluster's configuration and our calculations. The results clearly show that each cluster has its own unique usage pattern, and we can use this to explain the variance in our experimental results. For example, the Lonestar cluster has the largest computing capacity among the five clusters. This explains why the average wait time of a workflow application on Lonestar is much less than on the other clusters: it is much easier for Lonestar to fulfill the resource demand of the same workflow application than it is for the other clusters. The batch queue usage pattern can also affect the execution results in more subtle ways.

Figure ?? shows that Ada cluster users tend to submit small jobs, both in terms of processors and CPU hours. However, Ada's actual load is not particularly light, and a large number of jobs are submitted each day. This explains why the giant method is more effective on Ada than the individual method when the queue policy favors large jobs (see figure ??): the giant placeholder job would usually be the highest-priority job in the queue and thus could start early. On the other hand, individual job submission is less effective not only because the queue policy favors large jobs but also because, since most jobs in the queue are small, there are fewer opportunities to schedule an individual job by backfilling. However, figure ?? does not show a very clear picture of why the giant method still performs relatively well when the policy favors small jobs. To answer this, we normalize the results in figure ?? by dividing each application's wait time by its running time before computing the mean. In this way, we give each workflow's wait time an equal weight in the final result. Now we can see that the giant method actually performs worse in terms of relative wait time when the queue policy favors small jobs. Nevertheless, our hybrid method performs best in terms of relative wait time under all three queue policies, since it uses feedback from the batch queue scheduler.
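The normalization is a mean of per-workflow ratios rather than a ratio of sums. A small illustration with made-up numbers (not measured values):

    def mean_relative_wait(runs):
        """Average of wait_time / run_time over a set of workflow runs, so a
        short workflow's wait counts as much as a long workflow's."""
        return sum(wait / run for wait, run in runs) / len(runs)

    runs = [(1200.0, 600.0), (300.0, 3000.0)]   # (wait sec, run sec), illustrative only
    print(mean_relative_wait(runs))             # (2.0 + 0.1) / 2 = 1.05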
Cluster     Cluster Size   Mean Jobs per Day   Mean Job Width   Mean Job Run Time   Mean Job Request Size   Actual Load   Requested Load
Lonestar    5000 cores     932                 26.18 cores      3.03 hours          274 hours               0.81          2.13
Ada         520 cores      1342                3.57 cores       3.57 hours          25 hours                0.81          2.76
LeMieux     2048 cores     251                 43.80 cores      3.30 hours          329 hours               0.91          1.68
RTC         270 cores      108                 2.43 cores       13.69 hours         112 hours               0.57          1.87
Star        1200 cores     108                 13.16 cores      16.93 hours         1050 hours              0.83          3.94
Fig. 12. The Effect of Queue Policy on Ada. Average relative wait time of the Individual, Giant, and Hybrid methods on the Ada cluster under the FL, FS, and FCFS policies.

Fig. 13. The CPU Hour Usage. Average resource usage (CPU hours) of the Individual, Giant, and Hybrid methods on the Lonestar, Ada, LeMieux, RTC, and Star clusters.
We can also deduce from Figure ?? that users of the Star cluster request long run times but not as many processors. In addition, we notice that the average requested load on Star is almost five times the actual load, the highest ratio among all the clusters we tested. This means that Star users tend to request many more CPU hours than they actually use. This can partially explain why the individual submission method works well on Star: the system reserves resources for the next highest-priority job by basing its start time on the running jobs' requested times, so when a job finishes early it creates a backfill window, and Star therefore has many backfill opportunities given its usage pattern. Small jobs, as generated by the individual method, are more likely to be able to use these backfill slots. However, this does not explain why the giant method works better under a queue policy that favors small jobs on the Star cluster.

We computed the average resource usage for our workflow applications on each cluster under the FS queue policy. The resource usage for a workflow application is the sum of the actual running times of all the placeholder jobs submitted into the queue; the wait time is not included. Figure ?? shows that the giant submission method uses almost three times more resources than the individual method, while our hybrid submission method uses 10-20% less than the giant method. In both the hybrid and the giant methods, the additional CPU usage is mainly due to resources that are allocated to the placeholder according to the level with the maximum parallelism but are not used by the other levels. On the Star cluster, we can see that the average giant placeholder job uses less than 600 CPU hours, while Figure ?? shows that the average job on Star requests over 1000 CPU hours. This means the giant jobs are actually small compared to other jobs' requests (although, again referring to Figure ??, not compared to their actual run times). This explains why all the execution methods work better on the Star cluster under the queue policy that favors small jobs. At the same time, we can see that the idle-processor overhead for both the giant and hybrid methods can be substantial. Despite the large job sizes and inaccurate job requests on the Star cluster, our hybrid method again has the lowest mean wait time.
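The accounting we have in mind can be sketched as follows (our own illustration; it assumes a placeholder's usage is the processors it holds multiplied by the hours it actually executes, with queue wait excluded, matching the description above):

    def workflow_cpu_hours(placeholders):
        """Resource usage of one workflow run: processors held times hours
        actually executed, summed over its placeholder jobs."""
        return sum(p.procs * p.actual_run_hours for p in placeholders)

The gap between this total and the CPU hours consumed by the tasks themselves is the idle-processor overhead noted above.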
Figure ?? also explains the giant method's ineffectiveness on the small RTC cluster. When a virtual reservation in the giant method requests more than 128 processors (which about 30% of the total workflows do), it takes more than half the cluster. Even when the queue policy favors large jobs, such a job cannot run until all the already-running jobs on RTC finish. Figure ?? presents the average wait time of the workflows that require fewer (small DAG) or more (large DAG) than 128 processors on the RTC cluster. It shows that the giant method indeed suffers the most when a single workflow application requires too much of the entire cluster. The same would be true for placeholders generated by the hybrid method, but the estimated wait times prevent our scheduler from generating such pathologies. As a result, our hybrid method outperforms both the giant and individual methods under any policy on RTC when the DAGs are small, and it effectively submits all the big DAGs in individual mode.
Fig. 14. The Average Wait Time of Small DAGs on RTC Cluster. Average wait time (seconds) of the Individual, Giant, and Hybrid methods for small and large DAGs on the RTC cluster.

Fig. 15. The Average Wait Time for Different Applications. Average wait time (seconds) of the Individual, Giant, and Hybrid methods for the EMAN, BLAST, Montage, Gaussian elimination, and FFT workflows.
Figure ?? shows why the hybrid method performs the best on the LeMieux cluster. We can see that the LeMieux cluster's ratio of requested load to actual load is the lowest, which means that its users do a good job of estimating their jobs' running times. That greatly improves the accuracy of the batch queue start time estimation and in turn reduces the opportunities for individual jobs to be backfilled. In short, the individual method has no leverage to schedule its small tasks. On the other end of the spectrum, the accurate wait time estimation helps the hybrid method avoid submitting the large requests that would endure long waits, as the giant method is prone to do. As a result, we see a greater advantage for the hybrid method on LeMieux than on any other cluster.

The type of workflow application can also affect the performance of the execution methods. Figure ?? shows the average wait time of the five workflows we tested, averaged across all the clusters under the FL policy. While the giant method is best for Gaussian elimination, it is worst for the other four applications. The difference lies in the application configuration, as shown in figure ??. The Gaussian elimination workflow has the most levels relative to its number of tasks among our test cases. For example, EMAN and Montage both have a constant number of levels, and FFT grows logarithmically to a total of 20 levels in our tests, while the longest Gaussian DAG has over 100 levels. Since the tasks in the individual submission method have to wait for the previous level of tasks to finish before they can be submitted into the queue, there are more stalls for the Gaussian workflow than for the other applications. Another reason is that the maximum parallelism for a Gaussian placeholder is 55, while other applications have up to 256 in our experiment settings. As we saw in Figure ??, the giant method performs better than the individual method when a DAG's maximum parallelism is small relative to the cluster size. The giant method's results on the RTC cluster alone increase the average wait time for all the applications but the Gaussian workflow. Again, we see that our hybrid algorithm consistently has the least wait time for every workflow application we tested.
V. Related Work

Brevik et al. [?] provided an upper-bound prediction of the queue wait time for an individual job. They used a binomial model and historical traces of job wait times in the queue to produce a prediction for a user-specified quantile at a given confidence level, without knowing the exact queuing policy of the resource. We use the estimate provided by the system itself, but in principle we could use any predictor.

There are several techniques that let a user reserve resources in a batch queue system without using the system's advance reservation function. Condor glide-in [?] is used to create Condor [?] pools on a remote resource. Nurmi et al. [?] implemented probabilistic reservations for batch-scheduled resources; the basic idea is to use their wait time prediction [?] to choose when to submit a job so that it runs at a given time. Walker et al. [?] developed an infrastructure that submits and manages job proxies across several clusters: a user can create a virtual login session that in turn submits the user's jobs through a proxy manager to a remote computing cluster. Kee et al. [?] developed a virtual grid system that allows a user to specify a number of resource reservations. Our work is inspired by these techniques, obtaining a personal cluster from a batch-queue-controlled resource for each aggregation of tasks in the workflow application.

Limited research has been done on scheduling a workflow application on batch-queue-controlled resources. Nurmi et al. [?] took the queue wait time into account when each individual task in a workflow application is scheduled. Singh et al. [?] demonstrated the effectiveness of clustering a workflow application using the Montage [?] application. Our approach builds on their ideas by choosing the clustering for the workflow dynamically, whereas they use static mappings.

VI. Conclusions and Future Work

In this paper, we presented an algorithm that clusters a workflow application into aggregations and submits each aggregation when the previous one begins to run in the batch queue.
The aggregation granularity is computed so that it minimizes the total wait time experienced by the workflow, by overlapping most of the wait time and running time between the aggregations. By using system-provided estimates of the current queue wait time, we were able to substantially improve turnaround time over the standard strategies of submitting many small jobs or a single large job. The results we collected from running over half a million experiments using logs from five production HPC resources show that our hybrid execution method consistently results in less overall wait time in the batch queue. We were able to accomplish this without modifying the site policies or software.

Not every batch queue resource management system provides an earliest-job-start-time estimate yet, so in the future we would like to integrate this feature into open source systems. Moreover, we believe that providing support for workflow DAGs directly in the batch queue software would be a valuable service to users, particularly when coupled with intelligent scheduling techniques such as those we have presented.

Acknowledgments

This material is based on work supported by the National Science Foundation under Cooperative Agreement No. CCR-0331645 (the VGrADS Project). This work was supported in part by the Shared University Grid at Rice, funded by NSF under Grant EIA-0216467, and a partnership between Rice University, Sun Microsystems, and Sigma Solutions, Inc. We would also like to thank Roger Moye from Rice University; Jeff Pummill, Dr. Amy Apon, and Wesley Emeneker from the University of Arkansas; Rich Raymond and Chad Vizino from the Pittsburgh Supercomputing Center; and Warren Smith from the Texas Advanced Computing Center for providing the batch queue log data.

References

[1] Brett Bode, David M. Halstead, Ricky Kendall, Zhou Lei, and David Jackson. The portable batch scheduler and the Maui scheduler on Linux clusters. In ALS '00: Proceedings of the 4th Annual Linux Showcase & Conference, pages 27–27, Berkeley, CA, USA, 2000. USENIX Association.
[2] John Brevik, Daniel Nurmi, and Rich Wolski. Predicting bounds on queuing delay for batch-scheduled parallel machines. In PPoPP '06: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 110–118, New York, NY, USA, 2006. ACM.
[3] Cluster Resources, Inc. http://clusterresources.com/.
[4] Ewa Deelman, Scott Callaghan, Edward Field, Hunter Francoeur, Robert Graves, Nitin Gupta, Vipin Gupta, Thomas H. Jordan, Carl Kesselman, Philip Maechling, John Mehringer, Gaurang Mehta, David Okaya, Karan Vahi, and Li Zhao. Managing large-scale workflow execution from resource provisioning to provenance tracking: The CyberShake example. In E-SCIENCE '06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, page 14, Washington, DC, USA, 2006. IEEE Computer Society.
[5] James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, and Steven Tuecke. Condor-G: A computation management agent for multi-institutional grids. Cluster Computing, 5(3):237–246, 2002.
[6] Bill Howe, Peter Lawson, Renee Bellinger, Erik Anderson, Emanuele Santos, Juliana Freire, Carlos Scheidegger, António Baptista, and Cláudio Silva. End-to-end eScience: Integrating workflow, query, visualization, and provenance at an ocean observatory. In ESCIENCE '08: Proceedings of the 2008 Fourth IEEE International Conference on eScience, pages 127–134, Washington, DC, USA, 2008. IEEE Computer Society.
[7] David B. Jackson, Quinn Snell, and Mark J. Clement. Core algorithms of the Maui scheduler. In JSSPP '01: Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing, pages 87–102, London, UK, 2001. Springer-Verlag.
[8] Gideon Juve and Ewa Deelman. Resource provisioning options for large-scale scientific workflows. eScience, IEEE International Conference on, 0:608–613, 2008.
[9] Yang-Suk Kee, C. Kesselman, D. Nurmi, and R. Wolski. Enabling personal clusters on demand for batch resources using commodity software. In Parallel and Distributed Processing, 2008 (IPDPS 2008), IEEE International Symposium on, pages 1–7, April 2008.
[10] Arun Krishnan. GridBLAST: a Globus-based high-throughput implementation of BLAST in a grid computing framework: Research articles. Concurr. Comput.: Pract. Exper., 17(13):1607–1623, 2005.
[11] Load Sharing Facility (LSF). http://www.platform.com/.
[12] S. Ludtke, P. Baldwin, and W. Chiu. EMAN: Semiautomated software for high-resolution single-particle reconstructions. J. Struct. Biol., pages 82–97, 1999.
[13] W. Gentzsch (Sun Microsystems). Sun Grid Engine: Towards creating a compute power grid. In CCGRID '01: Proceedings of the 1st International Symposium on Cluster Computing and the Grid, page 35, Washington, DC, USA, 2001. IEEE Computer Society.
[14] Daniel Nurmi, Anirban Mandal, John Brevik, Chuck Koelbel, Rich Wolski, and Ken Kennedy. Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 119, New York, NY, USA, 2006. ACM.
[15] Daniel Charles Nurmi, Rich Wolski, and John Brevik. VARQ: virtual advance reservations for queues. In HPDC '08: Proceedings of the 17th International Symposium on High Performance Distributed Computing, pages 75–86, New York, NY, USA, 2008. ACM.
[16] OpenPBS. http://www.openpbs.org/.
[17] Condor Research Project. http://www.cs.wisc.edu/condor.
[18] G. Singh, E. Deelman, G. Bruce Berriman, et al. Montage: a grid-enabled image mosaic service for the National Virtual Observatory. Astronomical Data Analysis Software and Systems, 2003.
[19] Gurmeet Singh, Mei-Hui Su, Karan Vahi, Ewa Deelman, Bruce Berriman, John Good, Daniel S. Katz, and Gaurang Mehta. Workflow task clustering for best effort systems with Pegasus. In MG '08: Proceedings of the 15th ACM Mardi Gras Conference, pages 1–8, New York, NY, USA, 2008. ACM.
[20] Quinn Snell, Mark J. Clement, David B. Jackson, and Chad Gregory. The performance impact of advance reservation meta-scheduling. In IPDPS '00/JSSPP '00: Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, pages 137–153, London, UK, 2000. Springer-Verlag.
[21] TeraGrid. http://www.teragrid.org/about.
[22] Texas Advanced Supercomputing Center. http://www.tacc.utexas.edu/.
[23] E. Walker, J.P. Gardner, V. Litvin, and E.L. Turner. Creating personal adaptive clusters for managing scientific jobs in a distributed computing environment. In Challenges of Large Applications in Distributed Environments, 2006 IEEE, pages 95–103, 2006.