Low Memory Cost Dynamic Scheduling of Large Coarse Grain Task Graphs
Michel Cosnard
LORIA - INRIA Lorraine
615, rue du Jardin Botanique
BP 101
54602 Villers-lès-Nancy, France
email: [email protected]
Abstract
Scheduling large task graphs is an important issue in parallel computing since it allows the treatment of very large problems. In this paper we tackle the following problem: how can a task graph be scheduled when it is too large to fit into memory? Our answer features the parameterized task graph
(PTG), which is a symbolic representation of the task graph.
We propose a dynamic scheduling algorithm which takes the
PTG as input and allows the generation of a generic program.
The performance of the method is studied, as well as its
limitations. We show that our algorithm finds good schedules
for coarse grain task graphs, has a very low memory cost,
and has a low computational complexity. When the average number of operations of each task is large enough, we
prove that the scheduling overhead is negligible with respect
to the makespan. The feasibility of our approach is demonstrated on several compute-intensive kernels found in numerical
scientific applications.
1. Introduction
As the computational power of distributed memory parallel computers increases, larger and larger problems can be
solved. In the task parallelism approach, computations are
allocated to processors and there is one data-driven control flow per
processor. The task graph is a model
well suited to this approach: it is a DAG
where each node is a sequential task and each edge corresponds to a dependence between tasks, mostly due to communication. A task is a set of sequential instructions that must
be executed on one processor.
In the literature a lot of work has been done on scheduling
such task graphs [5], either allowing the duplication of tasks
on the processors [19, 20] or not [11, 17]. In order to generate a parallel program, static schedulers, like Pyrros [24],
need to have the complete task graph in memory. This solution cannot be used for very large
problems because the task graph is too large to fit into memory. Furthermore, a task graph can only be built once all the parameters have been instantiated; thus, if the parameters change,
the analysis of the sequential program has to be repeated
and a new task graph has to be rebuilt. Hence, static task
graph scheduling does not allow the construction of a generic program.
In Cilk [3], the task graph is scheduled at run time. Cilk
can handle very large problems, but communication costs are not
taken into consideration. The Cilk system gives good results
for tree-like computations (min-max search, backtracking
exploration, etc.) but it has not been designed for scientific loop-nest computations. In [2, 13], run-time methods to
schedule task graphs are described that address the problem
of processor memory requirements, but these works do not
consider the memory requirement of the DAG itself. In [1], a tool, CASCH,
is presented; it generates a schedule and parallel
code for a sequential program. Nevertheless, CASCH uses
standard static scheduling algorithms and consequently has
the same drawbacks as Pyrros.
In this paper we present and study a new approach that
makes it possible to solve very large problems (on the order of a million tasks). It is a complete automatic parallelization chain
for most of the compute-intensive kernels found in numerical scientific applications. The input is an annotated Fortran-like sequential program. A tool, PlusPyr [18], generates an
intermediate program representation called the parameterized task graph (PTG) [9, 10]. The PTG is a compact program representation for DAG parallelism, and it requires only a
small amount of space to be stored because its size is independent
of the problem size. Once the parameter values are known, it
is possible to use the parameterized task graph to build the
task graph. This possibility is not considered below. We
propose a different approach which consists in building a
generic program that dynamically schedules the task graph.
This method requires the parameter values to be given at run
time. The parameterized task graph dynamic scheduler (PTGDS) takes the PTG and the parameter values as input and schedules the tasks while the program runs.
We call $g(G)$ the granularity of the task graph $G$. We
use the definition given by Gerasoulis and Yang in [15]: denoting by $\tau(T_j)$ the duration of task $T_j$ and by $c_{j,x}$ the cost of the communication from $T_j$ to $T_x$,
$$g(T_x) = \min\left\{ \frac{\min_{T_j \in \mathrm{PRED}(T_x)} \tau(T_j)}{\max_{T_j \in \mathrm{PRED}(T_x)} c_{j,x}},\ \frac{\min_{T_j \in \mathrm{SUCC}(T_x)} \tau(T_j)}{\max_{T_j \in \mathrm{SUCC}(T_x)} c_{x,j}} \right\},\qquad g(G) = \min_{T_x} g(T_x).$$
A task graph is said to be coarse grain if $g(G) \geq 1$, i.e.
if all the communication costs of any task are smaller
than the task durations.
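For an explicit DAG this granularity can be computed directly; the sketch below assumes hypothetical dictionary inputs (duration[t] for tau(t), comm[(u, v)] for c_{u,v}, preds[t] and succs[t] for the predecessor and successor lists) and is only meant to make the definition operational.

def granularity(duration, comm, preds, succs):
    def grain(t):
        g = float("inf")
        if preds[t]:
            # smallest predecessor duration over largest incoming communication cost
            g = min(g, min(duration[p] for p in preds[t]) /
                       max(comm[(p, t)] for p in preds[t]))
        if succs[t]:
            # same ratio on the successor side
            g = min(g, min(duration[s] for s in succs[t]) /
                       max(comm[(t, s)] for s in succs[t]))
        return g
    # g(G) is the minimum grain over all tasks; the graph is coarse grain when it is >= 1
    return min(grain(t) for t in duration)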
The execution model for task graphs is called macro data
flow. Each task first receives all the data it needs, computes
without interruption, and then sends the results to its successors. We do not allow task duplication, because duplication
results in an increase of the memory required by the scheduler; this is incompatible with one of our goals, which is to
have a low memory cost algorithm.
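This execution model reduces to three steps per task; the following one-function sketch (receive_all, compute and send are placeholder callbacks, not part of the actual runtime) summarizes it.

def run_task(task, predecessors, successors, receive_all, compute, send):
    inputs = receive_all(predecessors)   # first receive all the needed data
    results = compute(task, inputs)      # then compute without interruption
    for succ in successors:              # finally send the results to the successors
        send(results, succ)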
1  schedule(task T)
2    for each task T' in father(T) do
3      if not(allocated(T')) then schedule(T');
4    endfor
5    allocate T to the processor that minimizes its starting time;
6    for each task T' in father(T) do
7      if T is the last son of T' to be scheduled then
8        remove from memory all information on T'
9      endif
10   endfor
11   allocated(T)=true;
Figure 1 gives the general scheme of the PTGDS algorithm. PTGDS starts from a node which is topologically a descendant of all the other tasks (the output task). It recursively explores the DAG and, for each task T, schedules all the fathers of T before allocating T to a processor. We also add a source node (the input task) that is a predecessor of all the nodes in the DAG. The input task is the only source node in the DAG and is always scheduled at time 0.
Lines 7 and 8 are justified as follows: each time a task T
is scheduled we set allocated(T)=true. Thus, we need
a data structure (in our case an AVL tree [16]) to store all
the values of allocated(T). When all the sons of a task T' have
been scheduled, the test on line 3 will never be performed
again for T'. Then, we can remove from the AVL tree all the information about T'. Hence, lines 7 and 8 allow a major reduction of the memory used during the execution of the algorithm.
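The recursion of Figure 1, together with the memory reclamation of lines 7 and 8, can be sketched as follows (a Python illustration rather than the actual implementation: fathers, sons and best_processor are assumed helpers, and an ordinary dictionary stands in for the AVL tree of [16]).

allocated = {}          # task -> processor; an entry is removed once line 3 can no longer query it
unscheduled_sons = {}   # task -> number of sons not yet scheduled

def schedule(t, fathers, sons, best_processor):
    for f in fathers(t):                      # lines 2-4: schedule every father first
        if f not in allocated:
            schedule(f, fathers, sons, best_processor)
    allocated[t] = best_processor(t)          # line 5: allocate t where it starts earliest
    unscheduled_sons[t] = len(sons(t))
    for f in fathers(t):                      # lines 6-10: t may be the last son of some fathers
        unscheduled_sons[f] -= 1
        if unscheduled_sons[f] == 0:
            del allocated[f]                  # drop all information on f
            del unscheduled_sons[f]

For DAGs with millions of tasks the recursion would be replaced by an explicit stack, but the memory behaviour is the same: only tasks that still have unscheduled sons are kept.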
1   while true do
2     case type of message of
3       execute:
4         if all the required data are in memory
5           then execute the task; send new computed data if required (check send list);
6                if possible, execute a task with the new data computed (check exec list);
7           else store this message in exec list;
8       send:
9         if all the required data are in memory
10          then send the data;
11          else store this message in send list;
12      receive:
13        store the data in memory;
14        execute task if possible (check exec list);
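In Python-like form, the executor loop can be sketched as follows (recv, ready, run, transmit and store are assumed callbacks standing for the real communication layer; exec_list and send_list buffer the orders that cannot be served yet).

def executor(recv, ready, run, transmit, store):
    exec_list, send_list = [], []

    def retry(buffered, action):
        # serve every buffered order whose data has arrived in the meantime
        pending = []
        for order in buffered:
            if ready(order):
                action(order)
            else:
                pending.append(order)
        buffered[:] = pending

    for kind, msg in recv():               # one message at a time, forever
        if kind == "execute":
            if ready(msg):
                run(msg)                   # new results may unblock buffered orders
                retry(send_list, transmit)
                retry(exec_list, run)
            else:
                exec_list.append(msg)
        elif kind == "send":
            if ready(msg):
                transmit(msg)
            else:
                send_list.append(msg)
        elif kind == "receive":
            store(msg)                     # new data arrived: retry deferred executions
            retry(exec_list, run)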
1   schedule(task T)
2     for each task T' in father(T) do
3       if not(allocated(T')) then schedule(T');
4     endfor
5     send the order of executing T to the executor which, according to PTGDS, minimizes its starting time;
6     send, to the executors which execute the fathers of task T, the order of transmitting their data to the executor which executes T;
7     for each task T' in father(T) do
8       if T is the last son of T' to be scheduled then
9         remove from memory all information on T'
10      endif
11    endfor
12    allocated(T)=true;
Theorem 1 If the task graph is coarse grain and the average number of operations per task is large enough, then the parallel time obtained with PTGDS is the makespan plus a scheduling overhead which is negligible with respect to it.

Proof of Theorem 1 Let the supervisor time be the sum of the scheduling time and of the message handling time. The maximum amount of communication done by the supervisor is proportional to the number of edges of the DAG, since it has to send an order to the executors at most once each time two processors have to exchange data; hence there exists a constant bounding the message handling cost per edge. In [7, 8] we have shown bounds on the computational complexity of PTGDS and on the schedule it produces on an unbounded number of processors for any DAG. By the hypothesis of the theorem, the average number of operations per task dominates the per-task scheduling and message handling cost, so, using Lemma 1, the supervisor time is negligible with respect to the parallel time. Since we cannot have a superlinear speedup, the result follows.
In many cases (in particular, in all the examples of Section 6.1), the number of edges of the task graph is of the
same order as the number of nodes. The main limitation of
our approach is that dynamic scheduling is costly when dealing with fine grain task graphs. Theorem 1 gives a sufficient
condition on the average number of task operations under which the scheduling overhead is negligible with respect to the parallel time.
Figure 3. Speedup Simulation vs. Number of Processors for Several Examples and Several Matrix Sizes (two panels, n=100 and n=1000; speedup vs. number of processors for Givens, Gauss, Gauss + Backsolve, Jordan and Power (m=100, l=1)).
n=100
Program          # Tasks   Max # tasks   # Nodes    Max # nodes   # Edges    Max # edges
Gauss               5150           200     20593            199     15348            398
Givens              4952           295     24856            198     14851            494
Gauss & BS          5052           201     30501            102     20100            303
Jordan              5052           200     40394            101     15052            301
Power (m=100)      10002          9805     30011            102     39802          10100

n=1000
Program          # Tasks   Max # tasks   # Nodes    Max # nodes   # Edges    Max # edges
Gauss             501500          2000   2005993           1999   1503498           3998
Givens            499502          2995   2498506           1998   1498501           4994
Gauss & BS        500502          2001   3005001           1002   2001000           3003
Jordan            500502          2000   4003994           1001   1500502           3001
Power (m=100)      99202         98005    299111            102    396202          99200

Table 1. Memory Cost of Main Data Structures of PTGDS, for Various Matrix Sizes
Figure 4. Execution Time of PTGDS and Execution Time Simulation vs. Average Number of Task Operations (left panel: Power (m=40, l=1) with p=32; right panel: Givens with p=16; x: parallel time, o: scheduling time; time in seconds vs. A_cost).
Table 1 shows that the memory required to schedule the
task graph is only a small portion of the total memory required by the whole task graph. The differences between
the n=100 and n=1000 cases show that the required memory increases linearly, while (except for Power) the size of the
DAG increases quadratically: for Gauss, for instance, the number of nodes grows from 20593 to 2005993 when n is multiplied by 10, whereas the maximum number of nodes kept in memory only grows from 199 to 1999.
Figure 4 shows the parallel execution time and the time taken to schedule the DAG when the average number of operations of the tasks increases. For Power, as well as for the Givens algorithm, we see that once the average number of task operations is large enough, the scheduling time becomes negligible with respect to the parallel time.
7. Conclusion

8. Acknowledgements
This work is part of the European Community Eureka EuroTOPS project. We would like to thank Michel Loi for providing us with the PlusPyr software, Tao Yang of UCSB for
very helpful discussions about this paper, and the anonymous
referees for valuable comments and suggestions.

References
[1] I. Ahmad, Y.-K. Kwok, M.-Y. Wu, and W. Shu. Automatic Parallelization and Scheduling on Multiprocessors using CASCH. In ICPP'97, Aug. 1997.
[2] G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 1-12, July 1995.
[3] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'95), Santa Barbara, California, July 1995.
[4] S. Chintor and R. Enbody. Performance Degradation in Large Wormhole-Routed Interprocessor Communication Networks. In Proceedings of ICPP'90, volume I, pages 424-428, 1990.
[5] P. Chretienne and C. Picouleau. Scheduling Theory and its Applications, chapter 4, Scheduling with Communication Delays: A Survey, pages 65-89. John Wiley and Sons Ltd, 1995.
[6] P. Clauss, V. Loechner, and D. K. Wilde. Deriving Formulae to Count Solutions to Parameterized Linear Systems using Ehrhart Polynomials: Applications to the Analysis of Nested-Loop Programs. Technical report, Universite
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]