
Low Memory Cost Dynamic Scheduling of Large Coarse Grain Task Graphs

Michel Cosnard
LORIA - INRIA Lorraine
615, rue du Jardin Botanique, BP 101
54602 Villers-lès-Nancy, France
email: [email protected]

Emmanuel Jeannot and Laurence Rougeot
LIP, ENS de Lyon
46, allée d'Italie
69364 Lyon cedex 07, France
email: {ejeannot,lrougeot}@ens-lyon.fr

Abstract
Scheduling large task graphs is an important issue in parallel computing since it allows the treatment of very large problems. In this paper we tackle the following problem: how to schedule a task graph when it is too large to fit into memory? Our answer features the parameterized task graph (PTG), a symbolic representation of the task graph. We propose a dynamic scheduling algorithm which takes the PTG as input and makes it possible to generate a generic program. The performance of the method is studied, as well as its limitations. We show that our algorithm finds good schedules for coarse grain task graphs, has a very low memory cost, and has a good computational complexity. When the average number of operations of each task is large enough, we prove that the scheduling overhead is negligible with respect to the makespan. The feasibility of our approach is studied on several compute-intensive kernels found in numerical scientific applications.

1. Introduction
As the computational power of distributed memory parallel computers increases, very large problems can be solved. In the task parallelism approach, computations are allocated to processors and there is one control flow per processor, which is data driven. The task graph is a model well suited to this approach. A task graph is a DAG where each node is a sequential task and where edges correspond to dependences between tasks, mostly due to communication. A task is a set of sequential instructions that must be executed on one processor.

In the literature a lot of work has been done on scheduling such task graphs [5], either allowing the duplication of tasks on the processors [19, 20] or not [11, 17]. In order to generate a parallel program, static schedulers, like Pyrros [24], need to have the complete task graph in memory. This solution cannot be implemented when dealing with very large
problems because the task graph is too large to fit into memory. Furthermore, a task graph is built once all the parameters have been instantiated. Thus, if the parameters change, the analysis of the sequential program has to be repeated and a new task graph has to be rebuilt. Hence, static task graph scheduling does not make it possible to build a generic program. In Cilk [3], the task graph is scheduled at run time. Cilk can handle very large problems, but communication cost is not taken into consideration. The Cilk system gives good results for tree-like computations (min-max search, backtrack exploration, etc.) but it has not been designed for scientific loop-nest computations. In [2, 13], run-time methods to schedule task graphs are described that address the problem of processor memory requirements, but these works do not consider the memory requirement of the DAG itself. In [1], a tool, CASCH, is presented that generates a schedule and parallel code for a sequential program. Nevertheless, CASCH uses standard static scheduling algorithms and, consequently, has the same drawbacks as Pyrros.

In this paper we present and study a new approach that makes it possible to solve very large problems (of the order of a million tasks). It is a complete automatic parallelization chain for most of the compute-intensive kernels found in numerical scientific applications. The input is an annotated Fortran-like sequential program. A tool, PlusPyr [18], generates an intermediate program representation called the parameterized task graph (PTG) [9, 10]. The PTG is a compact program representation for DAG parallelism, and it requires only a small amount of space to be stored because it is independent of the problem size. Once the parameter values are known, it is possible to use the parameterized task graph to build the task graph; this possibility is not considered below. We propose a different approach which consists in building a generic program that dynamically schedules the task graph. This method requires the parameter values to be given at run time. The Parameterized Task Graph Dynamic Scheduler (PTGDS), which has been studied in [7, 8], is a new algorithm which uses the parameterized task graph to explore the DAG and schedule the tasks. The key point is that, at any moment during the scheduling, PTGDS has only a small part of the task graph in memory. This paper focuses on the feasibility of such an approach. We study the parallelization of several compute-intensive kernels. We show that the use of the parameterized task graph makes it possible to schedule large problems. We discuss the limitations of our approach and derive conditions under which the scheduling overhead is negligible with respect to the parallel time.

This paper is organized as follows: Section 2 gives the definitions used throughout this paper. Section 3 deals with the PlusPyr software. In Section 4 we describe the dynamic approach and the PTGDS algorithm. Section 5 presents theoretical results concerning the behavior of PTGDS and of our dynamic approach. Section 6 presents the experimental results we obtained. Finally, concluding remarks are given in Section 7.

2. Definitions and Models

Throughout this paper, we will use the following general definitions.

A task graph G is an annotated directed acyclic graph (DAG) defined by a tuple G = (V, E, W, C). V is the node set, each node representing a task, and n = |V| is the number of nodes; in this paper we say node or task indifferently. E is the edge set: there is an edge from task T_i to task T_j if there is a dependence between T_i and T_j (i.e. T_j must be executed after the end of T_i), and e = |E| is the number of edges. W is the set of task weights (or task durations): w_i represents the duration of task T_i. C is the set of edge weights (or communication volumes): c_{i,j} represents the communication cost along the edge from node T_i to node T_j; it becomes 0 if the two tasks are mapped on the same processor.

We call g(G) the granularity of the task graph G. We use the definition given by Gerasoulis and Yang in [15]:

  g(G) = min over T_i in V of min( min_{T_j in PRED(T_i)} w_j / max_{T_j in PRED(T_i)} c_{j,i} , min_{T_k in SUCC(T_i)} w_k / max_{T_k in SUCC(T_i)} c_{i,k} ).

A task graph G is said to be coarse grain if g(G) >= 1, i.e. if all the communication costs of any task are smaller than the task durations.

The execution model for task graphs is called macro data flow: each task first receives all the needed data, computes without interruption, and then sends the results to its successors. We do not allow task duplication, because duplication results in an increase of the memory required by the scheduler, which is incompatible with one of our goals, namely a low memory cost algorithm.

In this paper we do not discuss issues concerning the topology of the target machine: we deal with a clique of processors. It is useful to recall that, in the wormhole routing mode, which is now widely used, the communication cost is not much affected by the inter-processor distance, except if contention is high [4]. Therefore, the following parallel computer model will be used:

- p is the number of processors (they are all identical).
- δ is the time taken by the execution of one elementary instruction. If op_i is the number of operations performed by task T_i (the computational cost of task T_i), we have w_i = op_i · δ.
- β is the startup time of a communication and τ is the transmission rate. If d_{i,j} is the number of items sent from task T_i to task T_j, then c_{i,j} = β + d_{i,j} · τ.

3. PlusPyr and the Parameterized Task Graph

We have proposed the parameterized task graph [9] as a solution for automatically deriving task graphs from sequential programs. It uses parameters which have to be instantiated in order to build the expanded task graph. It is mainly composed of generic task codes and communication rules. A generic task is a set of instructions which have to be executed sequentially.

The communication rules represent symbolic dependences between generic tasks. They have two forms: reception rules define the data received by a generic task, and emission rules describe the data sent by a generic task. They also give the symbolic value of the communication volume along the generic edge.

In this paper, the DAG which is built when values are given to the parameters will be called the expanded task graph, or simply the task graph.

PlusPyr [18], a tool built by Michel Loi, is able, given a sequential program, to derive the parameterized task graph. PlusPyr is also able, once the values of the parameters are given, to construct the expanded task graph. There are some limitations concerning the input language; see [9] for more details. The analysis performed by PlusPyr also gives the symbolic computational cost of each generic task. This analysis is based on the work done by Feautrier on integer parametric programming (see [12, 22] for an introduction and [9, 10] for more details).
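As an illustration of what a PTG encodes, the sketch below mimics, in Python, a rule-based description of a small Gaussian-elimination-like kernel with two generic tasks, a pivot task Piv(k) and an update task Upd(k, j). The task names, loop bounds and volume expressions are purely illustrative assumptions and do not follow PlusPyr's actual rule syntax; the point is that the rules stay symbolic in the parameter n, and that the expanded edges only exist once n is instantiated.

# Illustrative sketch of a parameterized task graph: generic tasks plus symbolic
# emission rules. Names and structure are assumptions for exposition only, not the
# actual PlusPyr representation.

def ptg_gauss(n):
    """Expand the (hypothetical) emission rules of a Gaussian-elimination-like kernel."""
    edges = []
    # Emission rule: Piv(k) sends the pivot column to Upd(k, j) for k < j < n,
    # with a symbolic volume of n - k items.
    for k in range(n - 1):
        for j in range(k + 1, n):
            edges.append((("Piv", k), ("Upd", k, j), n - k))
    # Emission rule: Upd(k, j) sends its updated column to the next step,
    # either Piv(k+1) (when j = k+1) or Upd(k+1, j).
    for k in range(n - 2):
        for j in range(k + 1, n):
            target = ("Piv", k + 1) if j == k + 1 else ("Upd", k + 1, j)
            edges.append((("Upd", k, j), target, n - k - 1))
    return edges

print(len(ptg_gauss(100)))   # number of expanded edges for n = 100

Storing the rules costs a few lines whatever the value of n, while the expanded graph returned by ptg_gauss(n) grows quadratically with n; this is the gap the PTG exploits.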

4. Dynamic Execution of a Parameterized Task Graph

In the literature, dynamic execution policies are mainly used in two cases: (1) dealing with nondeterministic programs [3], and (2) load balancing and process migration [21, 23]. Here we use dynamic execution for the following reasons: (1) to obtain a generic parallel program for each sequential program: we give the parameter values at run time and then schedule the tasks during the execution, using the parameterized task graph because it is problem-size independent; (2) to handle very large problems: it is impossible to hold a very large task graph in memory, and the parameterized task graph makes it possible to build, at run time, only the small part of the DAG needed to perform the task mapping optimization.

4.1. PTGDS, the Scheduling Algorithm

PTGDS schedules the tasks: it determines on which processor and at what time each task has to be executed. For each task T, once the parameter values are known:

- We use the code analysis performed by PlusPyr to determine the duration of T.
- We use the parameterized task graph to determine the set of children of T, the set of its parents, and the communication volumes between T and its parents or children. Most of the time, in the PTG, the set of sons/fathers of T is described by a polyhedron. In the program, in order to enumerate all the sons/fathers, we generate a loop nest that scans all the points of this polyhedron, using a tool called enum, developed at the Université de Rennes [14], which transforms a parameterized polyhedron into a loop nest. The number of items sent between two tasks is also described by a polyhedron: in the program we generate a parameterized polynomial that counts the number of points of this polyhedron, i.e. the number of items, using the polylib from the Université de Strasbourg [6].

Figure 1 gives the general scheme of the PTGDS algorithm. PTGDS starts from a node which is topologically a descendant of all the other tasks (the output task). It recursively explores the DAG and, for each task T, schedules all the fathers of T before allocating T to a processor. We also add a source node (the input task) that is a predecessor of all the nodes in the DAG; the input task is the only source node in the DAG and is always the first task scheduled, at time 0.

Lines 7 and 8 are justified as follows: each time a task T is scheduled, we set allocated(T) = true. Thus, we need a data structure (in our case an AVL tree [16]) to store all the values of allocated(T). When all the sons of a task T' have been scheduled, the test on line 3 will never be performed again for T', so we can remove from the AVL tree all the information about T'. Hence, lines 7 and 8 allow a major reduction of the memory used during the execution of the algorithm.

1  schedule(task T)
2    for each task T' in father(T) do
3      if not(allocated(T')) then schedule(T');
4    endfor
5    allocate T to the processor that minimizes its starting time;
6    for each task T' in father(T) do
7      if T is the last son of T' to be scheduled then
8        remove from memory all information on T'
9      endif
10   endfor
11   allocated(T) = true;

Figure 1. The PTGDS Algorithm
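For readers who prefer an executable form, here is a minimal sketch of the same scheme in Python. It is not the PTGDS implementation: fathers(t), nb_sons(t), duration(t), comm(f, t) and the earliest-start processor-selection rule are placeholder assumptions standing in for the PTG queries and the cost model. The point is the recursive exploration and the removal of a parent's bookkeeping as soon as its last son has been scheduled, which is what keeps only a small part of the DAG in memory.

# Minimal sketch of a PTGDS-like scheduler (illustrative, not the actual PTGDS code).
# fathers(t), nb_sons(t), duration(t) and comm(f, t) stand for queries that the real
# algorithm answers from the parameterized task graph and from the cost model.

def ptgds_sketch(output_task, fathers, nb_sons, duration, comm, p):
    finish = [0.0] * p   # time at which each processor becomes free
    placed = {}          # task -> (processor, start time); plays the role of the AVL tree
    sons_done = {}       # number of sons of a task already scheduled

    def schedule(t):
        fs = list(fathers(t))                      # parents of t, given by the PTG rules
        for f in fs:                               # lines 2-4 of Figure 1: fathers first
            if f not in placed:
                schedule(f)
        best_q, best_start = 0, float("inf")
        for q in range(p):                         # line 5: pick the earliest start time
            data_ready = max([0.0] + [placed[f][1] + duration(f)
                                      + (0.0 if placed[f][0] == q else comm(f, t))
                                      for f in fs])
            start = max(data_ready, finish[q])
            if start < best_start:
                best_q, best_start = q, start
        placed[t] = (best_q, best_start)
        finish[best_q] = best_start + duration(t)
        for f in fs:                               # lines 6-10: forget a father once its
            sons_done[f] = sons_done.get(f, 0) + 1 # last son has been scheduled
            if sons_done[f] == nb_sons(f):
                del placed[f]
                del sons_done[f]
        return placed[t]

    return schedule(output_task)

At any point of the recursion, placed only holds tasks that still have unscheduled sons, which mirrors the Max # tasks column of Table 1.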


4.2. Combining Scheduling and Dynamic Execution

The target code is an SPMD generic program which uses a supervisor/executor protocol. One supervisor executes PTGDS and sends orders to the processors; there is one executor per processor. We assume that the supervisor is executed on the host of the parallel machine. Each executor receives messages from the supervisor. These messages command either the execution of a task or the sending of data to another processor. For instance, when a task T is assigned to processor P, the supervisor orders the executors to send to P the data needed for the computation of T. Incoming messages are handled independently by each executor. The execution is dynamic, since the supervisor sends orders to the executors while scheduling the task graph. Such a complete dynamic execution of the parameterized task graph will be called PTGDE (Parameterized Task Graph Dynamic Execution). Figure 2 gives the code of the supervisor and of one executor. All the executors have the same code: the sequential generic tasks and the communication protocol.

Each executor manages two lists. The execution messages that cannot be processed because the data are not yet present in the local processor memory are stored in exec_list; this list is updated when new data arrive or new data are computed. The send messages that cannot be processed because the data have not yet been computed by the processor are stored in send_list; this list is updated when new data are computed.

Supervisor:

schedule(task T)
  for each task T' in father(T) do
    if not(allocated(T')) then schedule(T');
  endfor
  send the order of executing T to the executor which, according to PTGDS, minimizes its starting time;
  send, to the executors which execute the fathers of T, the order of transmitting their data to the executor which executes T;
  for each task T' in father(T) do
    if T is the last son of T' to be scheduled then
      remove from memory all information on T'
    endif
  endfor
  allocated(T) = true;

Executor:

while true do
  case message type of
    execute:
      if all the required data are in memory
        then execute the task; send the newly computed data if required (check send_list);
             if possible, execute a task with the newly computed data (check exec_list);
        else store this message in exec_list;
    send:
      if all the required data are in memory
        then send the data;
        else store this message in send_list;
    receive:
      store the data in memory;
      execute a task if possible (check exec_list);

Figure 2. PTGDE code: the Supervisor (top) and an Executor (bottom)
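The executor side can also be pictured as the following event loop, a simplified sketch: recv_message(), run_task() and send_data() are placeholders for the real message-passing layer (which is not shown in the paper), and the message fields below are assumptions made for the example. An "execute" message is assumed to carry the task and the set of data items it needs; a "send" message carries the item to forward and its destination.

# Simplified sketch of one executor (illustrative only).
def executor(recv_message, run_task, send_data):
    memory = set()      # data items currently in local memory
    exec_list = []      # execution orders waiting for missing data
    send_list = []      # send orders waiting for data not yet computed

    def flush_pending():
        # retry stored orders whenever new data become available
        for order in exec_list[:]:
            if order["needs"] <= memory:
                exec_list.remove(order)
                run_and_update(order)
        for order in send_list[:]:
            if order["item"] in memory:
                send_list.remove(order)
                send_data(order["item"], order["dest"])

    def run_and_update(order):
        memory.update(run_task(order["task"]))   # the task produces new data items
        flush_pending()

    while True:
        msg = recv_message()
        if msg["type"] == "execute":
            if msg["needs"] <= memory:
                run_and_update(msg)
            else:
                exec_list.append(msg)
        elif msg["type"] == "send":
            if msg["item"] in memory:
                send_data(msg["item"], msg["dest"])
            else:
                send_list.append(msg)
        elif msg["type"] == "receive":
            memory.add(msg["item"])
            flush_pending()

The two pending lists play the roles of exec_list and send_list described above: orders are never refused, only deferred until the data they depend on have arrived or been produced.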

5. Minimizing the Impact of the Scheduling Overhead

In this section we show that it is possible to derive a simple bound on the average number of operations of the tasks that guarantees an efficient execution on a parallel computer.

In [7, 8] we have bounded the computational complexity of PTGDS and shown that, on an unbounded number of processors, for any DAG G:

  T_opt(∞) ≤ T_PTGDS(∞) ≤ (1 + 1/g(G)) · T_opt(∞),

where T_PTGDS(∞) is the parallel time found by PTGDS and T_opt(∞) is the optimal parallel time.

Definition 1 Let A_cost denote the average number of operations of the tasks of G, i.e. A_cost = (1/n) · Σ_{T_i ∈ V} op_i. Let T_par(G, p) be the parallel execution time of the program obtained from the parameterized task graph G scheduled by PTGDS on a p-processor parallel computer, and let T_sched(G, p) be the time taken by PTGDS on the supervisor to schedule G on p processors.

Lemma 1 Let T_PTGDE(G, p) be the total execution time of the dynamic scheme (PTGDE). Then T_par(G, p) ≤ T_PTGDE(G, p) ≤ T_sched(G, p) + T_par(G, p).


Theorem 1 If e = O(n) and the average number of operations per task, A_cost, is large enough, then T_PTGDE(G, p) = T_par(G, p) + ε, where ε is negligible in comparison with T_par(G, p): the scheduling overhead is negligible with respect to the parallel time.

Proof of Lemma 1: Due to the scheduling overhead, the time taken by the executors to finish is at least the parallel time. Moreover, since PTGDS is executed on the supervisor while the parallel code is executed simultaneously on the executors, the total dynamic execution time is lower than the sequentialization of the supervisor time plus the executors time.

Proof of Theorem 1: Let T_sched(G, p) be the supervisor time: T_sched(G, p) = schedule time + message time. The amount of communication performed by the supervisor is proportional to e, since it sends a bounded number of orders each time two processors have to exchange data. Since e = O(n), there exists a constant K such that

  T_sched(G, p) ≤ K · n.    (1)

The total work of the program is n · δ · A_cost, so that

  T_par(G, p) ≥ n · δ · A_cost / p.    (2)

When A_cost is large enough, K · n is negligible with respect to n · δ · A_cost / p; hence, by (1) and (2), T_sched(G, p) is negligible with respect to T_par(G, p). Replacing in Lemma 1, T_PTGDE(G, p) ≤ T_sched(G, p) + T_par(G, p) = T_par(G, p) + ε, and since we cannot have a superlinear speedup, T_par(G, p) ≤ T_PTGDE(G, p), which gives the result.

In many cases (in particular, in all the examples of Section 6.1), the number of edges of the task graph is of the same order as the number of nodes. The main limitation of our approach is that dynamic scheduling is costly when dealing with fine grain task graphs; Theorem 1 gives a sufficient condition on the average number of task operations for the scheduling overhead to be negligible with respect to the parallel time.
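As a back-of-the-envelope illustration of the comparison formalized above, the following toy computation contrasts the supervisor bound (1) with the work-based lower bound (2) on the parallel time. Every constant below is invented for the example; these are not the measured values of Section 6.

# Toy numerical check of the overhead argument; all constants here are made up.
n = 10**6        # number of tasks
p = 32           # number of processors
K = 50e-6        # assumed per-task supervisor cost, seconds (so T_sched <= K * n)
delta = 1e-7     # assumed time per elementary operation, seconds

t_sched_bound = K * n                        # upper bound on the supervisor time
for a_cost in (10**3, 10**5, 10**7):         # average number of operations per task
    t_par_lower = n * delta * a_cost / p     # total work / p, a lower bound on T_par
    print(a_cost, t_sched_bound / t_par_lower)
# The ratio drops from about 16 to about 0.0016: once the tasks are coarse enough,
# the scheduling overhead becomes negligible next to the parallel time, which is
# what Theorem 1 expresses.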

[Figure 3: two panels (n = 100 and n = 1000) plotting Speedup versus # processors for Givens (o), Gauss (x), Gauss + Backsolve (+), Jordan (*) and Power (m = 100, l = 1).]

Figure 3. Speedup Simulation vs. Number of Processors for Several Examples and Several Matrix Sizes

6. Results and Experiments

6.1. Examples of programs

We present here the set of example programs used in our experiments. Extensive tests on the parallelization of these numerical kernels have been carried out.

Gaussian elimination: this program is composed of two generic tasks, one for the computation of the pivot column and one for the update of the submatrix.

Gaussian elimination and backward substitution: this program has two generic tasks, one for the ijk Gaussian elimination and a second one for solving the system by backward substitution. It shows that we can easily analyze the concatenation of two programs.

The Givens algorithm: a simple program with only one generic task. The main loop is composed of many instructions: it shows that complex computational costs can be handled.

Jordan diagonalization: the Jordan method is decomposed into two generic tasks, one to diagonalize the matrix and the other to solve the system. In the two generic tasks, for-loops are sequentialized.

Power of a matrix: we compute the m-th power of a matrix of order n. In this example, three parameters (n, m and l) are used. Hence, the computational cost and the granularity can be easily tuned.

For all these examples, experiments have been carried out to validate our approach: speedup simulations, memory cost measurements, and a comparison of the scheduling overhead with the parallel time. All the tests were done using values of the communication startup β, the transmission rate τ per double and the time δ for one operation on a double measured on Sparc 5 workstations linked by Ethernet under PVM.

6.2. Speedup Simulations

In Figure 3, the speedup simulations show that, when the matrix size is large, PTGDS has no difficulty in using all the processors.

6.3. Memory Cost

In Table 1 we study the memory cost of PTGDS for several task graphs. We instantiated the parameter values and ran PTGDS on the expanded graph. The column # Tasks is the number of tasks in the DAG. PTGDS optimizes the number of tasks kept in memory: the list of tasks needed to allocate other tasks is handled by an AVL tree, and column Max # tasks is the maximum size of the AVL tree during the execution. The parameterized task graph is a description of the edges of the task graph, and sometimes an edge of the task graph is generated by several rules. Hence, when we use the communication rules to explore the DAG, the program generates more edges and tasks than there are in the DAG. Our program ensures the correctness of the schedule by checking for duplicated edges and tasks; however, it is necessary to use all the rules in order to count the amount of data sent between tasks. Column # Nodes gives the number of tasks generated by the recursive DAG exploration and column Max # nodes the maximum number of such tasks in memory during the execution. Column # Edges is the number of edges generated by the recursive DAG exploration and column Max # edges the maximum number of edges in memory during the execution. Edges and tasks are removed from memory when the corresponding recursive step is finished. These three data structures are the only ones which vary with the parameter values.

Program          # Tasks   Max # tasks    # Nodes   Max # nodes    # Edges   Max # edges

n = 100
Gauss               5150           200      20593           199      15348           398
Givens              4952           295      24856           198      14851           494
Gauss & BS          5052           201      30501           102      20100           303
Jordan              5052           200      40394           101      15052           301
Power (m=100)      10002          9805      30011           102      39802         10100

n = 1000
Gauss             501500          2000    2005993          1999    1503498          3998
Givens            499502          2995    2498506          1998    1498501          4994
Gauss & BS        500502          2001    3005001          1002    2001000          3003
Jordan            500502          2000    4003994          1001    1500502          3001
Power (m=100)      99202         98005     299111           102     396202         99200

Table 1. Memory Cost of the Main Data Structures of PTGDS, for Various Matrix Sizes
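A quick way to read Table 1 is to compare the peak number of tasks kept in memory (Max # tasks) with the total number of tasks. The short computation below does this for the n = 1000 rows, with the numbers copied from the table:

# Fraction of the tasks resident at the memory peak, n = 1000 rows of Table 1.
rows = {
    "Gauss":         (501500, 2000),
    "Givens":        (499502, 2995),
    "Gauss & BS":    (500502, 2001),
    "Jordan":        (500502, 2000),
    "Power (m=100)": (99202, 98005),
}
for program, (tasks, max_resident) in rows.items():
    print(f"{program:15s} {100.0 * max_resident / tasks:7.3f} % resident at peak")
# The elimination-like kernels keep well under 1% of their tasks in memory at any
# time; Power is the exception, with almost all of its tasks resident.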
[Figure 4: two panels, Power (m = 40, l = 1) on p = 32 processors and Givens on p = 16 processors, plotting time in seconds versus A_cost for the simulated parallel time (x) and the scheduling time (o).]

Figure 4. Execution Time of PTGDS and Execution Time Simulation vs. Average Number of Task Operations

Table 1 shows that the memory required to schedule the task graph is only a small portion of the total memory required by the whole task graph. The differences between n = 100 and n = 1000 show that the required memory increases linearly, while (except for Power) the size of the DAG increases quadratically.

6.4. Comparison Between Scheduling Overhead and Execution Time

In this section we show that, according to Theorem 1, the amount of time taken by PTGDS to schedule the parameterized task graph is negligible when the average number of task operations is high. In Figure 4, the simulated execution times of Power and Givens on machines such as defined in this section (p = 32 for Power, p = 16 for Givens) are compared to the time taken to schedule the DAG as the average number of operations of the tasks increases. For both Power and Givens we observe that, once A_cost is large enough, the scheduling time becomes small with respect to the simulated parallel time.

7. Conclusion

In this paper we have presented a scheme for a complete automatic parallelization chain for a class of programs. This work is based on the PlusPyr tool, which builds the parameterized task graph. We have conducted experiments on various programs and have given the requirements for this approach to be valid. Theoretical results show that PTGDS finds good schedules on an unbounded number of processors for coarse grain task graphs, and has a competitive computational complexity. We have also given a theoretical bound on the average number of operations a task should perform if we want the scheduling overhead to be negligible in comparison to the parallel time. Our experiments show that (1) PTGDS finds good schedules on a fixed number of processors, (2) the memory cost of the algorithm is low, so we can handle very large task graphs, and (3) for some programs with large enough tasks, the PTGDS execution time is small with regard to the makespan. Thus, it is possible to schedule such programs dynamically.

In conclusion, we are implementing parameterized task graph dynamic scheduling in order to obtain generic programs and to deal with large task graphs. This method appears to be efficient on coarse grain task graphs, when the average number of operations of the tasks is high, as is the case, for instance, for block algorithms.

In our future work we plan to realize the code generator, and to study a new method to schedule parameterized task graphs statically.

8. Acknowledgements

This work is part of the European Community Eureka EuroTOPS project. We would like to thank Michel Loi for providing us with the PlusPyr software, Tao Yang of UCSB for very helpful discussions about this paper, and the anonymous referees for their valuable comments and suggestions.

References

[1] I. Ahmad, Y.-K. Kwok, M.-Y. Wu, and W. Shu. Automatic Parallelization and Scheduling on Multiprocessors using CASCH. In ICPP'97, Aug. 1997.
[2] G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 1-12, July 1995.
[3] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'95), Santa Barbara, California, July 1995.
[4] S. Chintor and R. Enbody. Performance Degradation in Large Wormhole-Routed Interprocessor Communication Networks. In Proceedings of ICPP'90, volume I, pages 424-428, 1990.
[5] P. Chrétienne and C. Picouleau. Scheduling Theory and its Applications, chapter 4, Scheduling with Communication Delays: A Survey, pages 65-89. John Wiley and Sons Ltd, 1995.
[6] P. Clauss, V. Loechner, and D. K. Wilde. Deriving Formulae to Count Solutions to Parameterized Linear Systems using Ehrhart Polynomials: Applications to the Analysis of Nested-Loop Programs. Research Report RR 97-05, Université de Strasbourg, April 1997. http://icps.u-strasbg.fr/pub-97/pub-97-05.ps.gz.
[7] M. Cosnard and E. Jeannot. Automatic Coarse-Grained Parallelization Techniques. In Grandinetti and Kowalik, editors, NATO Workshop: Advances in High Performance Computing. Kluwer Academic Publishers, 1997.
[8] M. Cosnard and E. Jeannot. Building and Scheduling Coarse Grain Task Graphs. Technical Report RR97-03, Laboratoire de l'Informatique du Parallélisme, École Normale Supérieure de Lyon, France, Feb. 1997.
[9] M. Cosnard and M. Loi. Automatic Task Graph Generation Techniques. Parallel Processing Letters, 5(4):527-538, 1995.
[10] M. Cosnard and M. Loi. A Simple Algorithm for the Generation of Efficient Loop Structures. International Journal of Parallel Programming, 24(3):265-289, June 1996.
[11] H. El-Rewini, T. Lewis, and H. Ali. Task Scheduling in Parallel and Distributed Systems. Prentice Hall, 1994.
[12] P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23-53, 1991.
[13] C. Fu and T. Yang. Space and time efficient execution of parallel irregular computations. In Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'97), Las Vegas, June 1997.
[14] M. L. Fur. Compilation de boucles dirigée par la distribution des données. PhD thesis, Université de Rennes I, July 1995.
[15] A. Gerasoulis and T. Yang. On the Granularity and Clustering of Directed Acyclic Task Graphs. IEEE Transactions on Parallel and Distributed Systems, 4(6):686-701, June 1993.
[16] E. Horowitz, S. Sahni, and S. Anderson-Freed. Fundamentals of Data Structures in C. W. H. Freeman and Company, New York, 1993.
[17] Y.-K. Kwok and I. Ahmad. Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 7(5):506-521, May 1996.
[18] M. Loi. Construction et exécution de graphes de tâches acycliques à gros grain. PhD thesis, École Normale Supérieure de Lyon, France, 1996.
[19] M. Palis, J.-C. Liou, and D. Wei. Task Clustering and Scheduling for Distributed Memory Parallel Architectures. IEEE Transactions on Parallel and Distributed Systems, 7(1):46-55, Jan. 1996.
[20] C. Papadimitriou and M. Yannakakis. Toward an Architecture-Independent Analysis of Parallel Algorithms. SIAM Journal on Computing, 19(2):322-328, 1990.
[21] C. Perez. Load Balancing HPF Programs by Migrating Virtual Processors. In Second International Workshop on High-Level Programming Models and Supportive Environments, HIPS'97. IEEE Computer Society Press, Apr. 1997.
[22] A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1986.
[23] N. Shivarati, P. Krueger, and M. Singhal. Load Distributing for Locally Distributed Systems. Computer, 25(12):33-44, Dec. 1992.
[24] T. Yang and A. Gerasoulis. Pyrros: Static Task Scheduling and Code Generation for Message Passing Multiprocessors. In Supercomputing '92, pages 428-437, Washington D.C., July 1992. ACM.
