Low Memory Cost Dynamic Scheduling of Large Coarse Grain Task Graphs
Michel Cosnard
LORIA - INRIA Lorraine
615, rue du Jardin Botanique
BP 101
54602 Villers-lès-Nancy, France
email: [email protected]
Abstract
Scheduling large task graphs is an important issue in parallel computing since it allows the treatment of very large problems. In this paper we tackle the following problem: how can a task graph be scheduled when it is too large to fit into memory? Our answer features the parameterized task graph
(PTG), which is a symbolic representation of the task graph.
We propose a dynamic scheduling algorithm which takes the
PTG as input and allows the generation of a generic program.
The performance of the method is studied, as well as its
limitations. We show that our algorithm finds good schedules
for coarse grain task graphs, has a very low memory cost,
and has a low computational complexity. When the average number of operations of each task is large enough, we
prove that the scheduling overhead is negligible with respect
to the makespan. The feasibility of our approach is demonstrated on several compute-intensive kernels found in numerical
scientific applications.
1. Introduction
As the computational power of distributed memory parallel computers increases, larger and larger problems can be
solved. In the task parallelism approach, computations are
allocated to processors and there is one data-driven control flow per
processor. The task graph is a model
well suited to this approach: it is a DAG
where each node is a sequential task and each edge corresponds to a dependence between tasks, mostly due to communication. A task is a set of sequential instructions that must
be executed on one processor.
In the literature a lot of work has been done on scheduling
such task graphs [5], either allowing the duplication of tasks
on the processors [19, 20] or not [11, 17]. In order to generate a parallel program, static schedulers, like Pyrros [24],
need to have the complete task graph in memory. This solution cannot be used for very large
problems because the task graph is too large to fit into memory. Furthermore, a task graph can only be built once all the parameters have been instantiated; thus, if the parameters change,
the analysis of the sequential program has to be repeated
and a new task graph has to be rebuilt. Hence, static task
graph scheduling does not allow the construction of a generic program.
In Cilk [3], the task graph is scheduled at run time. Cilk
can handle very large problems, but communication costs are not
taken into consideration. The Cilk system gives good results
for tree-like computations (min-max search, backtracking
exploration, etc.) but it has not been designed for scientific loop-nest computations. In [2, 13], run-time methods to
schedule task graphs are described that address the problem
of processor memory requirements, but these works do not
consider the memory requirement of the DAG itself. In [1], a tool, CASCH,
is presented; it generates a schedule and parallel
code for a sequential program. Nevertheless, CASCH uses
standard static scheduling algorithms and consequently has
the same drawbacks as Pyrros.
In this paper we present and study a new approach that
makes it possible to solve very large problems (on the order of a million tasks). It is a complete automatic parallelization chain
for most of the compute-intensive kernels found in numerical scientific applications. The input is an annotated Fortran-like sequential program. A tool, PlusPyr [18], generates an
intermediate program representation called the parameterized task graph (PTG) [9, 10]. The PTG is a compact program representation for DAG parallelism, and it requires only a
small amount of space to be stored because its size is independent
of the problem size. Once the parameter values are known, it
is possible to use the parameterized task graph to build the
task graph. This possibility is not considered below. We
propose a different approach which consists in building a
generic program that dynamically schedules the task graph.
This method requires the parameter values to be given at run
time. The parameterized task graph dynamic scheduler (PTGDS) takes the PTG and the parameter values as input and schedules the tasks while the program runs.
We call $g(G)$ the granularity of the task graph $G$. We
use the definition given by Gerasoulis and Yang in [15]: denoting by $\tau(T_j)$ the duration of task $T_j$ and by $c_{j,x}$ the cost of the communication from $T_j$ to $T_x$,
$$g(T_x) = \min\left\{ \frac{\min_{T_j \in \mathrm{PRED}(T_x)} \tau(T_j)}{\max_{T_j \in \mathrm{PRED}(T_x)} c_{j,x}},\ \frac{\min_{T_j \in \mathrm{SUCC}(T_x)} \tau(T_j)}{\max_{T_j \in \mathrm{SUCC}(T_x)} c_{x,j}} \right\},\qquad g(G) = \min_{T_x} g(T_x).$$
A task graph is said to be coarse grain if $g(G) \geq 1$, i.e.
if all the communication costs of any task are smaller
than the task durations.
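For an explicit DAG this granularity can be computed directly; the sketch below assumes hypothetical dictionary inputs (duration[t] for tau(t), comm[(u, v)] for c_{u,v}, preds[t] and succs[t] for the predecessor and successor lists) and is only meant to make the definition operational.

def granularity(duration, comm, preds, succs):
    def grain(t):
        g = float("inf")
        if preds[t]:
            # smallest predecessor duration over largest incoming communication cost
            g = min(g, min(duration[p] for p in preds[t]) /
                       max(comm[(p, t)] for p in preds[t]))
        if succs[t]:
            # same ratio on the successor side
            g = min(g, min(duration[s] for s in succs[t]) /
                       max(comm[(t, s)] for s in succs[t]))
        return g
    # g(G) is the minimum grain over all tasks; the graph is coarse grain when it is >= 1
    return min(grain(t) for t in duration)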
The execution model for task graphs is called macro data
flow. Each task first receives all the data it needs, computes
without interruption, and then sends the results to its successors. We do not allow task duplication, because duplication
results in an increase of the memory required by the scheduler; this is incompatible with one of our goals, which is to
have a low memory cost algorithm.
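This execution model reduces to three steps per task; the following one-function sketch (receive_all, compute and send are placeholder callbacks, not part of the actual runtime) summarizes it.

def run_task(task, predecessors, successors, receive_all, compute, send):
    inputs = receive_all(predecessors)   # first receive all the needed data
    results = compute(task, inputs)      # then compute without interruption
    for succ in successors:              # finally send the results to the successors
        send(results, succ)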
1  schedule(task T)
2    for each task T' in father(T) do
3      if not(allocated(T')) then schedule(T');
4    endfor
5    allocate T to the processor that minimizes its starting time;
6    for each task T' in father(T) do
7      if T is the last son of T' to be scheduled then
8        remove from memory all information on T'
9      endif
10   endfor
11   allocated(T)=true;
Figure 1 gives the general scheme of the PTGDS algorithm. PTGDS starts from a node which is topologically a descendant of all the other tasks (the output task). It recursively explores the DAG and, for each task T, schedules all the fathers of T before allocating T to a processor. We also add a source node (the input task) that is a predecessor of all the nodes in the DAG. The input task is the only source node in the DAG and is always scheduled at time 0.
Lines 7 and 8 are justified as follows: each time a task T
is scheduled we set allocated(T)=true. Thus, we need
a data structure (in our case an AVL tree [16]) to store all
the values of allocated(T). When all the sons of a task T' have
been scheduled, the test on line 3 will never be performed
again for T'. Then, we can remove from the AVL tree all the information about T'. Hence, lines 7 and 8 allow a major reduction of the memory used during the execution of the algorithm.
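The recursion of Figure 1, together with the memory reclamation of lines 7 and 8, can be sketched as follows (a Python illustration rather than the actual implementation: fathers, sons and best_processor are assumed helpers, and an ordinary dictionary stands in for the AVL tree of [16]).

allocated = {}          # task -> processor; an entry is removed once line 3 can no longer query it
unscheduled_sons = {}   # task -> number of sons not yet scheduled

def schedule(t, fathers, sons, best_processor):
    for f in fathers(t):                      # lines 2-4: schedule every father first
        if f not in allocated:
            schedule(f, fathers, sons, best_processor)
    allocated[t] = best_processor(t)          # line 5: allocate t where it starts earliest
    unscheduled_sons[t] = len(sons(t))
    for f in fathers(t):                      # lines 6-10: t may be the last son of some fathers
        unscheduled_sons[f] -= 1
        if unscheduled_sons[f] == 0:
            del allocated[f]                  # drop all information on f
            del unscheduled_sons[f]

For DAGs with millions of tasks the recursion would be replaced by an explicit stack, but the memory behaviour is the same: only tasks that still have unscheduled sons are kept.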
1   while true do
2     case type of message of
3       execute:
4         if all the required data are in memory
5           then execute the task; send new computed data if required (check send list);
6                if possible, execute a task with the new data computed (check exec list);
7           else store this message in exec list;
8       send:
9         if all the required data are in memory
10          then send the data;
11          else store this message in send list;
12      receive:
13        store the data in memory;
14        execute task if possible (check exec list);
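In Python-like form, the executor loop can be sketched as follows (recv, ready, run, transmit and store are assumed callbacks standing for the real communication layer; exec_list and send_list buffer the orders that cannot be served yet).

def executor(recv, ready, run, transmit, store):
    exec_list, send_list = [], []

    def retry(buffered, action):
        # serve every buffered order whose data has arrived in the meantime
        pending = []
        for order in buffered:
            if ready(order):
                action(order)
            else:
                pending.append(order)
        buffered[:] = pending

    for kind, msg in recv():               # one message at a time, forever
        if kind == "execute":
            if ready(msg):
                run(msg)                   # new results may unblock buffered orders
                retry(send_list, transmit)
                retry(exec_list, run)
            else:
                exec_list.append(msg)
        elif kind == "send":
            if ready(msg):
                transmit(msg)
            else:
                send_list.append(msg)
        elif kind == "receive":
            store(msg)                     # new data arrived: retry deferred executions
            retry(exec_list, run)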
1   schedule(task T)
2     for each task T' in father(T) do
3       if not(allocated(T')) then schedule(T');
4     endfor
5     send the order of executing T to the executor which, according to PTGDS, minimizes its starting time;
6     send, to the executors which execute the fathers of task T, the order of transmitting their data to the executor which executes T;
7     for each task T' in father(T) do
8       if T is the last son of T' to be scheduled then
9         remove from memory all information on T'
10      endif
11    endfor
12    allocated(T)=true;
Theorem 1 If the task graph is coarse grain and the average number of operations per task is large enough, then the parallel time obtained with PTGDS is the makespan plus a scheduling overhead which is negligible with respect to it.

Proof of Theorem 1 Let the supervisor time be the sum of the scheduling time and of the message handling time. The maximum amount of communication done by the supervisor is proportional to the number of edges of the DAG, since it has to send an order to the executors at most once each time two processors have to exchange data; hence there exists a constant bounding the message handling cost per edge. In [7, 8] we have shown bounds on the computational complexity of PTGDS and on the schedule it produces on an unbounded number of processors for any DAG. By the hypothesis of the theorem, the average number of operations per task dominates the per-task scheduling and message handling cost, so, using Lemma 1, the supervisor time is negligible with respect to the parallel time. Since we cannot have a superlinear speedup, the result follows.
In many cases (in particular, in all the examples of Section 6.1), the number of edges of the task graph is of the
same order as the number of nodes. The main limitation of
our approach is that dynamic scheduling is costly when dealing with fine grain task graphs. Theorem 1 gives a sufficient
condition on the average number of task operations under which the scheduling overhead is negligible with respect to the parallel time.
Figure 3. Speedup Simulation vs. Number of Processors for Several Examples and Several Matrix Sizes (two panels, n=100 and n=1000; speedup vs. number of processors for Givens, Gauss, Gauss + Backsolve, Jordan and Power (m=100, l=1)).
n=100
Program          # Tasks   Max # tasks   # Nodes    Max # nodes   # Edges    Max # edges
Gauss               5150           200     20593            199     15348            398
Givens              4952           295     24856            198     14851            494
Gauss & BS          5052           201     30501            102     20100            303
Jordan              5052           200     40394            101     15052            301
Power (m=100)      10002          9805     30011            102     39802          10100

n=1000
Program          # Tasks   Max # tasks   # Nodes    Max # nodes   # Edges    Max # edges
Gauss             501500          2000   2005993           1999   1503498           3998
Givens            499502          2995   2498506           1998   1498501           4994
Gauss & BS        500502          2001   3005001           1002   2001000           3003
Jordan            500502          2000   4003994           1001   1500502           3001
Power (m=100)      99202         98005    299111            102    396202          99200

Table 1. Memory Cost of Main Data Structures of PTGDS, for Various Matrix Sizes
Figure 4. Execution Time of PTGDS and Execution Time Simulation vs. Average Number of Task Operations (left panel: Power (m=40, l=1) with p=32; right panel: Givens with p=16; x: parallel time, o: scheduling time; time in seconds vs. A_cost).
Table 1 shows that the memory required to schedule the
task graph is only a small portion of the total memory required by the whole task graph. The differences between
the n=100 and n=1000 cases show that the required memory increases linearly, while (except for Power) the size of the
DAG increases quadratically: for Gauss, for instance, the number of nodes grows from 20593 to 2005993 when n is multiplied by 10, whereas the maximum number of nodes kept in memory only grows from 199 to 1999.
Figure 4 shows the parallel execution time and the time taken to schedule the DAG when the average number of operations of the tasks increases. For Power, as well as for the Givens algorithm, we see that once the average number of task operations is large enough, the scheduling time becomes negligible with respect to the parallel time.
7. Conclusion

8. Acknowledgements
This work is part of the European Community Eureka EuroTOPS project. We would like to thank Michel Loi for providing us with the PlusPyr software, Tao Yang of UCSB for
very helpful discussions about this paper, and the anonymous
referees for valuable comments and suggestions.

References
[1] I. Ahmad, Y.-K. Kwok, M.-Y. Wu, and W. Shu. Automatic Parallelization and Scheduling on Multiprocessors using CASCH. In ICPP'97, Aug. 1997.
[2] G. Blelloch, P. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Symposium on Parallel Algorithms and Architectures, pages 1-12, July 1995.
[3] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'95), Santa Barbara, California, July 1995.
[4] S. Chintor and R. Enbody. Performance Degradation in Large Wormhole-Routed Interprocessor Communication Networks. In Proceedings of ICPP'90, volume I, pages 424-428, 1990.
[5] P. Chretienne and C. Picouleau. Scheduling Theory and its Applications, chapter 4, Scheduling with Communication Delays: A Survey, pages 65-89. John Wiley and Sons Ltd, 1995.
[6] P. Clauss, V. Loechner, and D. K. Wilde. Deriving Formulae to Count Solutions to Parameterized Linear Systems using Ehrhart Polynomials: Applications to the Analysis of Nested-Loop Programs. Technical report, Universite
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]