Dependencies, Instruction Scheduling, Optimization, and Parallelism
Data Dependencies (1 of 3)
• True (Flow) Dependence
• A variable is written and later is read
variable = …
…
… = variable
Data Dependencies (2 of 3)
• Output Dependence
• A variable is written and later is written again
variable = …
…
variable = …
Data Dependencies (3 of 3)
• Anti-Dependence
• A variable is read and later is written
… = variable
…
variable = …
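• A minimal C sketch (not from the slides; the variable names are mine) showing all three kinds at once:

int u = 1, t, v;
t = u + 1;   /* write t                                                  */
v = t * 2;   /* read t: true (flow) dependence on the write above        */
t = v - 3;   /* write t again: output dependence with the first write,   */
             /* and anti-dependence with the read above                  */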
Loop-Level Parallelism (2 of 3)
• This fragment also contains independent iterations, but exhibits worse data
locality than the program fragment on the previous slide (a sketch of both
fragments appears below)
• In the previous program fragment, operations can be performed while the data
is still in registers
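• A sketch of the two fragments being compared (assumed shapes, chosen to be consistent with the rewritten loop on the next slide):

/* First fragment (previous slide): fused loop; Z[i] can stay in a
   register between the subtraction and the squaring */
for (i = 0; i < n; i++) {
    Z[i] = X[i] - Y[i];
    Z[i] = Z[i] * Z[i];
}

/* Second fragment (this slide): fissioned loops; Z[i] is stored by the
   first loop and re-loaded from memory by the second */
for (i = 0; i < n; i++)
    Z[i] = X[i] - Y[i];
for (i = 0; i < n; i++)
    Z[i] = Z[i] * Z[i];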
Loop-Level Parallelism (3 of 3)
• Going back to the first fragment, with M processors and with each processor
numbered p (zero origin), the previous loop can be rewritten as follows:
b = (n + M - 1)/M; /* ceil(n/M) in integer arithmetic */
for(i = b*p; i < min(n, b*(p+1)); i++) {
Z[i] = X[i] - Y[i];
Z[i] = Z[i] * Z[i];
}
• Approximately equal-size, independent iterations are created for each processor
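• A hedged sketch of driving this partitioning with POSIX threads (M, N, and the worker are my names; min is written out because it is not a standard C function):

#include <pthread.h>

#define M 4                        /* number of processors/threads (assumed) */
enum { N = 1000 };
double X[N], Y[N], Z[N];

static void *worker(void *arg) {
    int p = (int)(long)arg;        /* processor number, zero origin */
    int b = (N + M - 1) / M;       /* ceil(N/M) */
    int hi = b*(p+1) < N ? b*(p+1) : N;
    for (int i = b*p; i < hi; i++) {
        Z[i] = X[i] - Y[i];
        Z[i] = Z[i] * Z[i];
    }
    return 0;
}

int main(void) {
    pthread_t t[M];
    for (int p = 0; p < M; p++)
        pthread_create(&t[p], 0, worker, (void *)(long)p);
    for (int p = 0; p < M; p++)
        pthread_join(t[p], 0);
    return 0;
}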
FORTRAN PARALLEL DO
• FORTRAN has a PARALLEL DO statement that tells the compiler there
are no dependencies across its iterations
PARALLEL DO I = 1, N
A(I) = A(I) + B(I)
ENDDO
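• C compilers with OpenMP support accept an analogous assertion through a pragma (a sketch; OpenMP is a separate specification layered on C, not part of ISO C):

void vadd(int n, double *a, const double *b) {
    #pragma omp parallel for       /* iterations asserted independent */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}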
ISO C99 restrict
• ISO C99 has the restrict type qualifier for pointers, which tells the compiler
that the object a pointer designates is accessed only through that pointer (no
aliases)
void add(int n, int *restrict dest, int *restrict op1, int *restrict op2) {
int i;
for(i = 0; i < n; i++)
dest[i] = op1[i] + op2[i];
}
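• A usage sketch (hypothetical call sites): the qualifier is a promise made by the caller, and passing overlapping arrays breaks it:

int a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, out[4];

add(4, out, a, b);     /* fine: the three arrays do not overlap  */
add(3, &a[1], a, b);   /* undefined behavior: dest aliases op1   */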
Loop-Carried Dependence (1 of 7)
• Here is a slightly more complicated example of a loop-carried dependence:
double Z[101];
for(i = 0; i < 91; i++) {
Z[i+10] = Z[i];
}
• Dependence distances and directions can be computed for each nested-loop
iteration variable and for each statement in the loop
• For this example, the first 10 iterations can run with no dependencies
• Then, each iteration can run so long as the iteration 10 before it has
completed
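• One way to exploit this structure (a sketch, not from the slides): the iterations fall into 10 independent chains by i mod 10, since iteration i depends only on iteration i-10 in the same chain, and the chains can run concurrently:

for (int c = 0; c < 10; c++)       /* each chain could get its own thread */
    for (int i = c; i < 91; i += 10)
        Z[i+10] = Z[i];            /* order within a chain is preserved */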
Loop-Carried Dependence (4 of 7)
• For which values of x and y does x+10 equal y in the range 0 <= x, y <
91?
• An exact test would tell us if there exists a solution in the specified range
• An inexact test would tell us if there exists a solution, but not necessarily in
the specified range
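• A classic inexact test is the GCD test (a sketch; the function names are mine): the dependence equation a1*x + a2*y = c has an integer solution if and only if gcd(a1, a2) divides c, with the iteration range ignored:

/* assumes a1 and a2 are not both zero */
static int gcd(int a, int b) {
    if (a < 0) a = -a;             /* signs do not matter to the gcd */
    if (b < 0) b = -b;
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

static int may_depend(int a1, int a2, int c) {
    return c % gcd(a1, a2) == 0;   /* 1: a solution may exist somewhere */
}

• For the loop above, x + 10 = y becomes x - y = -10; gcd(1, 1) = 1 divides 10, so the test reports a possible dependence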
• Here is a similar example, with the read and the write exchanged:
double Z[101];
for(i = 0; i < 91; i++) {
Z[i] = Z[i+10];
}
• Once again, for this example, the first 10 iterations can run with no
dependencies
• Then, each iteration can run so long as the iteration 10 before it has
completed
Loop-Carried Dependence (7 of 7)
• Here is a more complicated example of a loop-carried dependence:
double A[201];
for(i = 0; i < 100; i++) {
A[2*i + 2] = A[2*i + 1];
}
• The dependence equation here is 2x + 2 = 2y + 1, i.e., 2x - 2y = -1; since
gcd(2, 2) = 2 does not divide -1, there is no integer solution and hence no
dependence
• The signs of the a terms (the coefficients) and of c (i.e., whether any of
them are negative) are irrelevant to such a test
Eager Evaluation
• Execute the code that evaluates an expression when the result is assigned
(bound) to a variable
• This is the usual evaluation methodology used in most programming
languages
• Eager evaluation is a straightforward implementation of the program
Futures/Lazy Evaluation/Call-by-Need
• Evaluation is delayed until the result is actually needed
• The default method of evaluation in executing Haskell programs
• Sometimes operations are performed, but only a portion of the result
is needed
• Example: matrix inversion where only some elements of the inverse are needed
• Sometimes operations are performed, but control flow means the
result may not be used
• Side-effects (e.g., input/output) must still occur when expected
• May allow infinite-size data structures to be declared
• Causes only the minimal amount of computation to be performed
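• A minimal sketch of call-by-need in C (the thunk machinery is mine; Haskell does this implicitly): the expression is wrapped in a thunk, evaluated at most once, and only if it is ever forced:

#include <stdio.h>

struct thunk {
    int (*compute)(void);          /* the delayed expression */
    int value;
    int forced;                    /* has compute() run yet? */
};

static int force(struct thunk *t) {
    if (!t->forced) {              /* evaluate at most once (memoize) */
        t->value  = t->compute();
        t->forced = 1;
    }
    return t->value;
}

static int expensive(void) {
    puts("computing...");          /* visible marker of when evaluation happens */
    return 42;
}

int main(void) {
    struct thunk t = { expensive, 0, 0 };
    /* nothing has been computed yet; eager evaluation would have run
       expensive() at the binding above */
    printf("%d\n", force(&t));     /* prints "computing..." then 42 */
    printf("%d\n", force(&t));     /* prints 42 only: result was memoized */
    return 0;
}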
Speculative Evaluation
• Execute code in advance of being needed if resources are available
• Take advantage of idle resources
• Have result immediately available, if needed
• Side-effects either must not occur (e.g., input/output) or must be capable of
being reverted or undone (e.g., restoring the previous values of changed
variables)
• More computation may be performed overall, but the time to completion
of the program can be reduced
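• A hedged sketch with POSIX threads: an idle processor computes a result ahead of need, and because the worker is side-effect free, the result can simply be discarded if control flow never uses it:

#include <pthread.h>
#include <stdio.h>

static int speculative_result;

static void *speculate(void *arg) {    /* must be free of side-effects */
    int n = *(int *)arg;
    speculative_result = n * n;        /* stand-in for a costly computation */
    return 0;
}

int main(void) {
    int n = 7;
    pthread_t t;
    pthread_create(&t, 0, speculate, &n);  /* start before we know it is needed */

    int needed = n > 5;                /* control flow resolved later */

    pthread_join(&t, 0);
    if (needed)
        printf("%d\n", speculative_result);  /* result already available */
    return 0;
}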
Locality of Data to Processor
• In a multi-processor system, having data local to a processor is very
important
• Data in registers is fastest
• Data in main memory is an order of magnitude slower
• Data accessed over a network is slower still
• Data in mass storage is much slower