Adv Topic: Compiler Supported ILP
ANGELIN GLADSTON
Slide Sources: Patterson &
Hennessy COD book & website
Scheduling Code for the MIPS Pipeline
Example:
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
Notes:
the loop is parallel – the body of each iteration is
independent of that of other iterations
conceptually: if we had 1000 CPUs, we could distribute
one iteration to each CPU and compute all of them in
parallel (= simultaneously) – see the sketch below
Only the compiler can exploit such instruction-level
parallelism (ILP), not the hardware! Why?
because only the compiler has a global view of the code
the hardware sees each line of code only after it is fetched
from memory, not all together – in particular, not the whole
loop
the compiler must schedule the code intelligently to
exploit this parallelism
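To make the iteration independence concrete, here is a minimal C sketch (an illustration not in the original slides; the OpenMP pragma stands in for the conceptual "1000 CPUs"):

#define N 1000

double x[N + 1];   /* x[1] .. x[N], as in the example loop */
double s;

void add_scalar(void)
{
    /* Each iteration reads and writes only its own x[i], so the
       iterations are independent and may execute simultaneously.
       The pragma asserts what the compiler could also prove from
       its global view of the loop; without -fopenmp it is simply
       ignored and the loop runs sequentially. */
    #pragma omp parallel for
    for (int i = N; i > 0; i = i - 1)
        x[i] = x[i] + s;
}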
Scheduling Code for the MIPS Pipeline
Assume FP operation latencies as below
latency indicates the number of intervening cycles required
between a producing and a consuming instruction to avoid a
stall
Assume an integer ALU operation latency of 0 and an
integer load latency of 1
Latency table:
Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Unscheduled Code
Original C loop statement: for (i=1000; i>0; i=i-1) x[i] = x[i] + s;
Unscheduled code for the MIPS pipeline:
Loop: L.D F0,0(R1) ;F0 = array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer
;8 bytes per DW
BNE R1,R2,Loop ;branch R1!=R2
Execution cycles for the unscheduled code:
                                Clock cycle issued
Loop: L.D    F0,0(R1)           1
      stall                     2
      ADD.D  F4,F0,F2           3
      stall                     4
      stall                     5
      S.D    F4,0(R1)           6
      DADDUI R1,R1,#-8          7
      stall                     8
      BNE    R1,R2,Loop         9
      stall                     10  ;delayed branch stall
10 clock cycles per iteration
Why only one stall after the DADDUI? Think of when the
optimized MIPS pipeline resolves branch outcomes: the BNE
compares its registers early, in the ID stage, so it must
wait one cycle for the DADDUI result.
Note that only 5 of the 10 cycles issue instructions; the
other 5 are stalls, so about half of the loop's 10 × 1000 =
10,000 total clock cycles are wasted.
Scheduled Code
Scheduled code for the MIPS pipeline:
Loop: L.D    F0,0(R1)    ;F0 = array element
      DADDUI R1,R1,#-8   ;decrement pointer
      ADD.D  F4,F0,F2    ;add scalar in F2
      stall
      BNE    R1,R2,Loop  ;delayed branch
      S.D    F4,8(R1)    ;store result; offset adjusted
                         ;after interchange with DADDUI
6 clock cycles per iteration
Unrolling the loop (four copies of the body per iteration;
see the C sketch below) and rescheduling removes even that
stall: one iteration of the unrolled loop runs in 14 clock
cycles with no stalls. Therefore, 3.5 clock cycles per
iteration of the original loop vs. 6 cycles for the
scheduled but not unrolled loop.
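At the source level, the unrolling referred to above corresponds to the following C transformation (a sketch for illustration; the compiler actually performs it on the assembly, renaming registers as it goes):

#define N 1000   /* trip count of the running example */

double x[N + 1], s;

void add_scalar_unrolled(void)
{
    /* Four copies of the body per iteration; N is divisible by 4,
       so no cleanup code is needed here. Fewer loop-overhead
       instructions and four independent adds give the scheduler
       room to hide the L.D and ADD.D latencies. */
    for (int i = N; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}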
Notes
Scheduling code (if possible) to avoid stalls is always
a win, and optimizing compilers typically generate
scheduled assembly
Unrolling loops can be advantageous, but there are
potential problems:
growth in code size (see the cleanup-loop sketch after
these notes)
register pressure: aggressive unrolling and scheduling
require many registers to be allocated
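One concrete source of the code growth mentioned above (an added sketch, not from the slides): when the trip count n is not known to be a multiple of the unroll factor, the compiler must also emit a cleanup loop for the leftover iterations:

void add_scalar_any_n(double *x, double s, int n)
{
    int i = n;
    /* Unrolled-by-4 main loop: handles groups of four iterations. */
    for (; i > 3; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
    /* Cleanup loop for the remaining 0..3 iterations: extra code
       that exists only because of the unrolling. */
    for (; i > 0; i = i - 1)
        x[i] = x[i] + s;
}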
Enhancing Loop-Level Parallelism
Consider the previous running example:
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
there is no loop-carried dependence – where data used in a
later iteration depends on data produced in an earlier one
in other words, all iterations could (conceptually) be
executed in parallel
Contrast with the following loop:
for (i=1; i<=100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
what are the dependences?
A Loop with Dependences
For the loop:
for (i=1; i<=100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
what are the dependences?
There are two different dependences:
loop-carried:
S1 computes A[i+1] using the value of A[i] computed in the
previous iteration
S2 computes B[i+1] using the value of B[i] computed in the
previous iteration
not loop-carried:
S2 uses the value A[i+1] computed by S1 in the same
iteration
(diagram: chain of dependences … A[i-1] → A[i] → A[i+1] …)
The loop-carried dependences in this case force
successive iterations of the loop to execute in series.
Why? S1 of iteration i depends on S1 of iteration i-1, which
in turn depends on S1 of iteration i-2, and so on back to the
first iteration (a runnable demonstration follows).
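A quick way to see the forced serialization of S1's recurrence (a runnable sketch, not from the slides): because each A[i+1] needs the A[i] just computed, changing the execution order changes the result, unlike the earlier x[i] = x[i] + s loop:

#include <stdio.h>

#define N 100

int main(void)
{
    double A1[N + 2], A2[N + 2], C[N + 2];
    for (int i = 0; i <= N + 1; i++) { A1[i] = A2[i] = 1.0; C[i] = 1.0; }

    /* Original order: each iteration consumes the A[i] produced
       by the previous one, so the chain is serial. */
    for (int i = 1; i <= N; i++)
        A1[i + 1] = A1[i] + C[i];

    /* Reversed order: every A2[i] is still its initial value
       when read, so the chain is broken and the answer differs. */
    for (int i = N; i >= 1; i--)
        A2[i + 1] = A2[i] + C[i];

    printf("in order:  A[101] = %g\n", A1[N + 1]);  /* prints 101 */
    printf("reversed:  A[101] = %g\n", A2[N + 1]);  /* prints 2 */
    return 0;
}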
Another Loop with Dependences
Generally, loop-carried dependences hinder ILP
if there are no loop-carried dependences all iterations could be
executed in parallel
even if there are loop-carried dependences it may be possible
to parallelize the loop – an analysis of the dependences is
required…
For the loop:
for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
what are the dependences?
There is one loop-carried dependence:
S1 uses the value of B[i] computed in the previous iteration
by S2
but this does not force iterations to execute in series. Why…?
…because S1 of iteration i depends on S2 of iteration i-1,
and the chain of dependences stops here: S2 itself reads only
C[i] and D[i], which no iteration writes!
Parallelizing Loops with Short Chains of Dependences
Parallelize the loop:
for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
Parallelized code:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
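As a sanity check (an added sketch, not from the slides), the transformed loop can be tested against the original; note that in the new loop body the S2 → S1 dependence is within a single iteration, so the iterations are now independent:

#include <stdio.h>

#define N 100

int main(void)
{
    double A[N + 2], Aref[N + 2], B[N + 2], Bref[N + 2], C[N + 2], D[N + 2];
    for (int i = 0; i <= N + 1; i++) {
        A[i] = Aref[i] = 0.5 * i;
        B[i] = Bref[i] = 2.0 * i;
        C[i] = 1.0 + i;
        D[i] = 3.0;
    }

    /* Original loop, with the loop-carried S2 -> S1 dependence. */
    for (int i = 1; i <= N; i++) {
        Aref[i] = Aref[i] + Bref[i];   /* S1 */
        Bref[i + 1] = C[i] + D[i];     /* S2 */
    }

    /* Transformed version: peel S1's first instance and S2's last;
       the remaining loop body has no loop-carried dependence. */
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];

    int ok = 1;
    for (int i = 1; i <= N + 1; i++)
        if (A[i] != Aref[i] || B[i] != Bref[i]) ok = 0;
    printf("results %s\n", ok ? "match" : "differ");  /* prints "match" */
    return 0;
}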