9 Loop Unrolling
• To keep a pipeline full, parallelism among instructions
must be exploited by finding sequences of unrelated
instructions that can be overlapped in the pipeline.
• To avoid a pipeline stall, the execution of a dependent
instruction must be separated from its source instruction
by a distance in clock cycles equal to the pipeline latency
of that source instruction.
ILP
Loop Level Parallelism and Dependence
Two types of dependences limit the degree to which loop-level parallelism can be
exploited: loop-carried dependences, where a statement uses a value computed in an
earlier iteration, and dependences within a single iteration. Consider:
A[i + 1] = A[i] + C[i];      // S1
B[i + 1] = B[i] + A[i + 1];  // S2
N.B. how do we know that the A[i+1] written in S1 and the A[i+1] read in S2 refer to the
same location? In general, by performing pointer/index-variable analysis based on
conditions known at compile time.
An Example of Loop-Level Dependences
A[i + 1] = A[i] + C[i];      // S1
B[i + 1] = B[i] + A[i + 1];  // S2
Here S1 depends on the A[i] computed by S1 in the previous iteration (a loop-carried
dependence), and likewise S2 on B[i]; S2 also depends on the A[i+1] computed by S1
within the same iteration.
We’ll make use of these concepts when we talk about software pipelining and loop unrolling!
LOCAL
We will look at two local optimizations applicable to loops: loop unrolling and software
pipelining. The two are usually complementary, in the sense that software-pipelining
schedulers typically apply loop unrolling as an earlier transformation to expose more
ILP, creating more candidate instructions to be moved across different iterations of the loop.
KEY IDEA: Eliminating the loop-management overhead (the induction-variable update and
branch executed on every iteration) could significantly increase the performance of the loop.
Assume that functional units are fully pipelined or replicated, so that one
instruction can issue every clock cycle (provided it is not waiting on a result!)
Assume no structural hazards exist, as a consequence of the previous assumption.
* CC = clock cycles
STATIC LOOP UNROLLING
Hence, if we can decrease the loop-management overhead, we can increase
performance. Static loop unrolling does this as follows:
Make n copies of the loop body, adjusting the loop-termination condition
and perhaps renaming registers (we’ll see why very soon!),
This reduces loop-management overhead, since we effectively merge
n iterations into one!
This also exposes more ILP, since it allows instructions from different iterations to
be scheduled together!
STATIC LOOP UNROLLING (continued) – issuing our instructions
The unrolled loop from the running example with an unroll factor of n = 4 would then be:
Let’s schedule the unrolled loop on our pipeline (clock cycle issued):
Unrolling with an unroll factor of n increases the code size by (approximately) a factor
of n. This might present a problem:
Imagine unrolling by a factor of n = 4 a loop that is executed a number of times that is
not a multiple of four:
one would then need to provide both the unrolled loop and a copy of the original loop
to handle the leftover iterations,
this is a problem, since we usually don’t know the upper bound (UB) on the induction
variable at compile time (which we took for granted in our example),
more formally, the original copy must be included if (UB mod n != 0), i.e. the number of
iterations is not a multiple of the unroll factor.
STATIC LOOP UNROLLING (continued)
We usually ALSO need to perform register renaming to remove name dependences within the
unrolled loop. This increases the register pressure!
The criteria for performing loop unrolling are therefore usually quite restrictive!
[Figure: multi-cycle functional units of a RISC processor (MIPS)]