9 Loop Unrolling

Uploaded by

kasyap sai


Loop Level Parallelism and Dependence

• To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline.
• To avoid a pipeline stall, the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
ILP
Loop Level Parallelism and Dependence

Q: What is Loop Level Parallelism?

A: ILP that exists as a result of iterating a loop.

Two types of dependencies limit the degree to which Loop Level Parallelism can be exploited:

• Loop Carried: a dependence which only applies if a loop is iterated.
• Loop Independent: a dependence within the body of the loop itself (i.e. within one iteration).
ILP
An Example of Loop Level Dependences

Consider the following loop:

for (i = 0; i <= 100; i++) {
    A[i + 1] = A[i] + C[i];         // S1
    B[i + 1] = B[i] + A[i + 1];     // S2
}

A Loop Independent Dependence: S2 reads A[i + 1], which S1 writes in the same iteration.

N.B. how do we know that the A[i+1] in S1 and the A[i+1] in S2 refer to the same location? In general by performing pointer/index variable analysis from conditions known at compile time.

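Two Loop Carried Dependences: S1 reads A[i], which S1 wrote in the previous iteration, and S2 reads B[i], which S2 wrote in the previous iteration.

We’ll make use of these concepts when we talk about software pipelining and loop unrolling!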
LOCAL
We will look at two local optimizations, applicable to loops:

• STATIC LOOP UNROLLING: Loop unrolling replaces the body of a loop with several copies of the loop body, thus exposing more ILP.
  KEY IDEA: Reduce loop control overhead and thus increase performance.
• SOFTWARE PIPELINING: Reschedule instructions from a sequence of loop iterations to enhance the ability to exploit more ILP.
  KEY IDEA: Reduce stalls due to data dependencies.

These two are usually complementary: a software pipelining scheduler usually applies loop unrolling during some earlier transformation, which exposes more ILP and hence more candidate instructions “to be moved across different iterations of the loop”.
LOCAL

STATIC LOOP UNROLLING

OBSERVATION: A high proportion of the instructions executed in a loop are loop management instructions, i.e. instructions that update and test the induction variable (the next example should give a clearer picture).

KEY IDEA: Eliminating this overhead could potentially increase the performance of the loop significantly.

We’ll use the following loop as our example:

for (i = 999; i >= 0; i--) {
    x[i] = x[i] + constant;
}
STATIC LOOP UNROLLING (continued) – a trivial translation to MIPS

Our example translates into the MIPS assembly code below (without any scheduling).

Note the loop independent dependence in the loop, i.e. x[i] on x[i].

• R1 is initially the address of the element in the array with the highest address.
• F2 contains the scalar value.
• R2 is precomputed so that 8(R2) is the last element to operate on, i.e. the first element in the array.

Loop:   L.D     F0,0(R1)        ; F0 = array element
        ADD.D   F4,F0,F2        ; add scalar in F2
        S.D     F4,0(R1)        ; store result
        DADDUI  R1,R1,#-8       ; decrement pointer by 8 bytes (one double)
        BNE     R1,R2,Loop      ; branch if R1 != R2
LOCAL
STATIC LOOP UNROLLING (continued)

Let us assume the following latencies for our pipeline:

INSTRUCTION PRODUCING RESULT    INSTRUCTION USING RESULT    LATENCY (in CC)*
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0

Also assume that functional units are fully pipelined or replicated, such that one
instruction can issue every clock cycle (assuming it’s not waiting on a result!)
Assume no structural hazards exist, as a result of the previous assumption

* - CC == Clock Cycles
LOCAL
STATIC LOOP UNROLLING (continued) – issuing our instructions

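Let us issue the MIPS sequence of instructions obtained:

                                CLOCK CYCLE ISSUED
Loop:   L.D     F0,0(R1)        1
        stall                   2
        ADD.D   F4,F0,F2        3
        stall                   4
        stall                   5
        S.D     F4,0(R1)        6
        DADDUI  R1,R1,#-8       7
        stall                   8
        BNE     R1,R2,Loop      9
        stall                   10

The stalls in cycles 2, 4 and 5 follow from the latency table (1 cycle from load double to FP ALU op, 2 cycles from FP ALU op to store double). The stalls in cycles 8 and 10 are not covered by that table: BNE appears to wait one cycle for the R1 value produced by DADDUI, and one more cycle is lost after the branch.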
Each iteration of the loop takes 10 cycles!

• We can improve performance by rearranging the instructions, as the rescheduled loop shows.
• We can push S.D after BNE, if we alter the offset!
• We can push DADDUI between L.D and ADD.D, since R1 is the induction variable and is only used to address x[i]; the later S.D just needs its offset adjusted.
LOCAL
STATIC LOOP UNROLLING (continued) – issuing our instructions

Here is the rescheduled loop:

                                CLOCK CYCLE ISSUED
Loop:   L.D     F0,0(R1)        1
        DADDUI  R1,R1,#-8       2
        ADD.D   F4,F0,F2        3
        stall                   4
        BNE     R1,R2,Loop      5
        S.D     F4,8(R1)        6

• Each iteration now takes 6 cycles.
• This is the best we can achieve because of the inherent dependencies and pipeline latencies!
• Here we’ve decremented R1 before we’ve stored F4. Hence we need an offset of 8!
Observe that 3 out of the 6 cycles per loop iteration are due to loop overhead (DADDUI, BNE and the stall)!
LOCAL
STATIC LOOP UNROLLING (continued)

Hence, if we could decrease the loop management overhead, we could increase the
performance.

SOLUTION: Static Loop Unrolling

• Make n copies of the loop body, adjusting the loop terminating conditions and perhaps renaming registers (we’ll very soon see why!).
• This results in less loop management overhead, since we effectively merge n iterations into one!
• This exposes more ILP, since it allows instructions from different iterations to be scheduled together!
LOCAL
STATIC LOOP UNROLLING (continued) – issuing our instructions

The unrolled loop from the running example with an unroll factor of n = 4 would then be:

Loop:   L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     F4,0(R1)
        L.D     F6,-8(R1)
        ADD.D   F8,F6,F2
        S.D     F8,-8(R1)
        L.D     F10,-16(R1)
        ADD.D   F12,F10,F2
        S.D     F12,-16(R1)
        L.D     F14,-24(R1)
        ADD.D   F16,F14,F2
        S.D     F16,-24(R1)
        DADDUI  R1,R1,#-32
        BNE     R1,R2,Loop
• Note the renamed registers. This eliminates dependencies between the n loop bodies of different iterations.
• Note the adjustments for store and load offsets in each of the n = 4 loop bodies.
• Note the adjusted loop overhead instructions: DADDUI now decrements R1 by 32, once per four elements.
LOCAL
STATIC LOOP UNROLLING (continued) – issuing our instructions

Let’s schedule the unrolled loop on our pipeline:

                                CLOCK CYCLE ISSUED
Loop:   L.D     F0,0(R1)        1
        L.D     F6,-8(R1)       2
        L.D     F10,-16(R1)     3
        L.D     F14,-24(R1)     4
        ADD.D   F4,F0,F2        5
        ADD.D   F8,F6,F2        6
        ADD.D   F12,F10,F2      7
        ADD.D   F16,F14,F2      8
        S.D     F4,0(R1)        9
        S.D     F8,-8(R1)       10
        DADDUI  R1,R1,#-32      11
        S.D     F12,16(R1)      12
        BNE     R1,R2,Loop      13
        S.D     F16,8(R1)       14
• This takes 14 cycles for 1 iteration of the unrolled loop. Therefore, w.r.t. the original loop, we now have 14/4 = 3.5 cycles per iteration. Previously 6 was the best we could do!
• We gain an increase in performance, at the expense of extra code and higher register usage/pressure.
• The performance gain on superscalar architectures would be even higher!
LOCAL
STATIC LOOP UNROLLING (continued)

However, loop unrolling has some significant complications and disadvantages:

Unrolling with an unroll factor of n increases the code size by a factor of (approximately) n. This might present a problem.

Imagine unrolling a loop with a factor n = 4 that is executed a number of times that is not a multiple of four:
• one would need to provide a copy of the original loop as well as the unrolled loop,
• this would increase code size and management overhead significantly,
• this is a problem, since we usually don’t know the upper bound (UB) on the induction variable at compile time (which we took for granted in our example),
• more formally, the original copy should be included if (UB mod n != 0), i.e. the number of iterations is not a multiple of the unroll factor.

We usually ALSO need to perform register renaming to reduce dependencies within the
unrolled loop. This increases the register pressure!

The criteria for performing loop unrolling are therefore usually very restrictive!
[Figure: multi-cycle functional unit of a RISC processor (MIPS)]
