Adv Topic: Compiler Supported ILP
ANGELIN GLADSTON
Slide Sources: Patterson &
Hennessy COD book & website
Scheduling Code for the MIPS Pipeline
Example:
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
Notes:
the loop is parallel – the body of each iteration is
independent of that of other iterations
conceptually: if we had 1000 CPUs, we could distribute
one iteration to each CPU and compute all of them in
parallel (= simultaneously) – see the sketch below
Only the compiler can exploit such instruction-level
parallelism (ILP), not the hardware! Why?
because only the compiler has a global view of the code
the hardware sees each line of code only after it is fetched
from memory, not all together – in particular, not the whole
loop
the compiler must schedule the code intelligently to
exploit this parallelism
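To make the iteration independence concrete, here is a minimal C sketch (an illustration not in the original slides; the OpenMP pragma stands in for the conceptual "1000 CPUs"):

#define N 1000

double x[N + 1];   /* x[1] .. x[N], as in the example loop */
double s;

void add_scalar(void)
{
    /* Each iteration reads and writes only its own x[i], so the
       iterations are independent and may execute simultaneously.
       The pragma asserts what the compiler could also prove from
       its global view of the loop; without -fopenmp it is simply
       ignored and the loop runs sequentially. */
    #pragma omp parallel for
    for (int i = N; i > 0; i = i - 1)
        x[i] = x[i] + s;
}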
Scheduling Code for the MIPS Pipeline
Assume FP operation latencies as below
latency indicates the number of intervening cycles required
between a producing and a consuming instruction to avoid a
stall
Assume an integer ALU operation latency of 0 and an
integer load latency of 1
Latency table:
Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Unscheduled Code
Original C loop statement: for (i=1000; i>0; i=i-1) x[i] = x[i] + s;
Unscheduled code for the MIPS pipeline:
Loop: L.D F0,0(R1) ;F0 = array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer
;8 bytes per DW
BNE R1,R2,Loop ;branch R1!=R2
Execution cycles for the unscheduled code:
                                Clock cycle issued
Loop: L.D    F0,0(R1)           1
      stall                     2
      ADD.D  F4,F0,F2           3
      stall                     4
      stall                     5
      S.D    F4,0(R1)           6
      DADDUI R1,R1,#-8          7
      stall                     8
      BNE    R1,R2,Loop         9
      stall                     10  ;delayed branch stall
10 clock cycles per iteration
Why only one stall after the DADDUI? Think of when the
optimized MIPS pipeline resolves branch outcomes: the BNE
compares its registers early, in the ID stage, so it must
wait one cycle for the DADDUI result.
Note that only 5 of the 10 cycles issue instructions; the
other 5 are stalls, so about half of the loop's 10 × 1000 =
10,000 total clock cycles are wasted.
Scheduled Code
Scheduled code for the MIPS pipeline:
Loop: L.D    F0,0(R1)    ;F0 = array element
      DADDUI R1,R1,#-8   ;decrement pointer
      ADD.D  F4,F0,F2    ;add scalar in F2
      stall
      BNE    R1,R2,Loop  ;delayed branch
      S.D    F4,8(R1)    ;store result; offset adjusted
                         ;after interchange with DADDUI
6 clock cycles per iteration
Unrolling the loop (four copies of the body per iteration;
see the C sketch below) and rescheduling removes even that
stall: one iteration of the unrolled loop runs in 14 clock
cycles with no stalls. Therefore, 3.5 clock cycles per
iteration of the original loop vs. 6 cycles for the
scheduled but not unrolled loop.
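At the source level, the unrolling referred to above corresponds to the following C transformation (a sketch for illustration; the compiler actually performs it on the assembly, renaming registers as it goes):

#define N 1000   /* trip count of the running example */

double x[N + 1], s;

void add_scalar_unrolled(void)
{
    /* Four copies of the body per iteration; N is divisible by 4,
       so no cleanup code is needed here. Fewer loop-overhead
       instructions and four independent adds give the scheduler
       room to hide the L.D and ADD.D latencies. */
    for (int i = N; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}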
Notes
Scheduling code (if possible) to avoid stalls is always
a win, and optimizing compilers typically generate
scheduled assembly
Unrolling loops can be advantageous, but there are
potential problems:
growth in code size (see the cleanup-loop sketch after
these notes)
register pressure: aggressive unrolling and scheduling
require many registers to be allocated
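One concrete source of the code growth mentioned above (an added sketch, not from the slides): when the trip count n is not known to be a multiple of the unroll factor, the compiler must also emit a cleanup loop for the leftover iterations:

void add_scalar_any_n(double *x, double s, int n)
{
    int i = n;
    /* Unrolled-by-4 main loop: handles groups of four iterations. */
    for (; i > 3; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
    /* Cleanup loop for the remaining 0..3 iterations: extra code
       that exists only because of the unrolling. */
    for (; i > 0; i = i - 1)
        x[i] = x[i] + s;
}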
Enhancing Loop-Level Parallelism
Consider the previous running example:
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
there is no loop-carried dependence – where data used in a
later iteration depends on data produced in an earlier one
in other words, all iterations could (conceptually) be
executed in parallel
Contrast with the following loop:
for (i=1; i<=100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
what are the dependences?
A Loop with Dependences
For the loop:
for (i=1; i<=100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
what are the dependences?
There are two different dependences:
loop-carried:
S1 computes A[i+1] using the value of A[i] computed in the
previous iteration
S2 computes B[i+1] using the value of B[i] computed in the
previous iteration
not loop-carried:
S2 uses the value A[i+1] computed by S1 in the same
iteration
(diagram: chain of dependences … A[i-1] → A[i] → A[i+1] …)
The loop-carried dependences in this case force
successive iterations of the loop to execute in series.
Why? S1 of iteration i depends on S1 of iteration i-1, which
in turn depends on S1 of iteration i-2, and so on back to the
first iteration (a runnable demonstration follows).
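A quick way to see the forced serialization of S1's recurrence (a runnable sketch, not from the slides): because each A[i+1] needs the A[i] just computed, changing the execution order changes the result, unlike the earlier x[i] = x[i] + s loop:

#include <stdio.h>

#define N 100

int main(void)
{
    double A1[N + 2], A2[N + 2], C[N + 2];
    for (int i = 0; i <= N + 1; i++) { A1[i] = A2[i] = 1.0; C[i] = 1.0; }

    /* Original order: each iteration consumes the A[i] produced
       by the previous one, so the chain is serial. */
    for (int i = 1; i <= N; i++)
        A1[i + 1] = A1[i] + C[i];

    /* Reversed order: every A2[i] is still its initial value
       when read, so the chain is broken and the answer differs. */
    for (int i = N; i >= 1; i--)
        A2[i + 1] = A2[i] + C[i];

    printf("in order:  A[101] = %g\n", A1[N + 1]);  /* prints 101 */
    printf("reversed:  A[101] = %g\n", A2[N + 1]);  /* prints 2 */
    return 0;
}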
Another Loop with Dependences
Generally, loop-carried dependences hinder ILP
if there are no loop-carried dependences all iterations could be
executed in parallel
even if there are loop-carried dependences it may be possible
to parallelize the loop – an analysis of the dependences is
required…
For the loop:
for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
what are the dependences?
There is one loop-carried dependence:
S1 uses the value of B[i] computed in the previous iteration
by S2
but this does not force iterations to execute in series. Why…?
…because S1 of iteration i depends on S2 of iteration i-1,
and the chain of dependences stops here: S2 itself reads only
C[i] and D[i], which no iteration writes!
Parallelizing Loops with Short Chains of Dependences
Parallelize the loop:
for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
Parallelized code:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
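As a sanity check (an added sketch, not from the slides), the transformed loop can be tested against the original; note that in the new loop body the S2 → S1 dependence is within a single iteration, so the iterations are now independent:

#include <stdio.h>

#define N 100

int main(void)
{
    double A[N + 2], Aref[N + 2], B[N + 2], Bref[N + 2], C[N + 2], D[N + 2];
    for (int i = 0; i <= N + 1; i++) {
        A[i] = Aref[i] = 0.5 * i;
        B[i] = Bref[i] = 2.0 * i;
        C[i] = 1.0 + i;
        D[i] = 3.0;
    }

    /* Original loop, with the loop-carried S2 -> S1 dependence. */
    for (int i = 1; i <= N; i++) {
        Aref[i] = Aref[i] + Bref[i];   /* S1 */
        Bref[i + 1] = C[i] + D[i];     /* S2 */
    }

    /* Transformed version: peel S1's first instance and S2's last;
       the remaining loop body has no loop-carried dependence. */
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];

    int ok = 1;
    for (int i = 1; i <= N + 1; i++)
        if (A[i] != Aref[i] || B[i] != Bref[i]) ok = 0;
    printf("results %s\n", ok ? "match" : "differ");  /* prints "match" */
    return 0;
}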