Very Long Instruction Word (VLIW)

• VLIW – architectures and scheduling techniques (Ch. 3.5)

✓ VLIW architecture (3.5.2)
✓ VLIW and loop unrolling (3.5.3)
✓ VLIW and software pipelining (3.5.4)
✓ Non-cyclic VLIW scheduling (3.5.5)
✓ Predicated instructions (3.5.6)

Michel Dubois, Murali Annavaram and Per Stenström © 2019

Static Scheduling:
Revisiting Pipeline Design
(Ch 3.5.2)

Duality of Dynamic and Static
Techniques
• Instruction scheduling: Compiler moves instructions.
Same issues: data-flow and exception model
• Software register renaming for WAW and WAR hazards
• Memory disambiguation must be done by the compiler
• Branch prediction scheme: static prediction
• Speculation: speculate based on static branch prediction.
Test dynamically and execute patch-up/recovery code if the
speculation fails

Sometimes there is no need to speculate because the compiler knows the structure of the program (e.g., loops)

VLIW (Very-Long Instruction Word)
Architectures

• Pipeline is simple with no hazard detection

• Compiler schedules instructions in “packets” or long
instruction words (two memory, two floating-point and an
integer operation in the example)
• Forwarding helps but is not needed
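
As a rough illustration of packet formation, here is a minimal sketch of a greedy scheduler that packs operations into long instruction words for the example machine (two memory slots, two floating-point slots, one integer slot). The operation format, the function name `pack`, the unit-latency assumption, and the sample operations are all hypothetical choices for this sketch; a real VLIW compiler uses more elaborate heuristics.

```python
from collections import Counter

# Slots per long instruction word, as in the slide's example machine.
SLOTS = {"mem": 2, "fp": 2, "int": 1}

def pack(ops):
    """Greedily pack operations into packets (long instruction words).

    ops: list of (name, unit, deps); deps names earlier ops whose results
    are needed (RAW dependences). Assumes every operation has a latency of
    one cycle, which is an assumption of this sketch.
    Returns a list of packets, each a list of operation names.
    """
    cycle_of = {}   # op name -> cycle in which it issues
    packets = []    # packets[c] = names issued in cycle c
    used = []       # used[c] = Counter of slots consumed in cycle c
    for name, unit, deps in ops:
        # Earliest cycle allowed by RAW dependences.
        c = max((cycle_of[d] + 1 for d in deps), default=0)
        # Advance until a slot of the right kind is free.
        while True:
            while len(packets) <= c:
                packets.append([])
                used.append(Counter())
            if used[c][unit] < SLOTS[unit]:
                break
            c += 1
        packets[c].append(name)
        used[c][unit] += 1
        cycle_of[name] = c
    return packets

# Hypothetical operations: three loads feeding three FP adds, one store,
# and one independent integer operation.
ops = [
    ("ld1", "mem", []),
    ("ld2", "mem", []),
    ("ld3", "mem", []),
    ("add1", "fp", ["ld1"]),
    ("add2", "fp", ["ld2"]),
    ("add3", "fp", ["ld3"]),
    ("st1", "mem", ["add1"]),
    ("addi", "int", []),
]
packets = pack(ops)
for cycle, packet in enumerate(packets):
    print(cycle, packet)
```

Because the hardware has no hazard detection, the packing itself must guarantee that no packet exceeds the slot counts and that every consumer issues after its producer.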

Program Example

Compiler scheduling
• Local (inside a basic block) or global (across basic blocks)
• Cyclic (loop unrolling or software pipelining) or non-cyclic
(trace scheduling)

Loop Unrolling for VLIW
(Ch 3.5.3)

VLIW – Loop Unrolling

[Figure: schedule of the unrolled loop; empty operation slots are marked UNUSED]

Issues with loop unrolling

• Code size
• Empty slots
• Register pressure
• Binary compatibility
• Limited scope for ILP exploitation
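
The "empty slots" issue can be quantified with two simple numbers: instructions per cycle (IPC) and the count of unused slots. The sketch below shows the arithmetic; the function name `slot_stats` and the sample numbers (14 operations scheduled in 6 long instructions on a 4-slot machine) are hypothetical and are not taken from the slides.

```python
from fractions import Fraction

def slot_stats(num_ops, num_cycles, width):
    """Return (IPC, unused slots) for a statically scheduled code region.

    num_ops:    useful operations in the schedule
    num_cycles: long instructions issued (one per cycle)
    width:      operation slots per long instruction
    """
    ipc = Fraction(num_ops, num_cycles)
    unused = num_cycles * width - num_ops
    return ipc, unused

# Hypothetical schedule: 14 operations in 6 cycles on a 4-slot machine.
ipc, unused = slot_stats(num_ops=14, num_cycles=6, width=4)
print(ipc, unused)  # 7/3 10
```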

Quiz 7.1

Which of the following statements are correct, assuming that all instructions take a single cycle to execute?

a) IPC = 5
b) IPC = 23/9
c) The number of unused slots is 21

Software Pipelining for VLIW
(Ch 3.5.4)

VLIW – Software Pipelining

[Figure: software-pipelined schedule of loop iterations over time]

Data hazards
• RAW hazards are handled correctly
• For WAR hazards, use rotating registers (register renaming technique)
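
To make the idea concrete, here is a minimal sketch of a software-pipelined version of the loop LD / ADD / SD (that is, x[i] = x[i] + s). In each step the kernel works on three different iterations at once. The function name, the lag parameters, and their default values are assumptions of this sketch; the dictionaries keyed by iteration number stand in for renamed registers, so values in flight are never overwritten (no WAR hazard).

```python
def software_pipelined_add(x, s, add_lag=1, store_lag=2):
    """Compute x[i] + s for all i with a software-pipelined loop.

    In step t the kernel stores for iteration t - store_lag, adds for
    iteration t - add_lag, and loads for iteration t. Steps in which some
    stage has no valid iteration correspond to the prologue and epilogue.
    The lag values are illustrative, not taken from the slides.
    """
    assert 0 < add_lag < store_lag
    n = len(x)
    out = list(x)
    loaded = {}   # iteration -> value produced by LD
    summed = {}   # iteration -> value produced by ADD
    for t in range(n + store_lag):
        i_sd, i_add, i_ld = t - store_lag, t - add_lag, t
        # The three stages touch different iterations, so they are
        # independent of each other within one step.
        if 0 <= i_sd < n:
            out[i_sd] = summed.pop(i_sd)            # SD  F4,0(R1)
        if 0 <= i_add < n:
            summed[i_add] = loaded.pop(i_add) + s   # ADD F4,F0,F2
        if 0 <= i_ld < n:
            loaded[i_ld] = out[i_ld]                # LD  F0,0(R1)
    return out

print(software_pipelined_add([1.0, 2.0, 3.0, 4.0], 10.0))
# [11.0, 12.0, 13.0, 14.0]
```

RAW dependences are respected because each stage only consumes a value produced in an earlier step.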

VLIW – Rotating Registers

LD F0,0(R1)
ADD F4,F0,F2
SD F4,0(R1)
• Two iterations between LD and ADD; RR6 and RR4 point to same
physical register
• Three iterations between ADD and SD; RR3 and RR0 point to same
register
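
The mapping from rotating register names to physical registers can be sketched as modular arithmetic. The class below is a hypothetical model, not the textbook's definition: it assumes a rotation base that advances by one per loop iteration, with the direction chosen so that it reproduces the two statements above (RR6 becomes RR4 after two iterations, RR3 becomes RR0 after three). The register count of 8 is an arbitrary choice for the example.

```python
class RotatingRegisterFile:
    """Logical rotating registers RR0..RR(n-1) mapped onto n physical registers."""

    def __init__(self, n):
        self.n = n
        self.base = 0   # rotation offset, advanced once per loop iteration

    def physical(self, logical):
        """Physical register currently named RR<logical>."""
        return (logical + self.base) % self.n

    def next_iteration(self):
        """Rotate: every physical register is now reached by a name one lower."""
        self.base = (self.base + 1) % self.n


rf = RotatingRegisterFile(8)
ld_target = rf.physical(6)        # LD writes RR6 in some iteration
rf.next_iteration()
rf.next_iteration()
print(rf.physical(4) == ld_target)   # True: ADD reads RR4 two iterations later

rf = RotatingRegisterFile(8)
add_target = rf.physical(3)       # ADD writes RR3
for _ in range(3):
    rf.next_iteration()
print(rf.physical(0) == add_target)  # True: SD reads RR0 three iterations later
```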

Quiz 7.2

Let RR6 point to physical register X in iteration Y. There are 18 physical registers. Which of the following statements are correct?

a) RR0 points to X after Y+13 iterations
b) RR5 points to X after Y+17 iterations
c) RR0 points to X after Y+6 iterations
d) RR5 points to X after Y+1 iterations

VLIW – Slot Conflicts
• Restrict the number of slots: 1 LD/ST, 1 FP, 1 BR/INT
• Two LD/ST operations per iteration, so the kernel needs two instructions

With rotating registers
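
The slot-conflict bound can be computed directly: for each kind of slot, divide the number of operations per iteration by the number of slots and round up, then take the maximum. The sketch below does this; the function name is hypothetical, and the operation counts cover only the LD, ADD, and SD of the loop body (loop-overhead operations are left out as an assumption). In the modulo-scheduling literature this bound is commonly called the resource-constrained minimum initiation interval, a term the slides do not use.

```python
import math

def resource_bound(ops_per_iteration, slots):
    """Smallest kernel length (in long instructions) allowed by the slot counts."""
    return max(
        math.ceil(count / slots[unit])
        for unit, count in ops_per_iteration.items()
        if count > 0
    )

# Restricted machine from the slide: 1 LD/ST, 1 FP, 1 BR/INT slot.
slots = {"ldst": 1, "fp": 1, "br_int": 1}
# Loop body LD, ADD, SD: two LD/ST operations and one FP operation.
loop_body = {"ldst": 2, "fp": 1}

print(resource_bound(loop_body, slots))  # 2
```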

VLIW – Loop-Carried Dependencies 1(3)
Loop-carried dependency = dependency across loop iterations

Dependency spans two loop iterations:

Dependency is through memory; rotating registers do not help

Let’s look at the data dependency graph!

VLIW – Loop-Carried Dependencies 2(3)
Edge labels: (iteration, min cycles to resolve RAW)

• The cycle in the graph takes 6 cycles and spans two iterations, so each iteration needs at least 3 cycles
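
The arithmetic behind this bound is the latency of the dependence cycle divided by the number of iterations it spans, rounded up. A minimal sketch follows; the function name and the list-of-pairs input format are assumptions made for illustration.

```python
import math

def recurrence_bound(dependence_cycles):
    """Minimum cycles per iteration imposed by loop-carried dependence cycles.

    dependence_cycles: list of (total_latency, iterations_spanned) pairs,
    one pair per cycle in the data dependency graph.
    """
    return max(
        math.ceil(latency / distance)
        for latency, distance in dependence_cycles
    )

# The cycle from the slide: 6 cycles of latency spanning two iterations.
print(recurrence_bound([(6, 2)]))  # 3
```

No amount of extra hardware slots lowers this bound, because it comes from the data dependences and not from resource limits.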

VLIW – Loop-Carried Dependencies 3(3)

Quiz 7.3

How many clocks does it take to execute all instructions in each of the original iterations?
a) 3
b) 6
c) 9

Non-Cyclic Scheduling
(Ch 3.5.5)

Non-Cyclic Scheduling
• Most likely path (A, B, C in the example) is established through profiling
• This path (A, B, C), called a trace, forms a larger basic block of code for instruction scheduling
• Instruction scheduling respects RAW dependencies but can ignore control dependencies
• Must fix up the execution on branch misspeculation so that the misspeculated trace (A, D, C) is correctly executed

Example

Original code (most likely trace: A, B, C)
      LW   R4,0(R1)
      ADDI R6,R4,#1        /* block A */
      BEQ  R5,R4,LAB
      LW   R6,0(R2)        /* block B */
                           /* block D: empty */
LAB:  SW   R6,0(R1)

Trace schedule
      LW   R4,0(R1)
      ADDI R6,R4,#1
      LW   R6,0(R2)
      BEQ  R5,R4,LAB1
LAB2: SW   R6,0(R1)
      /* jump to second trace if prediction is wrong */
      …
LAB1: ADDI R6,R4,#1
      J    LAB2

Optimized trace
      LW   R4,0(R1)
      LW   R6,0(R2)
      BEQ  R5,R4,LAB1
LAB2: SW   R6,0(R1)
      /* jump to second trace if prediction is wrong */
      …
LAB1: ADDI R6,R4,#1
      J    LAB2
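
One way to convince yourself that the optimized trace is a legal rewrite is to model both versions and compare their effect on memory for both branch outcomes. The sketch below does this in Python; the function names, the dictionary used as memory, and the sample addresses and values are hypothetical. It ignores exceptions that the hoisted load could raise, which a real compiler must also handle.

```python
def run_original(mem, r1, r2, r5):
    """Original code: LW R6,0(R2) is skipped when the branch is taken."""
    mem = dict(mem)
    r4 = mem[r1]              # LW   R4,0(R1)
    r6 = r4 + 1               # ADDI R6,R4,#1
    if r5 != r4:              # BEQ  R5,R4,LAB falls through
        r6 = mem[r2]          # LW   R6,0(R2)   (block B)
    mem[r1] = r6              # LAB: SW R6,0(R1)
    return mem

def run_optimized_trace(mem, r1, r2, r5):
    """Optimized trace: the load is hoisted above the branch, and the ADDI
    moves into the patch-up code at LAB1."""
    mem = dict(mem)
    r4 = mem[r1]              # LW   R4,0(R1)
    r6 = mem[r2]              # LW   R6,0(R2)   (speculative)
    if r5 == r4:              # BEQ  R5,R4,LAB1 taken: misspeculation
        r6 = r4 + 1           # LAB1: ADDI R6,R4,#1 ; J LAB2
    mem[r1] = r6              # LAB2: SW R6,0(R1)
    return mem

memory = {100: 7, 200: 42}    # hypothetical addresses and contents
for r5 in (7, 3):             # branch taken, then not taken
    a = run_original(memory, 100, 200, r5)
    b = run_optimized_trace(memory, 100, 200, r5)
    print(r5, a == b, a[100])
```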

Quiz 7.4

Consider the original code, the trace schedule, and the optimized trace from the example above. Which of the following statements are correct?
a) The “non-taken” trace consists of 1 more instruction in the original code compared to the optimized trace
b) The “non-taken” trace consists of the same number of instructions in the original code and the optimized trace
c) The “taken” optimized trace executes two more instructions than in the original code

Predicated Execution
(Ch 3.5.6)

Predicated Instructions
• Trace scheduling works well if branches are highly biased (one trace
is considerably more likely than another)

Predicated instruction = conditionally executed instruction

Example 1) CLWZ R1,0(R2),R3; if (R3)==0 then LW R1,0(R2)

• Only executed if the condition is met; otherwise it behaves as a no-operation (NOP)

• Predication can be applied to any instruction
• Needs an additional operand – a predicate register
• Longer instruction not a problem in VLIW

Example – Predicated Execution

Original code               Predicated code
      LW   R4,0(R1)         LW    R4,0(R1)
      ADDI R6,R4,#1         ADDI  R6,R4,#1
      BEQ  R5,R4,LAB        SUB   R3,R5,R4
      LW   R6,0(R2)         CLWNZ R6,0(R2),R3
LAB:  SW   R6,0(R1)         SW    R6,0(R1)
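
The equivalence of the two columns can be checked with a small model. In the sketch below the predicated version contains no branch: the SUB computes the predicate, and the conditional load either updates R6 or leaves it unchanged. The function names, the dictionary used as memory, and the sample values are hypothetical.

```python
def run_original(mem, r1, r2, r5):
    """Original code with a conditional branch around the load."""
    mem = dict(mem)
    r4 = mem[r1]                          # LW   R4,0(R1)
    r6 = r4 + 1                           # ADDI R6,R4,#1
    if r5 != r4:                          # BEQ  R5,R4,LAB falls through
        r6 = mem[r2]                      # LW   R6,0(R2)
    mem[r1] = r6                          # LAB: SW R6,0(R1)
    return mem

def run_predicated(mem, r1, r2, r5):
    """Predicated code: straight-line, the load is guarded by R3."""
    mem = dict(mem)
    r4 = mem[r1]                          # LW    R4,0(R1)
    r6 = r4 + 1                           # ADDI  R6,R4,#1
    r3 = r5 - r4                          # SUB   R3,R5,R4
    r6 = mem[r2] if r3 != 0 else r6       # CLWNZ R6,0(R2),R3
    mem[r1] = r6                          # SW    R6,0(R1)
    return mem

memory = {100: 7, 200: 42}                # hypothetical addresses and contents
for r5 in (7, 3):                         # R5 == R4, then R5 != R4
    print(r5, run_original(memory, 100, 200, r5) == run_predicated(memory, 100, 200, r5))
```

Unlike trace scheduling, this version needs no patch-up code, which is why predication is attractive when the branch is not strongly biased.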

What you should know by now
• VLIW architectures
– Parallel simple pipelines
– No support for dynamic scheduling
– Assumes compiler does static scheduling

• VLIW and loop unrolling

– Challenge is to fill operation slots
• VLIW and software pipelining
– Renaming using rotating registers
– Impact of slot conflicts
– Impact of loop-carried dependencies
• Trace scheduling
• Predicated (conditional) instructions
