Very Large Instruction Word (VLIW)

• VLIW – architectures and scheduling techniques (Ch. 3.5)
  – VLIW architecture (3.5.2)
  – VLIW and loop unrolling (3.5.3)
  – VLIW and software pipelining (3.5.4)
  – Non-cyclic VLIW scheduling (3.5.5)
  – Predicated instructions (3.5.6)

Michel Dubois, Murali Annavaram and Per Stenström © 2019


Static Scheduling:
Revisiting Pipeline Design
(Ch 3.5.2)


Duality of Dynamic and Static Techniques
• Instruction scheduling: the compiler moves instructions. It faces the same issues as dynamic scheduling: data flow and the exception model
• Software register renaming removes WAW and WAR hazards
• Memory disambiguation must be done by the compiler
• Branch prediction: static prediction
• Speculation: the compiler speculates based on static branch prediction. The outcome is tested dynamically, and patch-up/recovery code is executed if the speculation fails

Sometimes there is no need to speculate because the compiler knows the structure of the program (e.g., loops)


VLIW (Very-Long Instruction Word) Architectures

• The pipeline is simple, with no hazard detection
• The compiler schedules instructions in “packets” or long instruction words (two memory, two floating-point and one integer operation in the example)
• Forwarding helps performance but is not required
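
As an illustration of how a compiler might fill such packets, here is a minimal greedy list-scheduling sketch in Python. The slot mix (two memory, two floating-point, one integer) is taken from the slide. The `Op` record, the `schedule` function, the operation names and the one-cycle latencies are assumptions made for this example, and WAR/WAW hazards are assumed to have been removed by renaming already.

```python
from dataclasses import dataclass

# Slot mix per long instruction word, as in the slide's example.
SLOTS = {"mem": 2, "fp": 2, "int": 1}


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # "mem", "fp" or "int"
    deps: tuple = ()   # names of earlier ops whose results this op reads (RAW)
    latency: int = 1   # assumed cycles before the result can be used


def schedule(ops):
    """Greedy list scheduling: put each op in the earliest packet that
    respects its RAW dependencies and still has a free slot of its type."""
    by_name = {op.name: op for op in ops}
    issue = {}     # op name -> cycle in which it issues
    packets = []   # packets[c] = names of the ops issued in cycle c
    used = []      # used[c] = {unit: slots already taken in cycle c}
    for op in ops:  # ops are assumed to be listed in program order
        cycle = max((issue[d] + by_name[d].latency for d in op.deps), default=0)
        while True:
            while len(packets) <= cycle:
                packets.append([])
                used.append({unit: 0 for unit in SLOTS})
            if used[cycle][op.unit] < SLOTS[op.unit]:
                break
            cycle += 1  # slot conflict: try the next packet
        issue[op.name] = cycle
        packets[cycle].append(op.name)
        used[cycle][op.unit] += 1
    return packets


# Hypothetical straight-line code: three loads, each feeding one FP add.
ops = [
    Op("ld1", "mem"), Op("ld2", "mem"), Op("ld3", "mem"),
    Op("add1", "fp", ("ld1",)), Op("add2", "fp", ("ld2",)),
    Op("add3", "fp", ("ld3",)),
]
packets = schedule(ops)
for cycle, packet in enumerate(packets):
    print(cycle, packet)
```

The third load cannot share a packet with the first two because only two memory slots exist, which in turn delays the add that depends on it. Because the hardware has no hazard detection, a real compiler must get these latencies exactly right.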


Program Example

Compiler scheduling
• Local (inside a basic block) or global (across basic blocks)
• Cyclic (loop unrolling or software pipelining) or non-cyclic (trace scheduling)


Loop Unrolling for VLIW
(Ch 3.5.3)


VLIW – Loop Unrolling

[Figure: schedule of the unrolled loop; some operation slots are marked UNUSED]

Issues with loop unrolling


• Code size
• Empty slots
• Register pressure
• Binary compatibility
• Limited scope for ILP exploitation
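
The cost of empty slots can be quantified with two simple numbers: instructions per cycle (IPC) and the count of unused slots. The sketch below is a minimal illustration; the `slot_stats` helper and the three-word schedule are hypothetical (they are not the schedule shown on the slides), and it assumes that one long instruction word issues per cycle.

```python
def slot_stats(schedule, width):
    """Return (IPC, unused slots) for a schedule given as a list of long words."""
    ops = sum(len(word) for word in schedule)
    cycles = len(schedule)  # assumes one long word issues per cycle
    return ops / cycles, width * cycles - ops


# Hypothetical schedule on a 5-slot machine.
schedule = [["LD", "LD"], ["ADD", "ADD"], ["SD", "SD"]]
ipc, unused = slot_stats(schedule, width=5)
print(ipc, unused)
```

Here 6 operations occupy 3 long words of 5 slots each, so the IPC is 6/3 and 15 - 6 slots stay empty.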


Quiz 7.1

Which of the following statements are correct, assuming that all instructions take a single cycle to execute?

a) IPC = 5
b) IPC = 23/9
c) The number of unused slots is 21


Software Pipelining for VLIW
(Ch 3.5.4)


VLIW – Software Pipelining

Data hazards
• RAW hazards are handled correctly
• For WAR hazards, use rotating registers (register renaming technique)


VLIW – Rotating Registers

LD F0,0(R1)
ADD F4,F0,F2
SD F4,0(R1)
• Two iterations between LD and ADD; RR6 and RR4 point to the same physical register
• Three iterations between ADD and SD; RR3 and RR0 point to the same physical register
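
The renaming effect of rotating registers can be sketched in a few lines of Python. This is only a model under stated assumptions: the class name `RotatingRegisterFile`, the choice of 8 physical registers, and the rule "physical = (logical + base) mod N with the base incremented once per iteration" are all hypothetical. The direction of rotation was chosen so that it matches the two bullets above (RR6 becomes RR4 after two iterations, RR3 becomes RR0 after three); real machines may rotate in the other direction.

```python
class RotatingRegisterFile:
    """Maps logical rotating registers (RR0, RR1, ...) to physical registers."""

    def __init__(self, num_physical):
        self.n = num_physical
        self.base = 0  # rotating register base, bumped once per loop iteration

    def physical(self, logical):
        return (logical + self.base) % self.n

    def next_iteration(self):
        # Assumed direction: a value written through RR6 is visible as RR5
        # one iteration later and as RR4 two iterations later.
        self.base = (self.base + 1) % self.n


rrf = RotatingRegisterFile(num_physical=8)

ld_target = rrf.physical(6)    # LD writes its result through RR6
rrf.next_iteration()
rrf.next_iteration()
add_source = rrf.physical(4)   # two iterations later, ADD reads RR4

add_target = rrf.physical(3)   # ADD writes its result through RR3
for _ in range(3):
    rrf.next_iteration()
sd_source = rrf.physical(0)    # three iterations later, SD reads RR0

print(ld_target == add_source, add_target == sd_source)
```

Each iteration of the kernel uses the same logical names, yet every in-flight value lives in its own physical register, which is what removes the WAR hazards.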


Quiz 7.2

Let RR6 point to physical register X in iteration Y. There are 18 physical registers. Which of the following statements are correct?

a) RR0 points to X after Y+13 iterations
b) RR5 points to X after Y+17 iterations
c) RR0 points to X after Y+6 iterations
d) RR5 points to X after Y+1 iterations


VLIW – Slot Conflicts
• Restrict the number of slots: 1 LD/ST, 1 FP, 1 BR/INT
• Two LD/ST operations per iteration, so the kernel needs two instructions

With rotating registers
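
The kernel length forced by slot conflicts is the largest ratio, over all slot types, of operations needed per iteration to slots available per long word, rounded up. Modulo-scheduling literature usually calls this the resource-constrained bound; the name `resource_bound` and the dictionaries below are assumptions made for this sketch. The slide only states the LD/ST demand, so the BR/INT count is left out rather than guessed.

```python
from math import ceil


def resource_bound(ops_per_iteration, slots):
    """Minimum kernel length (in long words) imposed by slot conflicts."""
    return max(ceil(n / slots[unit]) for unit, n in ops_per_iteration.items())


slots = {"LD/ST": 1, "FP": 1, "BR/INT": 1}   # restricted machine from the slide
ops = {"LD/ST": 2, "FP": 1}                  # LD and SD, plus one FP ADD
kernel_length = resource_bound(ops, slots)
print(kernel_length)
```

With a single LD/ST slot, the two memory operations of each iteration cannot fit into one long word, so the kernel needs two.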


VLIW–Loop Carried Dependencies 1(3)
Loop carried dependency = dependency across loop iterations

Dependency spans two loop iterations:

Dependency is through memory; rotating registers do not help

Let’s look at the data dependency graph!


VLIW–Loop Carried Dependencies 2(3)
(iteration, min cycles to resolve RAW)

• The cycle in the graph takes 6 cycles and spans two iterations, so each iteration needs at least 6/2 = 3 cycles
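
The same arithmetic generalizes to any number of dependence cycles: each cycle contributes its total latency divided by the number of iterations it spans, rounded up, and the largest value wins. The helper name `recurrence_bound` is hypothetical; the input (6, 2) is the cycle described on the slide.

```python
from math import ceil


def recurrence_bound(dependence_cycles):
    """dependence_cycles: iterable of (total_latency, iterations_spanned)."""
    return max(ceil(latency / span) for latency, span in dependence_cycles)


bound = recurrence_bound([(6, 2)])  # 6 cycles spread over two iterations
print(bound)
```

No amount of extra hardware slots lowers this bound, because it comes from the data dependencies themselves.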


VLIW–Loop Carried Dependencies 3(3)


Quiz 7.3

How many clocks does it take to execute all instructions in each of the original iterations?
a) 3
b) 6
c) 9


Non-Cyclic Scheduling
(Ch 3.5.5)


Non-Cyclic Scheduling
• The most likely path (A, B, C in the example) is established through profiling
• This path (A, B, C), called a trace, forms a larger basic block of code for instruction scheduling
• Instruction scheduling respects RAW dependencies but can ignore control dependencies
• Execution must be fixed up on a branch misspeculation so that the alternative path (A, D, C) is executed correctly


Example (most likely trace)

Original code
      LW   R4,0(R1)
      ADDI R6,R4,#1
      /* block A */
      BEQ  R5,R4,LAB
      LW   R6,0(R2)
      /* block B */
      /* block D: empty */
LAB:  SW   R6,0(R1)

Trace schedule
      LW   R4,0(R1)
      ADDI R6,R4,#1
      LW   R6,0(R2)
      BEQ  R5,R4,LAB1
LAB2: SW   R6,0(R1)
      /* jump to second trace if prediction is wrong */
      ….
LAB1: ADDI R6,R4,#1
      J    LAB2

Optimized trace
      LW   R4,0(R1)
      LW   R6,0(R2)
      BEQ  R5,R4,LAB1
LAB2: SW   R6,0(R1)
      /* jump to second trace if prediction is wrong */
      ….
LAB1: ADDI R6,R4,#1
      J    LAB2
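
To check that the patch-up code at LAB1 really preserves the program's meaning, the three versions of the example can be modeled as small Python functions and compared on both branch outcomes. This is a sketch under assumptions: memory is a dictionary from address to value, registers are plain variables, the function names are invented for the illustration, and the load that is hoisted above the branch is assumed never to fault.

```python
def original(mem, r1, r2, r5):
    r4 = mem[r1]          # LW   R4,0(R1)
    r6 = r4 + 1           # ADDI R6,R4,#1
    if r5 != r4:          # BEQ  R5,R4,LAB falls through when R5 != R4
        r6 = mem[r2]      # LW   R6,0(R2)
    mem[r1] = r6          # LAB: SW R6,0(R1)
    return mem


def trace_schedule(mem, r1, r2, r5):
    r4 = mem[r1]          # LW   R4,0(R1)
    r6 = r4 + 1           # ADDI R6,R4,#1
    r6 = mem[r2]          # LW   R6,0(R2), moved above the branch
    if r5 == r4:          # BEQ  R5,R4,LAB1: the prediction was wrong
        r6 = r4 + 1       # LAB1: ADDI R6,R4,#1, then J LAB2
    mem[r1] = r6          # LAB2: SW R6,0(R1)
    return mem


def optimized_trace(mem, r1, r2, r5):
    r4 = mem[r1]          # LW   R4,0(R1)
    r6 = mem[r2]          # LW   R6,0(R2); the ADDI is gone from the main trace
    if r5 == r4:          # BEQ  R5,R4,LAB1
        r6 = r4 + 1       # LAB1: ADDI R6,R4,#1, then J LAB2
    mem[r1] = r6          # LAB2: SW R6,0(R1)
    return mem


def all_agree(r5):
    """Run the three versions on the same hypothetical memory image."""
    image = {0: 7, 8: 42}  # R1 = 0 and R2 = 8 are made-up addresses
    results = [f(dict(image), 0, 8, r5)
               for f in (original, trace_schedule, optimized_trace)]
    return results[0] == results[1] == results[2], results[0]


print(all_agree(r5=3))   # branch not taken: the most likely trace
print(all_agree(r5=7))   # branch taken: the patch-up code runs
```

The optimized trace can drop the ADDI from the main path because its result is always overwritten by the hoisted load there; only the patch-up path still needs it.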


Quiz 7.4
Original code
      LW   R4,0(R1)
      ADDI R6,R4,#1
      /* block A */
      BEQ  R5,R4,LAB
      LW   R6,0(R2)
      /* block B */
      /* block D: empty */
LAB:  SW   R6,0(R1)

Trace schedule
      LW   R4,0(R1)
      ADDI R6,R4,#1
      LW   R6,0(R2)
      BEQ  R5,R4,LAB1
LAB2: SW   R6,0(R1)
      /* jump to second trace if prediction is wrong */
      ….
LAB1: ADDI R6,R4,#1
      J    LAB2

Optimized trace
      LW   R4,0(R1)
      LW   R6,0(R2)
      BEQ  R5,R4,LAB1
LAB2: SW   R6,0(R1)
      /* jump to second trace if prediction is wrong */
      ….
LAB1: ADDI R6,R4,#1
      J    LAB2

Which of the following statements are correct?
a) The “non-taken” trace consists of 1 more instruction in the original code compared to the optimized trace
b) The “non-taken” trace consists of the same number of instructions in the original code and the optimized trace
c) The “taken” optimized trace executes two more instructions than in the original code


Predicated Execution
(Ch 3.5.6)


Predicated Instructions
• Trace scheduling works well if branches are highly biased (one trace is considerably more likely than another)

Predicated instruction = conditionally executed instruction

Example 1: CLWZ R1,0(R2),R3 ; if (R3) == 0 then LW R1,0(R2)

• Executed only if the condition is met; otherwise it acts as a no-operation
• Predication can be applied to any instruction
• Needs an additional operand: a predicate register
• The longer instruction is not a problem in VLIW


Example – Predicated Execution
Original code              Predicated code
      LW   R4,0(R1)        LW    R4,0(R1)
      ADDI R6,R4,#1        ADDI  R6,R4,#1
      BEQ  R5,R4,LAB       SUB   R3,R5,R4
      LW   R6,0(R2)        CLWNZ R6,0(R2),R3
LAB:  SW   R6,0(R1)        SW    R6,0(R1)
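
The equivalence of the two columns can be checked with a small Python model. It is a sketch under assumptions: memory is a dictionary, the function names are invented for the illustration, and CLWNZ is taken to mean "load if the predicate register is not zero, otherwise do nothing", by analogy with the CLWZ definition given earlier.

```python
def clwnz(old_value, mem, addr, predicate):
    """Assumed CLWNZ semantics: load when predicate != 0, else act as a NOP."""
    return mem[addr] if predicate != 0 else old_value


def branching(mem, r1, r2, r5):
    r4 = mem[r1]                  # LW   R4,0(R1)
    r6 = r4 + 1                   # ADDI R6,R4,#1
    if r5 != r4:                  # BEQ  R5,R4,LAB skips the load when equal
        r6 = mem[r2]              # LW   R6,0(R2)
    mem[r1] = r6                  # LAB: SW R6,0(R1)
    return mem


def predicated(mem, r1, r2, r5):
    r4 = mem[r1]                  # LW    R4,0(R1)
    r6 = r4 + 1                   # ADDI  R6,R4,#1
    r3 = r5 - r4                  # SUB   R3,R5,R4 (zero exactly when R5 == R4)
    r6 = clwnz(r6, mem, r2, r3)   # CLWNZ R6,0(R2),R3
    mem[r1] = r6                  # SW    R6,0(R1)
    return mem


image = {0: 7, 8: 42}             # hypothetical memory; R1 = 0, R2 = 8
same_when_equal = branching(dict(image), 0, 8, 7) == predicated(dict(image), 0, 8, 7)
same_when_different = branching(dict(image), 0, 8, 3) == predicated(dict(image), 0, 8, 3)
print(same_when_equal, same_when_different)
```

The predicated version contains no branch at all, so there is nothing to mispredict; the price is that the conditional load occupies a slot on every execution, whether or not it does useful work.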


What you should know by now
• VLIW architectures
– Parallel simple pipelines
– No support for dynamic scheduling
– Assumes compiler does static scheduling

• VLIW and loop unrolling
– Challenge is to fill operation slots
• VLIW and software pipelining
– Renaming using rotating registers
– Impact of slot conflicts
– Impact of loop-carried dependencies
• Trace scheduling
• Predicated (conditional) instructions
