4 MultiIssue 2024
Jie Zhang
[Adapted from EE488, Myoungsoo Jung, KAIST]
Recall: Performance w/ Pipelining
ØWe finish one instruction per cycle
• After the initial warm-up period
ØInstruction throughput
• Latch at end of each stage adds latency
• Longest stage determines clock cycle time
• Example:
Peking University
Then, How about Using Deeper Pipelines?
Ø Called Super-pipelining
Ø Goal: Speedup ~ number of stages
Ø Idea: let’s have a lot of stages
• Vaguely defined as deep pipelining
• How about > 30? Think Pentium 4.
Ø What does this mean?
• By splitting a stage into multiple sub-stages, the machine can issue a new instruction every minor cycle
• If one splits each stage into “m” sub-stages, the clock cycle period “T” should be reduced to “T/m”
[Figure: each stage of period T (Fetch, Decode, Inst) is split into sub-stages of period T/m (F1 F2, D1 D2, E1 E2).]
Ideal Case of Super-pipelining
Ø The super-pipeline produces a result every T/m (one minor cycle)
Instr. i     F1 F2 D1 D2 E1 E2
Instr. i+1      F1 F2 D1 D2 E1 E2
Instr. i+2         F1 F2 D1 D2 E1 E2
Instr. i+3            F1 F2 D1 D2 E1 E2
(each instruction starts T/m after the previous one; the overlap is the super-pipelining gain)
There is an Optimal Number of Stages
ØFor a given architecture and associated instruction set
ØAlso need to consider workload characteristics
Ø Diminishing returns: increasing the number of stages over this limit (around ~15 stages) reduces the overall performance
[Figure: pipelined operation vs. parallel operation of an instruction through EX1, EX2, EX3.]
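A rough way to see why an optimal depth exists is a toy model. The sketch below is not from the lecture: it assumes every stage pays a fixed latch overhead and that hazards add stall cycles in proportion to the pipeline depth. All names and numbers (`logic_delay_ns`, `latch_ns`, `stall_per_stage`) are hypothetical.

```python
# Toy model of pipeline depth vs. performance (all numbers are hypothetical).

def cycle_time(logic_delay_ns, stages, latch_ns):
    """Clock period when the logic is split evenly into `stages` stages."""
    return logic_delay_ns / stages + latch_ns


def time_per_instruction(logic_delay_ns, stages, latch_ns, stall_per_stage):
    """Average time per instruction.

    Assumption: hazards add `stall_per_stage * stages` stall cycles per
    instruction on average, so CPI grows linearly with the depth.
    """
    cpi = 1.0 + stall_per_stage * stages
    return cpi * cycle_time(logic_delay_ns, stages, latch_ns)


def best_depth(logic_delay_ns, latch_ns, stall_per_stage, max_stages=60):
    """Depth with the smallest average time per instruction."""
    return min(
        range(1, max_stages + 1),
        key=lambda m: time_per_instruction(
            logic_delay_ns, m, latch_ns, stall_per_stage
        ),
    )


if __name__ == "__main__":
    T, LATCH, STALL = 10.0, 0.5, 0.05
    for m in (1, 5, 10, 20, 40):
        print(m, round(time_per_instruction(T, m, LATCH, STALL), 3))
    print("best depth:", best_depth(T, LATCH, STALL))
```

With these made-up parameters the minimum falls at 20 stages; the position of the optimum depends entirely on the assumed latch overhead and stall rate, so it should not be read as a statement about any real processor.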
Super-scalar
Dynamic multiple-issue processors (decision making at run time by the hardware)
Dynamic Multiple Issue Machine (Superscalar)
Ø Superscalar architectures allow several instructions to be issued and completed per clock cycle
Ø A superscalar architecture consists of a number of pipelines that work in parallel (N-way Superscalar)
• Can issue up to N instructions per cycle
[Figure: superscalar datapath with PC, instruction memory, multi-port register file, multiple ALUs, and multi-port data memory.]
Super-pipelining vs. Super-scalar
Ø Note that super-pipelining is orthogonal to super-scalar
Super-pipelining:
Instr. i     F1 F2 D1 D2 E1 E2
Instr. i+1      F1 F2 D1 D2 E1 E2
Instr. i+2         F1 F2 D1 D2 E1 E2
Instr. i+3            F1 F2 D1 D2 E1 E2
2-way super-scalar (parallel execution):
Instr. i     Fetch Decode Inst
Instr. i+1   Fetch Decode Inst
Instr. i+2         Fetch  Decode Inst
Instr. i+3         Fetch  Decode Inst
Pipelining vs. Superscalar
Ø Superscalar example: sum of array elements
(for brevity, assume that a load-to-use dependence takes only one cycle)
[Figure: C program and its assembly code (loop: ld $r2, 10($r1) ...), marking the instructions that cannot execute in parallel.]
From Sequential Instructions to Parallel Execution
Q1. What determines the ability of a superscalar processor to execute instructions in parallel?
A. The number and nature of parallel pipelines
B. The mechanism that the processor uses to find independent instructions (which can be executed in parallel)
Execute in Parallel
But Make Sure Sequential Order
Ø Execute and complete instructions in their sequential order (but with little chance to execute in parallel)
Ø To improve parallelism, a superscalar processor has to look ahead and try to find independent instructions to execute in parallel
Example Scenario of Parallel Execution Policy
Ø We consider the following instruction sequence:
I1: ADDF R12,R13,R14 R12 ← R13 + R14 (float. pnt.)
I2: ADD R1,R8,R9 R1 ← R8 + R9
I3: MUL R4,R2,R3 R4 ← R2 * R3
I4: MUL R5,R6,R7 R5 ← R6 * R7
I5: ADD R10,R5,R7 R10 ← R5 + R7
I6: ADD R11,R2,R3 R11 ← R2 + R3
Ø Assumption:
• I1 requires two cycles to execute
• I3 and I4 are in conflict for the same functional unit
• I5 depends on the value produced by I4 (we have a true data dependency between I4 and I5);
• I2, I5 and I6 are in conflict for the same functional unit
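The assumptions above can be checked mechanically. The following sketch (names such as `Instr` and `true_dependencies` are illustrative, not from the slides) encodes the six instructions and reports the read-after-write pairs and the instructions that compete for the same functional unit, assuming one unit per opcode.

```python
from collections import namedtuple

# name, opcode (also used as the functional-unit name), destination, sources
Instr = namedtuple("Instr", "name op dest srcs")

PROGRAM = [
    Instr("I1", "ADDF", "R12", ("R13", "R14")),
    Instr("I2", "ADD", "R1", ("R8", "R9")),
    Instr("I3", "MUL", "R4", ("R2", "R3")),
    Instr("I4", "MUL", "R5", ("R6", "R7")),
    Instr("I5", "ADD", "R10", ("R5", "R7")),
    Instr("I6", "ADD", "R11", ("R2", "R3")),
]


def true_dependencies(program):
    """Return (producer, consumer) pairs: consumer reads what producer wrote."""
    last_writer = {}
    deps = []
    for ins in program:
        for src in ins.srcs:
            if src in last_writer:
                deps.append((last_writer[src], ins.name))
        last_writer[ins.dest] = ins.name
    return deps


def unit_conflicts(program):
    """Group instructions that need the same functional unit."""
    groups = {}
    for ins in program:
        groups.setdefault(ins.op, []).append(ins.name)
    return {unit: names for unit, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    print(true_dependencies(PROGRAM))
    print(unit_conflicts(PROGRAM))
```

The only true dependency found is I4 to I5, and the conflict groups are {I2, I5, I6} on the ADD unit and {I3, I4} on the MUL unit, which matches the assumption list.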
Example Scenario of Parallel Execution Policy
Ø Parallel execution policy 1
• Issue: In-Order & Completion: In-Order
• Instructions are issued in the exact order that would correspond to sequential execution; results are written (completion) in the same order
Ø Parallel execution policy 2
• Issue: Out-of-Order & Completion: Out-of-Order
• With out-of-order issue, the processor looks ahead in the set of decoded instructions and issues any instruction, in any order, as long as the program execution is correct
Parallel Execution Policy 1
In-Order Issue with In-Order Completion
I1: ADDF $R12, $R13, $R14
I2: ADD $R1, $R8, $R9
I3: MUL $R4, $R2, $R3
I4: MUL $R5, $R6, $R7
I5: ADD $R10, $R5, $R7
I6: ADD $R11, $R2, $R3
Considerations:
• I1: two cycles to execute
• I2, I5, I6: same functional unit
• I3, I4: same functional unit
• I5: true data dependency on I4
Cycle  Decode/Issue  Execute                          Writeback/Complete
                     ADDF unit  ADD unit  MUL unit
1      I1 I2
2      I3 I4         I1         I2
3      I5 I6         I1
4                                         I3          I1 I2
5                                         I4          I3
6                               I5                    I4
7                               I6                    I5
8                                                     I6
Ø An instruction cannot be issued before the previous one has been issued
Ø An instruction completes only after the previous one has completed
Parallelism Depends on the Program
In-Order Issue with In-Order Completion
Ø The processor detects and handles (by stalling) true data dependencies and resource conflicts.
Ø As instructions are issued and completed in their strict order, the resulting parallelism depends very much on how the program is written/compiled.
Rewrite Code for Better Parallelism
In-Order Issue with In-Order Completion
I1: ADDF $R12, $R13, $R14
I2: ADD $R1, $R8, $R9
I6: ADD $R11, $R2, $R3
I4: MUL $R5, $R6, $R7
I5: ADD $R10, $R5, $R7
I3: MUL $R4, $R2, $R3
Considerations:
• I1: two cycles to execute
• I2, I5, I6: same functional unit
• I3, I4: same functional unit
• I5: true data dependency on I4
Parallel Execution Policy 2
Out-of-Order Issue with Out-of-Order Completion
I1: ADDF $R12, $R13, $R14
I2: ADD $R1, $R8, $R9
I3: MUL $R4, $R2, $R3
I4: MUL $R5, $R6, $R7
I5: ADD $R10, $R5, $R7
I6: ADD $R11, $R2, $R3
Considerations:
• I1: two cycles to execute
• I2, I5, I6: same functional unit
• I3, I4: same functional unit
• I5: true data dependency on I4
Cycle  Decode/Issue  Execute                          Writeback/Complete
                     ADDF unit  ADD unit  MUL unit
1      I1 I2
2      I3 I4         I1         I2
3      I5 I6         I1                   I3          I2
4      I5                       I6        I4          I1 I3
5                               I5                    I4 I6
6                                                     I5
7
8
Ø Out-of-order issue: an instruction does not need to wait until I1 is executed
Ø The true dependency still has to be resolved (I5 waits for I4)
Ø Out-of-order completion: similar to issue, an instruction does not need to wait until I1 is completed
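The out-of-order table can be reproduced with a small greedy scheduler. This is a simplified sketch under stated assumptions, not the lecture's model: two instructions are decoded per cycle, each functional unit handles one instruction at a time, an instruction may start executing the cycle after it is decoded, and writeback happens the cycle after execution ends. The names `Instr` and `out_of_order_schedule` are made up for illustration.

```python
from collections import namedtuple

# name, functional unit, execution latency in cycles, names of producers
Instr = namedtuple("Instr", "name unit latency deps")

PROGRAM = [
    Instr("I1", "ADDF", 2, ()),
    Instr("I2", "ADD", 1, ()),
    Instr("I3", "MUL", 1, ()),
    Instr("I4", "MUL", 1, ()),
    Instr("I5", "ADD", 1, ("I4",)),
    Instr("I6", "ADD", 1, ()),
]


def out_of_order_schedule(program, decode_width=2):
    """Return (decode, start, writeback) cycle numbers per instruction."""
    decode = {ins.name: i // decode_width + 1 for i, ins in enumerate(program)}
    start, finish, unit_free_at = {}, {}, {}
    cycle = 1
    while len(start) < len(program):
        cycle += 1
        for ins in program:
            # Skip instructions already started or not yet decoded.
            if ins.name in start or decode[ins.name] >= cycle:
                continue
            # Structural hazard: the unit is still busy this cycle.
            if unit_free_at.get(ins.unit, 0) > cycle:
                continue
            # True dependency: every producer must have finished earlier.
            if any(finish.get(dep, cycle) >= cycle for dep in ins.deps):
                continue
            start[ins.name] = cycle
            finish[ins.name] = cycle + ins.latency - 1
            unit_free_at[ins.unit] = cycle + ins.latency
    writeback = {name: finish[name] + 1 for name in finish}
    return decode, start, writeback


if __name__ == "__main__":
    decode, start, writeback = out_of_order_schedule(PROGRAM)
    for ins in PROGRAM:
        print(ins.name, decode[ins.name], start[ins.name], writeback[ins.name])
```

Under these assumptions the last writeback (I5) lands in cycle 6, compared with cycle 8 for the in-order policy.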
Challenges and Considerations of Superscalar
Ø No free lunch! There are dependencies!
Ø Must check dependences (constraints) for all instructions, which are
• Simultaneously decoded
• In-progress in the pipeline (e.g., previously issued)
Superscalar Architecture
Ø Putting it all together: a more detailed superscalar architecture
[Figure: block diagram with Instr$, Fetch & Branch pred., Decode & Rename & Dispatch, Instr. Window (queues, reservation stations, etc.), Instruction issuing, Integer units, Addr. Calc. & Memory unit, Register Files, Commit.]
Ancient Superscalar Architecture
Ø PowerPC 6XX
• Six independent execution units:
• Branch execution unit
• Load/store unit
• Three integer units
• Floating point unit
• Out-of-order issue
ØPentium I/II
• P-I: Three independent units
• P-II: out-of-order, five instructions can be issued in a cycle
VLIW
Static multiple-issue processors (decision making at compile time by the compiler)
VLIW: Very Long Instruction Word
ØKey Idea: Replace a traditional sequential ISA with a new ISA that enables
the compiler to encode instruction-level parallelism (ILP) directly in the
hardware/software interface
[Figure: C program -> VLIW compiler (find independent operations, schedule operations) -> VLIW ISA -> VLIW processor (direct execution).]
C program:
for(i=0;i<n;i++)
    dest[i]=src[i]*coeff;
Ø Sub-instructions within a long instruction must be independent
Ø Multiple “sub-instructions” can be packed into one long instruction
Ø Each “slot” in a VLIW instruction is for a specific functional unit
VLIW Hardware (TinyRV1 VLIW Processor)
VLIW instruction (one slot per pipe):
• Y-pipe: mul x1, x2, x3
• X-pipe: add x4, x1, x5
• L-pipe: lw x6, 0(x7)
• S-pipe: sw x6, 0(x8)
[Figure: stages F and D feed four pipes (Y0-Y3, X0, L0-L1, S0-S1), followed by W.]
Ø Key Ideas:
• Get rid of control flow
• Predicated execution, loop unrolling
• Optimize frequently executed code-paths
• Trace scheduling
• Others: Software pipelining
Compile Technique1: Loop Unrolling
Ø Key idea: Unroll loop to perform M iterations at once
Ø Limitations: code growth; does not handle inter-iteration dependences
Original C code:
for (i=0; i<N; i++)
    B[i] = A[i] + C;
Optimized C code (after loop unrolling by the VLIW compiler):
for (i=0; i<N; i+=4) {
    B[i]   = A[i]   + C;
    B[i+1] = A[i+1] + C;
    B[i+2] = A[i+2] + C;
    B[i+3] = A[i+3] + C;
}
Compile Technique2: Predicated Execution
ØKey idea: Eliminate hard-to-predict branches by converting
control dependence to data dependence
Original C code:
if (cond) {
    b = 0;
}
else {
    b = 1;
}
Normal branch code (CFG: A branches to B or C, which join at D; T = taken, N = not taken):
A:      p1 = (cond)
        branch p1, TARGET
B:      mov b, 1
        jmp JOIN
TARGET:
C:      mov b, 0
Predicated code (straight line A, B, C, D):
A:      p1 = (cond)
B:      (!p1) mov b, 1
C:      (p1) mov b, 0
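The conversion from control dependence to data dependence can be mimicked in software. The sketch below is only an analogy: `predicated_mov` is a hypothetical helper that models a guarded move, where the destination changes only if the predicate is true. Both moves are always "executed", and the predicate decides which one takes effect.

```python
def branch_version(cond):
    """Normal branch code: only one of the two assignments is executed."""
    if cond:
        b = 0
    else:
        b = 1
    return b


def predicated_mov(predicate, old_value, new_value):
    """Model of '(p) mov dst, new': dst keeps old_value when p is false."""
    return new_value if predicate else old_value


def predicated_version(cond):
    """Predicated code: both moves are issued, no branch on `cond`."""
    p1 = bool(cond)
    b = None                           # b is undefined before the moves
    b = predicated_mov(not p1, b, 1)   # (!p1) mov b, 1
    b = predicated_mov(p1, b, 0)       # (p1)  mov b, 0
    return b


if __name__ == "__main__":
    for cond in (True, False):
        print(cond, branch_version(cond), predicated_version(cond))
```

Both versions return the same value for either outcome of `cond`; the predicated form simply always pays for two moves.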
Compile Technique2: Predicated Execution
ØKey idea: Eliminate hard-to-predict branches by converting
control dependence to data dependence
Ø Limitations: reduces performance if the misprediction cost avoided is smaller than the cost of executing the extra predicated instructions
[Figure: pipeline diagrams (Fetch, Decode, Rename, Schedule, RegisterRead, Execute) for blocks A-F. Predicated execution: the unused path flows through the pipeline as nops. Branch prediction: a misprediction causes a pipeline flush.]
CISC vs RISC vs SuperScalar vs VLIW
               CISC             RISC          Superscalar   VLIW
Instr size     variable size    fixed size    fixed size    fixed size (but large)
Instr format   variable format  fixed format  fixed format  fixed format
Static Scheduling
Recall: Data-Dependence Stalls
Ø Previously, we tried to reduce the program execution time with
• Single-issue pipeline (Lecture 3): functional units with a low-complexity implementation
• Multiple-issue; Superscalar (Lecture 4): functional units with a high-complexity implementation
Ø But, there are limits due to data dependencies
Static Scheduling
Ø Statically schedule instructions from the compiler's angle (for data dependencies)
[Figure: static scheduling happens at compile time (unscheduled program -> static scheduler -> functional units); dynamic scheduling happens at run time (unscheduled program -> dynamic scheduler -> functional units).]
What is a Compiler?
Ø A compiler translates a program written in a high-level language into an equivalent program in a target language
Performance Impacts of Compiler
ØCompiler optimizations may improve performance significantly
Compiler Optimization = Graph Problem!
Ø The input of the optimization process is a control flow graph (CFG)
Ø A directed graph where
• Each node represents a statement
• Edges represent control flow
Source code:
x:= a+b;
y:= a*b;
while (y>a) {
    a:=a+1;
    x:=a+b;
}
CFG generation (one node per statement): x:=a+b -> y:=a*b -> y>a -> a:=a+1 -> x:=a+b -> back to y>a (loop)
CFG variation with basic blocks: {x:=a+b; y:=a*b} -> {y>a} -> {a:=a+1; x:=a+b} -> back to {y>a}
Ø Basic blocks: a sequence of instructions w/ no branches into or out of the block
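Basic blocks can be found with the usual "leader" rule: the first instruction, every branch target, and every instruction that follows a branch start a new block. The sketch below applies it to a lowered form of the while loop; the instruction encoding (text plus an optional target index) and the explicit `goto` statements are assumptions made for the example.

```python
def find_basic_blocks(program):
    """Split a program into basic blocks.

    `program` is a list of (text, target) pairs, where `target` is the index
    jumped to by a branch/jump, or None for ordinary statements.
    """
    leaders = {0}
    for index, (_, target) in enumerate(program):
        if target is not None:
            leaders.add(target)              # a branch target starts a block
            if index + 1 < len(program):
                leaders.add(index + 1)       # so does the instruction after a branch
    ordered = sorted(leaders)
    bounds = zip(ordered, ordered[1:] + [len(program)])
    return [[text for text, _ in program[lo:hi]] for lo, hi in bounds]


# Hypothetical lowering of: x:=a+b; y:=a*b; while (y>a) { a:=a+1; x:=a+b; }
PROGRAM = [
    ("x := a+b", None),
    ("y := a*b", None),
    ("if not (y > a) goto exit", 6),
    ("a := a+1", None),
    ("x := a+b", None),
    ("goto loop test", 2),
    ("exit", None),
]

if __name__ == "__main__":
    for block in find_basic_blocks(PROGRAM):
        print(block)
```

The result has four blocks: the two initial assignments, the loop test, the loop body, and the exit, which mirrors the basic-block variant of the CFG.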
Simple Loop Example
Simple loop:
for(i=1; i<=1000; i++)
    x[i]=x[i] + s;
Compilation w/ a vanilla compiler:
Loop: L.D    F0,0(R1)    ;F0=array el.
      ADD.D  F4,F0,F2    ;add scalar in F2
      S.D    F4,0(R1)    ;store result
      SUBI   R1,R1,#8    ;decrement pointer
      BNEZ   R1,R2,Loop  ;branch
Our machine specification:
Instruction producing result   Instruction using the result   Delay in clock cycles
FP ALU op                      Another FP ALU op               3
FP ALU op                      Store double                    2
Load double                    FP ALU op                       1
Load double                    Store double                    0
Integer op                     Integer op                      0
FP ALU op                      Branch                          1
Branch                                                         1
Execution in the machine:
1. Loop: LD F0,0(R1)
2. Stall
3. ADDD F4,F0,F2
4. Stall
5. Stall
6. SD F4,0(R1)
7. SUBI R1,R1,#8
8. Stall
9. BNEZ R1,R2,Loop
10. Stall
10 clocks per iteration (5 stalls)
=> Can we rewrite the code to minimize stalls?
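The stall count can be derived from the latency table. The sketch below is one possible model, with two assumptions that are not stated on the slide: the one-cycle delay before the branch is applied to the integer op (SUBI) that feeds it, since that is where the stall in cycle 8 appears, and the final stall is a one-cycle branch delay. The names `LATENCY`, `LOOP`, and `issue_cycles` are illustrative.

```python
# (producer kind, consumer kind) -> stall cycles between them
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,   # assumption: explains the stall before BNEZ
}
BRANCH_DELAY = 1            # assumption: one stall cycle after the branch

# (text, kind, index of the producing instruction or None)
LOOP = [
    ("L.D   F0,0(R1)", "load", None),
    ("ADD.D F4,F0,F2", "fp_alu", 0),
    ("S.D   F4,0(R1)", "store", 1),
    ("SUBI  R1,R1,#8", "int", None),
    ("BNEZ  R1,R2,Loop", "branch", 3),
]


def issue_cycles(loop):
    """Cycle in which each instruction issues on a single-issue pipeline."""
    cycles = []
    for _, kind, producer in loop:
        earliest = cycles[-1] + 1 if cycles else 1
        if producer is not None:
            gap = LATENCY[(loop[producer][1], kind)]
            earliest = max(earliest, cycles[producer] + 1 + gap)
        cycles.append(earliest)
    return cycles


def clocks_per_iteration(loop):
    return issue_cycles(loop)[-1] + BRANCH_DELAY


if __name__ == "__main__":
    print(issue_cycles(LOOP))
    print(clocks_per_iteration(LOOP), "clocks,",
          clocks_per_iteration(LOOP) - len(LOOP), "stalls")
```

Under these assumptions the instructions issue in cycles 1, 3, 6, 7, and 9, giving 10 clocks per iteration with 5 stalls.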
Scheduled Loop Body (with CFG)
[Figure: CFG of the loop body by clock cycle (LD F0,0(R1); ...; SD F4,0(R1); SUBI R1,R1,#8; BNEZ R1,R2,Loop), before and after compiler optimization.]
Ø Before: 5 stalls
Ø After compiler optimization: 1 stall, utilizing the delayed branch slot
Goal of Multi-Issue Scheduling
ØPlace as many independent instructions in sequence
• “as many” up to execution bandwidth
• Don’t need 7 independent instructions on a 3-wide machine
• Avoid pipeline stalls
Why Should This Work?
Ø Compiler has “all the time in the world” to analyze instructions
• Hardware must do it in < 1ns
Ø Compiler can “see” a lot more
• Compiler can do complex inter-procedural analysis, understand high-level behavior of code and programming language
• Hardware can only see a small number of instructions at a time
[Figure: static detection & resolution of dependencies (compiler): the ILP compiler examines the input code through a software window. Dynamic detection & resolution of dependencies (HW): the instruction issue unit of the processor examines the input code through an execution/issue window.]
Why might this not work?
Ø Can’t always schedule around branches
• limited access to dynamic information (profile-based info)
• Perhaps none at all, or not representative
• Ex. Branch T in 1st ½ of program, NT in 2nd ½, looks like 50-50 branch in profile
Ø Not all stalls are predictable
• Cannot react to dynamic events like data cache misses
Conventional Optimization Techniques
Execution Time = IC * CPI * CCT
IC: Instruction Count
CPI: Cycles per Instruction
CCT: Clock Cycle Time
We mainly focus on Instruction Count!
Technique1: Register Renaming
ØObservation1: weird register allocation
• Largely limited by architected registers
• Could possibly cause more spills/fills
ØObservation2: Dynamic Dead Code (branches)
• Code motion may be limited
[Figure: R1=R2+R3; BEQZ R9; on one path R1=LOAD 0[R6] followed by R5=R1-R4, with the LOAD destination shown renamed to R8.]
• Need to allocate registers differently
• Moving the LOAD causes unnecessary execution of LOAD when the branch goes left
Register Renaming & Scheduling
Ø Same functionality, no stalls
Original code:
A: R1 = R2 + R3
B: R4 = R1 – R5
C: R1 = LOAD 0[R7]
D: R2 = R1 + R6
E: R6 = R3 + R5
F: R5 = R6 – R4
After renaming (C and E no longer need to reuse the same registers):
A: R1 = R2 + R3
B: R4 = R1 – R5
C’: R8 = LOAD 0[R7]
D’: R2 = R8 + R6
E’: R9 = R3 + R5
F’: R5 = R9 – R4
After scheduling to remove the stall:
A: R1 = R2 + R3
C’: R8 = LOAD 0[R7]
B: R4 = R1 – R5
E’: R9 = R3 + R5
D’: R2 = R8 + R6
F’: R5 = R9 – R4
CFG view after renaming and scheduling: (A, C’) -> (B, E’) -> (D’, F)
Technique2: Loop Unrolling
ØTransforms an M-iterations loop into a loop with M/N iterations
ØWe say that the loop has been unrolled N times
Ø Some compilers can do this (gcc -funroll-loops) or you can do it manually (below)
Original loop:
for(i=0;i<100;i++)
    a[i]*=2;
Unrolled 4 times:
for(i=0;i<100;i+=4){
    a[i]*=2;
    a[i+1]*=2;
    a[i+2]*=2;
    a[i+3]*=2;
}
Why Loop Unrolling? (1)
Ø Get rid of small loops
Original loop:
for(i=0;i<4;i++)
    a[i]*=2;
Fully unrolled:
a[0]*=2;
a[1]*=2;
a[2]*=2;
a[3]*=2;
Ø Loop form (blocks for(0), for(1), for(2), for(3)): difficult to schedule/hoist insts from bottom block to top block due to branches
Ø Unrolled form: easier, no branches in the way
Why Loop Unrolling? (2)
Ø Less loop overhead
Ø Allow better scheduling of instructions
Original loop body (executed 4 times, i.e. 4 branches):
Loop: L.D    F0,0(R1)
      ADD.D  F0,F0,F2
      S.D    F0,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1, R2, Loop
Unrolled 4 times (4 branches -> 1 branch):
Loop: L.D    F0,0(R1)
      ADD.D  F0,F0,F2
      S.D    F0,0(R1)
      L.D    F0,-8(R1)
      ADD.D  F0,F0,F2
      S.D    F0,-8(R1)
      L.D    F0,-16(R1)
      ADD.D  F0,F0,F2
      S.D    F0,-16(R1)
      L.D    F0,-24(R1)
      ADD.D  F0,F0,F2
      S.D    F0,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1, R2, Loop
Loop Unrolling: Problems
Ø Program size becomes larger (code bloat)
Q1. What if the iteration count is not a multiple of the unroll factor?
Q2. Or what if the iteration count is unknown at compile time?
Q3. Or what if it is a while loop?
Original loop:
for(i=0;i<j;i++)
    a[i]*=2;
Unrolled loop with a remainder loop:
j1=j-j%4;
for(i=0;i<j1;i+=4)      // unrolled part: runs up to the largest multiple of 4
{
    a[i]*=2;
    a[i+1]*=2;
    a[i+2]*=2;
    a[i+3]*=2;
}
for(i=j1;i<j;i++)       // the remaining iterations are left for another for loop
    a[i]*=2;
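The remainder-loop pattern is easy to check for every trip count. The sketch below mirrors the C code in Python (where unrolling brings no speed benefit; the point is only the index arithmetic). Function names are made up for the example.

```python
def scale_simple(a):
    """Reference loop: one element per iteration."""
    for i in range(len(a)):
        a[i] *= 2
    return a


def scale_unrolled_by_4(a):
    """Unrolled by 4, with a remainder loop for the leftover iterations."""
    j = len(a)
    j1 = j - j % 4                 # largest multiple of 4 not above j
    for i in range(0, j1, 4):      # unrolled part
        a[i] *= 2
        a[i + 1] *= 2
        a[i + 2] *= 2
        a[i + 3] *= 2
    for i in range(j1, j):         # remainder: at most 3 iterations
        a[i] *= 2
    return a


if __name__ == "__main__":
    for n in range(10):
        assert scale_unrolled_by_4(list(range(n))) == scale_simple(list(range(n)))
    print("unrolled and simple loops agree for lengths 0..9")
```

Because `j1` is computed at run time, the same code works when the trip count is unknown at compile time, at the cost of the extra remainder loop.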
Technique3: Function Inlining
ØGoal: sort of like “unrolling” a function
ØProblems: primarily code bloat
Technique4: Tree Height Reduction
Ø Goal: shorten critical path(s) using the associativity law
Ø Limitations: not all math operations are associative!
Ø C defines L-to-R semantics for most arithmetic
Before: R8=((R2+R3)+R4)+R5 (dependence chain I1 -> I2 -> I3)
I1: ADD R6,R2,R3
I2: ADD R7,R6,R4
I3: ADD R8,R7,R5
After applying associativity: R8=(R2+R3)+(R4+R5) (I1 and I2 are independent; both feed I3)
I1: ADD R6,R2,R3
I2: ADD R7,R4,R5
I3: ADD R8,R7,R6
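A small sketch of the trade-off: the depth of a left-to-right chain grows linearly with the number of operands, while a balanced tree grows logarithmically. The last lines illustrate the stated limitation with floating-point addition, which is not associative. Function names are illustrative.

```python
import math


def chain_depth(num_operands):
    """Critical path (in adds) of ((a+b)+c)+d+... evaluated left to right."""
    return num_operands - 1


def balanced_depth(num_operands):
    """Critical path (in adds) of a balanced addition tree."""
    return math.ceil(math.log2(num_operands))


def left_to_right(r2, r3, r4, r5):
    return ((r2 + r3) + r4) + r5


def tree_reduced(r2, r3, r4, r5):
    return (r2 + r3) + (r4 + r5)


if __name__ == "__main__":
    print(chain_depth(4), balanced_depth(4))
    # Integer addition is associative, so both forms agree.
    print(left_to_right(1, 2, 3, 4) == tree_reduced(1, 2, 3, 4))
    # Floating-point addition is not: regrouping can change the result.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))
```

For four operands the critical path drops from 3 adds to 2. The floating-point comparison evaluates to False, which is why a compiler cannot apply this transformation freely to floating-point code.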
Optimization Techniques for ILP
Execution Time = IC * CPI * CCT
IC: Instruction Count
CPI: Cycles per Instruction
CCT: Clock Cycle Time
We mainly focus on Cycles/Instruction!
Scheduling for Inst-Level Parallelism
Ø Goal: Schedule instructions to finish the whole program as soon as possible
Ø Two types of static ILP scheduling
• Global Scheduling. Target: DAG (Directed Acyclic Graph) of a general purpose integer program with many conditional branches
• Software Pipelining. Target: loop in any code
Technique1: Global Scheduling
Q. What is the general size of a basic block?
A. In general, the basic block size of a non-numeric computation program is 5~20 instructions
=> Not enough instructions that can be processed in parallel
[REMIND] Basic blocks: a sequence of instructions w/ no branches into or out of the block (e.g., x:= a+b; y:=a*b)
Q. Why global scheduling?
A. A single basic block is too small to find enough instructions that can be processed in parallel
Trace-based Scheduling
Ø This is one technique for global scheduling
• Works on all code, not just loops
• Take an execution trace of the common case
• Schedule code as if it had no branches
• Check branch condition when convenient
• If mispredicted, clean up the mess
Example of Trace Scheduling
Original code:
a=log(x);
if(b>0.01)       // 90%
{
    c=a/b;
}else{           // 10%
    c=0;
}
y=sin(c);
Trace-scheduled code:
a=log(x);
c=a/b;           // 90%: scheduled as if there were no branch
y=sin(c);
if(b<=0.01)      // 10%
    goto fixit;
fixit:
c=0;
y=0; // sin(0)
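The two versions can be compared directly. In the sketch below the common path is executed speculatively and the rare case is repaired afterwards. One assumption is worth stressing: the speculated divide must not fault, so the example only uses non-zero `b` (a real compiler would need a non-faulting operation or a guard here).

```python
import math


def original(x, b):
    a = math.log(x)
    if b > 0.01:          # common case (90%)
        c = a / b
    else:                 # rare case (10%)
        c = 0
    return math.sin(c)


def trace_scheduled(x, b):
    a = math.log(x)
    c = a / b             # speculated: assumes b != 0 so the divide cannot fault
    y = math.sin(c)
    if b <= 0.01:         # check the branch condition late; fix up if mispredicted
        c = 0
        y = 0.0           # sin(0)
    return y


if __name__ == "__main__":
    for b in (0.5, 0.005):
        print(b, original(2.0, b), trace_scheduled(2.0, b))
```

Both functions return the same value on either path; the fix-up path simply throws away the speculated work.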
Pay Attention to Cost of Fixing
Ø [REMIND] Amdahl’s law
\[ \text{System performance} = \frac{1}{(1 - P) + \frac{P}{S}} \]
P: Fraction of enhanced component
S: Speedup of enhanced component
• Assume the code for b > 0.01 accounts for 80% of the time
• Optimized trace runs 15% faster
\[ \frac{1}{(1 - 0.8) + \frac{0.8}{1.15}} = 1.117 \quad (11.7\%) \]
• But, fix-up code may cause the remaining 20% of the time to be even slower!
• Assume fixup code is 30% slower
\[ \frac{1}{(1 - 0.8) \times 1.3 + \frac{0.8}{1.15}} = 1.046 \quad (4.6\%) \]
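The two numbers can be reproduced with a few lines of arithmetic. The function names are illustrative; `fixup_slowdown` is the factor by which the non-enhanced part gets slower (1.3 for "30% slower").

```python
def amdahl_speedup(p, s):
    """Amdahl's law: fraction p of the time is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


def speedup_with_fixup(p, s, fixup_slowdown):
    """Same, but the remaining (1 - p) part becomes slower by fixup_slowdown."""
    return 1.0 / ((1.0 - p) * fixup_slowdown + p / s)


if __name__ == "__main__":
    print(round(amdahl_speedup(0.8, 1.15), 3))
    print(round(speedup_with_fixup(0.8, 1.15, 1.3), 3))
```

The fix-up cost cuts the overall gain from about 11.7% to about 4.6%, even though it only affects the 20% of the time spent off the trace.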
Ø Only the shaded part, the loop kernel, involves executing the full width of the VLIW instruction.
• The loop prolog and epilog contain only a subset of the instructions.
• “Ramp up” and “ramp down” of the parallelism.
Scheduling Loop Unrolled Code
Unroll 4 ways:
loop: ld f1, 0(r1)
      ld f2, 8(r1)
      ld f3, 16(r1)
      ld f4, 24(r1)
      add r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd f5, 0(r2)
      sd f6, 8(r2)
      sd f7, 16(r2)
      sd f8, 24(r2)
      add r2, 32
      bne r1, r3, loop
Assumptions: ld: 2 cycles; fadd: 3 cycles
Schedule (one row per VLIW instruction):
Row  Int1    Int2  M1     M2     FP+      FPx
1                  ld f1
2                  ld f2
3                  ld f3
4    add r1        ld f4         fadd f5
5                                fadd f6
6                                fadd f7
7                                fadd f8
8                         sd f5
9                         sd f6
10                        sd f7
11   add r2  bne          sd f8
Software Pipelining
Unroll 4 ways first:
loop: ld f1, 0(r1)
      ld f2, 8(r1)
      ld f3, 16(r1)
      ld f4, 24(r1)
      add r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd f5, 0(r2)
      sd f6, 8(r2)
      sd f7, 16(r2)
      sd f8, 24(r2)
      add r2, 32
      bne r1, r3, loop
Schedule (one row per VLIW instruction):
Row  Int1    Int2  M1     M2     FP+      FPx
1                  ld f1
2                  ld f2
3                  ld f3
4    add r1        ld f4
5                  ld f1         fadd f5
6                  ld f2         fadd f6
7                  ld f3         fadd f7
8    add r1        ld f4         fadd f8
9                  ld f1  sd f5  fadd f5
10                 ld f2  sd f6  fadd f6
11   add r2        ld f3  sd f7  fadd f7
12           bne   ld f4  sd f8  fadd f8
13                        sd f5  fadd f5
14                        sd f6  fadd f6
15   add r2               sd f7  fadd f7
16           bne          sd f8  fadd f8
17                        sd f5
Loop Unrolling vs. Software Pipelining
[Figure: performance vs. time across loop iterations. Loop unrolling: startup overhead and wind-down overhead recur over the run. Software pipelining: one startup and one wind-down for the whole loop.]
Ø Software pipelining pays startup/wind-down costs only once per loop, not once per iteration
Recall: Why Compiler Might not Work
ØCan’t always schedule around branches
• limited access to dynamic information (profile-based info)
• Perhaps none at all, or not representative
• Ex. Branch T in 1st ½ of program, NT in 2nd ½, looks like 50-50 branch in profile
Dynamic Scheduling means Out-of-Order execution (OoO; O3)
Let’s think about the following questions
Q1) How can O3 achieve perf. benefits?
Ø Hardware rearranges the instruction stream to reduce stalls
D: R5 = R2 – 4
E: R7 = Load 20[R5]
F: R4 = R4 – 1
G: BEQ R4, #0
[Figure: dependency graph of instructions A-G and execution timelines under different orderings in the presence of cache misses; the timelines shown take 5, 7, 8, and 10 cycles.]
Q2) Any problems of O3?
Ø Recall: Hazards! Especially for register dependencies
Initial values: R1 = 5, R2 = -2, R3 = 9, R4 = 3 (values below are shown as initial -> after 1st instruction -> after 2nd instruction)
True dependency (Read-After-Write)
A: R1 = R2 + R3
B: R4 = R1 * R4
• In-order (A, B): R1: 5 -> 7 -> 7; R4: 3 -> 3 -> 21
• OoO (B, A): R1: 5 -> 5 -> 7; R4: 3 -> 15 -> 15 (B reads old data)
Anti-dependency (Write-After-Read)
A: R1 = R3 / R4
B: R3 = R2 * R4
• In-order (A, B): R1: 5 -> 3 -> 3; R3: 9 -> 9 -> -6
• OoO (B, A): R1: 5 -> 5 -> -2; R3: 9 -> -6 -> -6 (A reads future data)
Output dependency (Write-After-Write)
A: R1 = R2 + R3
B: R1 = R3 * R4
• In-order (A, B): R1: 5 -> 7 -> 27
• OoO (B, A): R1: 5 -> 27 -> 7 (B’s result will be overwritten by legacy data)
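The three cases can be replayed on a tiny register-file model. The sketch below runs each pair of instructions in program order (A, B) and in swapped order (B, A) from the same initial values; the dictionary-of-lambdas encoding is just a convenience for the example, and integer division stands in for `/`.

```python
INITIAL = {"R1": 5, "R2": -2, "R3": 9, "R4": 3}

CASES = {
    "RAW": {  # A: R1 = R2 + R3 ; B: R4 = R1 * R4
        "A": lambda r: r.update(R1=r["R2"] + r["R3"]),
        "B": lambda r: r.update(R4=r["R1"] * r["R4"]),
    },
    "WAR": {  # A: R1 = R3 / R4 ; B: R3 = R2 * R4
        "A": lambda r: r.update(R1=r["R3"] // r["R4"]),
        "B": lambda r: r.update(R3=r["R2"] * r["R4"]),
    },
    "WAW": {  # A: R1 = R2 + R3 ; B: R1 = R3 * R4
        "A": lambda r: r.update(R1=r["R2"] + r["R3"]),
        "B": lambda r: r.update(R1=r["R3"] * r["R4"]),
    },
}


def run(case, order):
    """Execute the instructions of `case` in the given order on a fresh copy."""
    regs = dict(INITIAL)
    for name in order:
        CASES[case][name](regs)
    return regs


if __name__ == "__main__":
    for case in CASES:
        print(case, "in-order:", run(case, "AB"), "swapped:", run(case, "BA"))
```

In every case the swapped order leaves at least one register with a different final value (R4 = 15 instead of 21, R1 = -2 instead of 3, R1 = 7 instead of 27), which is exactly what the hardware has to prevent.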
Solution: Register Renaming
ØSolution for WAR, WAW hazards; done in hardware
Ø Dependency is on name/location, not on data
MapTable      FreeList       Original insns.   Renamed insns.
r1  r2  r3    p1,p2,p3,...
p1  p2  p3    p4,p5,p6,p7    add r1,r2,r3      add p1,p2,p3
p1  p2  p4    p5,p6,p7       sub r3,r2,r1      sub p4,p2,p1
p1  p2  p5    p6,p7          mul r3,r2,r3      mul p5,p2,p4
p6  p2  p5    p7             div r1,r1,4       div p6,p1,4
Ø Register renaming: the WAW dependency is removed!
Ø Renaming removes WAR/WAW, but leaves RAW intact
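A map table plus free list is enough to sketch the mechanism. The code below is a minimal model, not the lecture's exact table: it assumes the destination is the first operand, that every destination gets a fresh physical register, and that the initial mapping is r1->p1, r2->p2, r3->p3 with p4-p7 free. Physical register numbers therefore depend on these assumptions.

```python
def rename(program, map_table, free_list):
    """Rename (op, dest, src1, src2) tuples using a map table and a free list."""
    map_table = dict(map_table)
    free_list = list(free_list)
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources read the current mapping; immediates are left untouched.
        new_srcs = [map_table.get(src, src) for src in (src1, src2)]
        # The destination gets a fresh physical register.
        new_dest = free_list.pop(0)
        map_table[dest] = new_dest
        renamed.append((op, new_dest, new_srcs[0], new_srcs[1]))
    return renamed, map_table, free_list


PROGRAM = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r3", "r2", "r1"),
    ("mul", "r3", "r2", "r3"),
    ("div", "r1", "r1", "4"),
]

if __name__ == "__main__":
    renamed, final_map, remaining = rename(
        PROGRAM, {"r1": "p1", "r2": "p2", "r3": "p3"}, ["p4", "p5", "p6", "p7"]
    )
    for ins in renamed:
        print(ins)
    print(final_map, remaining)
```

After renaming, no two instructions write the same physical register (WAW and WAR are gone), while each consumer still names the physical register of its producer (RAW is intact): `sub` reads p4 written by `add`, and `mul` reads p5 written by `sub`.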
Q3) How does the O3 work?
Ø Step1: Fetch many instructions into an instruction window
Ø Step2: Rename regs. to avoid false deps. (WAW and WAR)
Ø Step3: Execute instructions as soon as dependencies (registers and memory) are known
[Figure: Static Program -> Fetch -> Dynamic Instruction Stream -> Rename -> Renamed Instruction Stream -> Schedule -> Dynamically Scheduled Instructions.]
Ø Out-of-order = out of the original sequential order
Dynamic Scheduling I: Scoreboard
Ø Let’s track the flow of the instrs, registers, and functional units
• to check which datapath components are in use / can be used
• to find out which instruction could be executed without hazards
The CDC 6600 Project [1964]
First implementation of Scoreboard
• 16 separate non-pipelined functional units (7 int, 4 FP, 5 memory)
• No register bypassing
Dynamic Scheduling: The Big Picture
Ø Instructions are fetched/decoded/renamed into the instruction buffer
Ø Instructions (conceptually) check ready bits every cycle
Original insns.   Renamed insns.
add r1,r2,r3      Inst1: add p4,p2,p3
sub r3,r2,r1      Inst2: sub p5,p2,p4
mul r3,r2,r3      Inst3: mul p6,p2,p5
div r1,4,r1       Inst4: div p7,4,p4
Dependency graph: I1 -> I2 -> I3; I1 -> I4
[Figure: I$ and branch predictor (BP) -> insn buffer -> regfile -> D$.]
Scoreboard Pipeline
Ø A new kind of structural hazard: the instruction buffer is full
[Figure: I$ and branch predictor (BP) -> insn buffer -> regfile -> multiple functional units -> D$, annotated in pipeline order:]
• Allocate a slot in the instruction buffer and dispatch an instruction in-order
• Solve structural hazards & WAW! Check FU and destination reg
• Wait until no data hazards, then read operands
• Execute on multiple functional units
• Solve WAR! Stall until the earlier instruction reads the destination register
Scoreboard Architecture (CDC 6600)
[Figure: scoreboard datapath with registers F0-F10, functional units Integer (1), Mult1 (10), Mult2, Add, Div, the instruction cache (Inst $) holding LD F6 34+ R2, and the scoreboard tables listing LD#2, MULTD, SUBD, DIVD, ADDD.]
Stage #1: Issue (ID1)
Ø At stage #1, the scoreboard checks hazards before issuing an instruction
[Figure: LD F6 34+ R2 moves from the instruction cache to the Integer unit.]
Stage #2: Read Operands (ID2)
Ø At stage #2, SB reads operands when they are available (i.e., no data hazard)
[Figure: scoreboard tables. Insn Status columns: Inst, I, R, X, W. FU Status columns: FU, B, Op, dst, src1, src2, Q1, Q2, R1, R2. Reg Status columns: register (F0-F10), FU.]
Stage #3: Execution (EX)
Ø At stage #3, FU begins execution upon receiving operands
Ø Let’s assume each functional unit takes (m) CPU cycles, where m is specified in the figure: Integer (1), Mult1 (10), Mult2 (10), Add (2), Divide (40)
Instruction sequence:
LD F6 34+ R2
LD F2 45+ R3
mul F0 F2 F4
sub F8 F6 F2
div F10 F0 F6
add F6 F8 F2
[Figure: LD F6 34+ R2 executing on the Integer unit while its remaining-cycle counter goes from 1 to 0.]
Stage #4: Write Back (WB)
Ø At stage #4, FU stalls until there is no WAR hazard with any instructions previously issued
Ø FU will then update the register with the output of the instruction
Three Parts of Scoreboard
Ø Three main components
• Instruction status
• Functional unit status
• Register result status
Part #1: Instruction Status
• Which of 4 steps the instruction is in
• ID1: Issue
• ID2: Read operands
• EX: Execution
• WB: Write back
Part #2: Functional Unit Status
• Indicates the state of the functional unit (FU)
• 9 fields for each FU
• B: Indicates whether the unit is busy or not
• Op: Operation to perform in the unit (e.g., + or -)
• dst: Destination register
• src1, src2: Source-register numbers
• Q1, Q2: Functional units producing source registers src1, src2
• R1, R2: Flags that are set when src1/src2 is ready
Part #3: Register Result Status
• Indicates which functional unit will write each register, if one exists
• Blank when no pending instructions will write that register
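The functional-unit and register-result tables translate naturally into data structures. The following sketch is a simplified model under stated assumptions: it covers the issue check (structural and WAW), the read-operands check (RAW), and a write-back that frees the unit, but it omits the WAR check at write-back and all timing. Class and method names (`FUStatus`, `Scoreboard`, `can_issue`, ...) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    """The nine functional-unit fields: B, Op, dst, src1, src2, Q1, Q2, R1, R2."""
    busy: bool = False
    op: Optional[str] = None
    dst: Optional[str] = None
    src1: Optional[str] = None
    src2: Optional[str] = None
    q1: Optional[str] = None   # FU producing src1, if still pending
    q2: Optional[str] = None   # FU producing src2, if still pending
    r1: bool = False           # src1 ready?
    r2: bool = False           # src2 ready?


class Scoreboard:
    def __init__(self, fu_names):
        self.fu = {name: FUStatus() for name in fu_names}
        self.reg_result = {}   # register -> FU that will write it

    def can_issue(self, fu_name, dst):
        """No structural hazard (FU free) and no WAW hazard (dst not pending)."""
        return not self.fu[fu_name].busy and dst not in self.reg_result

    def issue(self, fu_name, op, dst, src1, src2):
        if not self.can_issue(fu_name, dst):
            return False
        q1 = self.reg_result.get(src1)
        q2 = self.reg_result.get(src2)
        self.fu[fu_name] = FUStatus(True, op, dst, src1, src2,
                                    q1, q2, q1 is None, q2 is None)
        self.reg_result[dst] = fu_name
        return True

    def can_read_operands(self, fu_name):
        """No RAW hazard: both source operands are ready."""
        status = self.fu[fu_name]
        return status.busy and status.r1 and status.r2

    def write_back(self, fu_name):
        """Free the unit and wake up waiting consumers (WAR check omitted)."""
        done = self.fu[fu_name]
        for other in self.fu.values():
            if other.q1 == fu_name:
                other.q1, other.r1 = None, True
            if other.q2 == fu_name:
                other.q2, other.r2 = None, True
        if self.reg_result.get(done.dst) == fu_name:
            del self.reg_result[done.dst]
        self.fu[fu_name] = FUStatus()


if __name__ == "__main__":
    sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
    print(sb.issue("Integer", "LD", "F6", None, "R2"))      # LD#1 issues
    print(sb.issue("Integer", "LD", "F2", None, "R3"))      # structural hazard
    sb.write_back("Integer")
    print(sb.issue("Integer", "LD", "F2", None, "R3"))      # LD#2 issues
    print(sb.issue("Mult1", "MULTD", "F0", "F2", "F4"))     # MULTD issues
    print(sb.can_read_operands("Mult1"))                    # RAW on F2
```

In this model the second load cannot issue while the Integer unit is busy, and MULTD issues but cannot read its operands until the load that produces F2 has written back.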
Our Example: “Simple Scoreboard”
Scoreboard Example: Cycle 1
Ø Note that, as this is a pipelined architecture, multiple instructions can be handled in the same cycle
Ø Let’s start from the first instruction, “Load” (LD F6 34+ R2)
• Check struct hazard of LD#1
• Issue LD#1 (to the Integer unit)
Scoreboard Example: Cycle 2
• Check hazard of LD#1’s operands: there’s no RAW hazard
• Read operands of LD#1
• Struct: LD#2 cannot issue because the Integer unit is busy
Scoreboard Example: Cycle 3
• Execute LD#1 (remaining cycles: 1 -> 0)
• LD#1 comp.
• Struct: LD#2 is still stalled
Scoreboard Example: Cycle 4
• Writeback LD#1
• Free FU & Reg status of LD#1
• Struct: LD#2 is still stalled
Scoreboard Example: Cycle 5
• Issue LD#2
Scoreboard Example: Cycle 6
• Check hazard of LD#2’s operands: there’s no RAW hazard
• Read operands of LD#2
• Check struct hazard of MULTD: FU of “MULTD” is empty
• Issue MULTD
Scoreboard Example: Cycle 7
Ø Execute LD#2 (remaining latency 1 → 0)
Ø LD#2 completes execution
Ø Check hazards on MULTD's operands: RAW hazard on F2
• MULTD can't read its operands because LD#2 has not written F2 back yet
Ø SUBD (sub F8 F6 F2) is issued to the Add (2) unit (ID1 = 7 in the instruction status table)
Scoreboard Example: Cycle 8a
First half of clock cycle
Ø Check structural hazard of DIVD (div F10 F0 F6): the FU of "DIVD" is empty
Ø Issue DIVD to Divide (40)
Ø NOTE) DIVD has a RAW hazard on F0, which is still being produced in Mult1
Scoreboard Example: Cycle 8b
Second half of clock cycle
Ø Write back LD#2: F2 now holds the result of LD#2
Ø Free the FU and register status entries of LD#2
Ø Now the F2 register is ready, so the RAW hazards on F2 (MULTD in Mult1, SUBD in Add) are resolved
Scoreboard Example: Cycle 9
Ø Check hazards on MULTD's operands: there is no RAW hazard
• Read operands of MULTD (Mult1: 10 cycles of execution remaining)
Ø Check hazards on SUBD's operands: there is no RAW hazard
• Read operands of SUBD (Add: 2 cycles of execution remaining)
Ø DIVD cannot read its operands ∵ RAW hazard on F0 (MULTD has not written it yet)
Ø ADDD cannot be issued: the FU of "ADDD" is busy (structural hazard)
Scoreboard (cycle 9; "-" marks an empty field)
Insn Status
Inst   ID1  ID2  EX  W
LD#1   1    2    3   4
LD#2   5    6    7   8
MULTD  6    9    -   -
SUBD   7    9    -   -
DIVD   8    -    -   -
ADDD   -    -    -   -
FU Status
FU     B    Op   dst  src1  src2  Q1     Q2  R1   R2
Int    No   -    -    -     -     -      -   -    -
Mult1  Yes  MUL  F0   F2    F4    -      -   Yes  Yes
Mult2  No   -    -    -     -     -      -   -    -
Add    Yes  SUB  F8   F6    F2    -      -   Yes  Yes
Div    Yes  DIV  F10  F0    F6    Mult1  -   No   Yes
Reg Status
Reg  F0     F2  F4  F6  F8   F10
FU   Mult1  -   -   -   Add  Div
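The read-operands decision made in cycle 9 can be reproduced from the register status row alone. The sketch below is a simplification under stated assumptions: `operands_ready` is a hypothetical helper, and it looks at the current register status, whereas a real scoreboard latches the producers into Q1/Q2 when the instruction is issued. It is therefore only valid for this snapshot.

```python
def operands_ready(reg_status, srcs):
    """True when no busy FU is still going to write any source register."""
    return all(reg_status.get(r) is None for r in srcs)


# Register status at cycle 9, copied from the slide (None = no pending writer).
reg_status = {"F0": "Mult1", "F2": None, "F4": None,
              "F6": None, "F8": "Add", "F10": "Div"}

mult_ready = operands_ready(reg_status, ("F2", "F4"))   # mul F0 F2 F4
sub_ready = operands_ready(reg_status, ("F6", "F2"))    # sub F8 F6 F2
div_ready = operands_ready(reg_status, ("F0", "F6"))    # div F10 F0 F6
print(mult_ready, sub_ready, div_ready)
```

MULTD and SUBD find both sources ready and read them, while DIVD keeps waiting because F0 is still owned by Mult1. This matches R1 = No for the Div row.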
Scoreboard Example: Cycle 10
Ø Calculating...
• MULTD in Mult1: remaining latency 10 → 9
• SUBD in Add: remaining latency 2 → 1
Ø Struct: ADDD still cannot be issued because the Add unit is busy
Scoreboard Example: Cycle 11
Ø Calculating...
• MULTD in Mult1: remaining latency 9 → 8
• SUBD in Add: remaining latency 1 → 0
Ø SUBD completes execution
Ø Struct: ADDD still cannot be issued because the Add unit is busy
Scoreboard Example: Cycle 12
Ø Calculating...
• MULTD in Mult1: remaining latency 8 → 7
Ø Write back SUBD: F8 now holds the result of SUBD
Ø Free the FU and register status entries of SUBD
• The Add unit becomes available again, which removes the structural hazard
Scoreboard Example: Cycle 13
Ø Calculating...
• MULTD in Mult1: remaining latency 7 → 6
Ø Issue ADDD (add F6 F8 F2) to the Add (2) unit
Scoreboard Example: Cycle 14
Ø Calculating...
• MULTD in Mult1: remaining latency 6 → 5
Ø Read operands of ADDD (Add: 2 cycles of execution remaining)
Scoreboard Example: Cycle 15
Ø Calculating...
• MULTD in Mult1: remaining latency 5 → 4
• ADDD in Add: remaining latency 2 → 1
Scoreboard Example: Cycle 16
Ø Calculating...
• MULTD in Mult1: remaining latency 4 → 3
• ADDD in Add: remaining latency 1 → 0
Ø ADDD completes execution
Scoreboard Example: Cycle 17
Ø Calculating...
• MULTD in Mult1: remaining latency 3 → 2
Ø Oops! ADDD can't write F6 because of DIVD: WAR hazard!
• DIVD (div F10 F0 F6) was issued earlier and has not read F6 yet, so the old value must be kept
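The write-back stall of cycle 17 boils down to a WAR test against earlier instructions that have not read their operands yet. The following is a minimal sketch under stated assumptions: `can_write_back` and the tuple layout `(issue_cycle, srcs, has_read)` are hypothetical, and the cycle numbers are the ones shown in the instruction status table.

```python
def can_write_back(dst, writer_issue, earlier_readers):
    """WAR check at write-back.

    earlier_readers holds (issue_cycle, srcs, has_read) for in-flight
    instructions. The write must wait while any instruction issued before
    the writer has dst as a source and has not read its operands yet.
    """
    return not any(issue < writer_issue and dst in srcs and not has_read
                   for issue, srcs, has_read in earlier_readers)


# Cycle 17: DIVD (issued in cycle 8) still waits for F0, so F6 is unread.
divd = (8, ("F0", "F6"), False)
blocked = not can_write_back("F6", 13, [divd])

# Cycle 22: DIVD read its operands in cycle 21, so ADDD may write F6.
divd_after_read = (8, ("F0", "F6"), True)
allowed = can_write_back("F6", 13, [divd_after_read])
print(blocked, allowed)
```

ADDD was issued in cycle 13, so only instructions issued before cycle 13 can block its write. Later readers of F6 are supposed to see the new value.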
Scoreboard Example: Cycle 18
Ø Calculating...
• MULTD in Mult1: remaining latency 2 → 1
Ø WAR: ADDD still cannot write F6
Scoreboard Example: Cycle 19
Ø Calculating...
• MULTD in Mult1: remaining latency 1 → 0
Ø MULTD completes execution
Ø WAR: ADDD still cannot write F6
Scoreboard Example: Cycle 20
Ø Write back MULTD: F0 now holds the result of MULTD
Ø Free the FU and register status entries of MULTD
Ø WAR: ADDD still cannot write F6 because DIVD has not read its operands yet
Scoreboard Example: Cycle 21
Ø Read operands of DIVD (Divide: 40 cycles of execution remaining)
Ø WAR hazard is removed!
• DIVD has read F6, so ADDD is free to overwrite it
Scoreboard Example: Cycle 22
Ø Calculating...
• DIVD in Divide: remaining latency 40 → 39
Ø Write back ADDD: F6 now holds the result of ADDD
Ø Free the FU and register status entries of ADDD
Faster than light computation
(skip cycles 23 to 60, during which DIVD keeps calculating)
Scoreboard Example: Cycle 61
Ø Calculating...
• DIVD in Divide: remaining latency 1 → 0
Ø DIVD completes execution
Scoreboard Example: Cycle 62
Ø Write back DIVD: F10 now holds the result of DIVD
Ø DONE!!!!!
Review of Cycle 62
Ø In-order issue
Ø Out-of-order execution & out-of-order commit
Ø Final register contents
• F0: MULTD, F2: LD#2, F6: ADDD (overwrites LD#1), F8: SUBD, F10: DIVD
Scoreboard (cycle 62; "-" marks an empty field)
Insn Status
Inst   ID1  ID2  EX  W
LD#1   1    2    3   4
LD#2   5    6    7   8
MULTD  6    9    19  20
SUBD   7    9    11  12
DIVD   8    21   61  62
ADDD   13   14   16  22
FU Status
FU     B    Op   dst  src1  src2  Q1  Q2  R1  R2
Int    No   -    -    -     -     -   -   -   -
Mult1  No   -    -    -     -     -   -   -   -
Mult2  No   -    -    -     -     -   -   -   -
Add    No   -    -    -     -     -   -   -   -
Div    No   -    -    -     -     -   -   -   -
Reg Status
Reg  F0  F2  F4  F6  F8  F10
FU   -   -   -   -   -   -
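The whole walkthrough can be replayed with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the original scoreboard hardware: the `Insn` class, `simulate`, and the rule that every stage sees only results from earlier cycles are modelling choices made here, picked so that the timing agrees with the instruction status table (ID1, ID2, EX, W). FU assignment is fixed per instruction and Mult2 is never used.

```python
from dataclasses import dataclass, field


@dataclass
class Insn:
    name: str
    fu: str            # functional unit the instruction needs
    dst: str
    srcs: tuple
    latency: int
    issue: int = 0     # ID1 (0 means "not yet")
    read: int = 0      # ID2
    done: int = 0      # EX complete
    wb: int = 0        # W
    producers: list = field(default_factory=list)


def finished_before(ins, cycle):
    """True if ins wrote back in a cycle strictly before `cycle`."""
    return ins.wb != 0 and ins.wb < cycle


def simulate(program, max_cycles=1000):
    fu_user = {}   # FU -> last instruction issued to it
    writer = {}    # register -> last issued instruction that writes it
    nxt = 0        # index of the next instruction to issue (in order)
    for cycle in range(1, max_cycles + 1):
        if all(i.wb for i in program):
            break
        for idx, ins in enumerate(program):
            if ins.done and not ins.wb and ins.done < cycle:
                # Write back, unless an earlier instruction still has to
                # read the old value of dst (WAR hazard).
                war = any(ins.dst in e.srcs and (e.read == 0 or e.read >= cycle)
                          for e in program[:idx])
                if not war:
                    ins.wb = cycle
            elif ins.issue and not ins.read and ins.issue < cycle:
                # Read operands once every producer has written back (RAW).
                if all(finished_before(p, cycle) for p in ins.producers):
                    ins.read = cycle
                    ins.done = cycle + ins.latency
        if nxt < len(program):
            # Issue: needs a free FU (structural) and no pending write to dst (WAW).
            ins = program[nxt]
            fu_free = ins.fu not in fu_user or finished_before(fu_user[ins.fu], cycle)
            no_waw = ins.dst not in writer or finished_before(writer[ins.dst], cycle)
            if fu_free and no_waw:
                ins.issue = cycle
                ins.producers = [writer[s] for s in ins.srcs if s in writer]
                fu_user[ins.fu] = writer[ins.dst] = ins
                nxt += 1
    return {i.name: (i.issue, i.read, i.done, i.wb) for i in program}


program = [
    Insn("LD#1", "Integer", "F6", ("R2",), 1),
    Insn("LD#2", "Integer", "F2", ("R3",), 1),
    Insn("MULTD", "Mult1", "F0", ("F2", "F4"), 10),
    Insn("SUBD", "Add", "F8", ("F6", "F2"), 2),
    Insn("DIVD", "Divide", "F10", ("F0", "F6"), 40),
    Insn("ADDD", "Add", "F6", ("F8", "F2"), 2),
]
timing = simulate(program)
for name, cycles in timing.items():
    print(name, *cycles)
```

Producers are captured at issue time, which is why DIVD waits only for MULTD (F0) and not for the later ADDD that also writes F6.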
Scoreboard Summary
Ø The good
+ Cheap hardware
• InsnStatus + FuStatus + RegStatus ~ 1 FP unit in area
+ Pretty good performance
• 1.7X for FORTRAN (scientific array) programs
Ø The less good
- No bypassing
• Is this a fundamental problem?
- Limited scheduling scope
• Structural/WAW hazards delay dispatch
- Slow issue of truly-dependent (RAW) instructions
• WAR hazards delay writeback
• Fix with hardware register renaming
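To see why register renaming fixes the WAR stall, the DIVD/ADDD pair from the example can be renamed by hand. This is a toy sketch under stated assumptions: `rename` is a hypothetical helper, physical registers P0, P1, ... are treated as unlimited (no free list or reclamation), and every destination simply receives a fresh physical register.

```python
def rename(program, arch_regs):
    """Map each destination to a fresh physical register, in program order."""
    mapping = {r: f"P{i}" for i, r in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for op, dst, src1, src2 in program:
        # Sources use the mapping that is current *before* this instruction.
        s1, s2 = mapping.get(src1, src1), mapping.get(src2, src2)
        mapping[dst] = f"P{next_phys}"
        next_phys += 1
        renamed.append((op, mapping[dst], s1, s2))
    return renamed


arch_regs = ["F0", "F2", "F4", "F6", "F8", "F10"]
program = [
    ("div", "F10", "F0", "F6"),   # DIVD reads the old F6
    ("add", "F6", "F8", "F2"),    # ADDD overwrites F6 (WAR on F6)
]
renamed = rename(program, arch_regs)
for insn in renamed:
    print(*insn)
```

After renaming, ADDD writes a register that DIVD never reads, so its write-back no longer has to wait for DIVD to read its operands.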
Backup
Ø Note that O3 means Out-of-Order Completion
Ø In-order issue: instructions are issued in order!
• In-order instruction stream; execution begins in order