Multi-issue Processor

Jie Zhang
[Adapted from EE488, Myoungsoo Jung, KAIST]
Remind: Performance w/ Pipelining
ØWe finish one instruction per cycle
• After the initial warm-up period
ØInstruction throughput
• Latch at end of each stage adds latency
• Longest stage determines clock cycle time
• Example:

Stage   Latency
IF      1.0 ns
ID      0.6 ns
EX      0.9 ns
MEM     1.2 ns
WB      0.4 ns

Design         Cycle time      Inst/Cycle
Single cycle   4.1 ns (sum)    1
Pipeline       1.2 ns (max)    1

Speedup due to pipelining: 4.1 / 1.2 = 3.42
Note: ideally it would be 5
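The speedup in this example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the slide; the names `stage_latency_ns` and `cycle_times` are made up for illustration, and latch overhead is ignored.

```python
# Stage latencies in ns, taken from the example above.
stage_latency_ns = {"IF": 1.0, "ID": 0.6, "EX": 0.9, "MEM": 1.2, "WB": 0.4}

def cycle_times(latencies):
    """Return (single-cycle time, pipelined cycle time) in ns."""
    single_cycle = sum(latencies.values())   # one long cycle covers all stages
    pipelined = max(latencies.values())      # the longest stage sets the clock
    return single_cycle, pipelined

single, pipelined = cycle_times(stage_latency_ns)
speedup = single / pipelined                 # both designs finish 1 inst/cycle
ideal = len(stage_latency_ns)                # perfectly balanced 5-stage pipeline
print(f"speedup = {speedup:.2f}, ideal = {ideal}")
```

The gap between the measured and the ideal speedup comes from the unbalanced stages: every stage is clocked at the 1.2 ns of MEM.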

Peking University
Then, How about using Deeper pipelines?
Ø Called Super-pipelining
Ø Goal: Speedup ~ number of stages
Ø Idea: let’s have a lot of stages
• Vaguely defined by deep pipelining
• How about > 30? Think Pentium 4.
Ø What does this mean?
• While splitting a stage into multiple sub-stages, the machine can issue a new instruction every minor cycle
• If one splits each stage into “m” sub-stages, the clock cycle period “T” is reduced to “T/m”
[Figure: the stages Fetch, Decode, Inst, each of period T, are split into the sub-stages F1 F2, D1 D2, E1 E2, each of period T/m]

Ideal Case of Super-pipelining
ØThe super-pipeline produces the result every T/m clock cycle

[Figure: timing diagram over cycles 1 to 5. Pipelining: Instr. i to i+3 each pass through Fetch, Decode, Inst, and a new instruction starts every cycle T. Super-pipelining: Instr. i to i+3 each pass through F1 F2 D1 D2 E1 E2, and a new instruction starts every T/m. The earlier finish of the super-pipeline is marked as the gain.]

Problems of Super-pipelining
1. Require a really high clock frequency!
2. Instruction dependences (similar to pipelining, but much more critical)

Control dependences:
• The result of the branch comparison arrives only at the end of the deep pipeline
• By then, Instr1 to Instr31 behind the beq have already entered the pipeline
• Flush 31 instr.!

Data dependences:
• LW R1,0(R2) is followed by ADD R2,R3,R1
• The value of R1 is needed by the ADD long before the value from memory arrives
• Stall 15 cycles!

There is an Optimal Number of Stages
ØFor a given architecture and associated instruction set
ØAlso need to consider workload characteristics
Diminishing returns: increasing the number of stages over this limit (around 15 stages) reduces the overall performance

Source: “Runtime Aware Architectures”, Mateo Valero, HiPEAC CSW 2014


Another Instruction-Level Parallelism (ILP)
Principle of internal operation of ILP processors:
• Pipelined operation (an instruction flows through EX1 EX2 EX3): pipelined processors
• Parallel operation (several instructions execute side by side): superscalar / VLIW processors
• The two design principles are orthogonal

Super-scalar
Dynamic multiple-issue processors (decision making at run time by the hardware)

IBM Power 3, Pentium 4, MIPS R10K, HP PA 8500

Dynamic Multiple Issue Machine (Superscalar)
ØSuperscalar architectures allow several instructions to be issued and completed per clock cycle
ØA superscalar architecture consists of a number of pipelines that are working in parallel (N-way Superscalar)
• Can issue up to N instructions per cycle
[Figure: conceptual N-way superscalar datapath with a PC, an instruction memory, a multi-port register file, multiple ALUs, a multi-port data memory, and a multi-input mux]

Super-pipelining vs. Super-scalar
Ø Note that super-pipelining is orthogonal to super-scalar
[Figure: timing diagram over cycles 1 to 5. Super-pipelining: Instr. i to i+3 each pass through F1 F2 D1 D2 E1 E2, one sub-stage apart. 2-way super-scalar: Instr. i and i+1 pass through Fetch, Decode, Inst together (parallel execution), followed one cycle later by Instr. i+2 and i+3. The earlier finish is marked as the gain.]

Superscalarity is Important!
ØIdeal case of N-way Super-scalar
• All instructions are independent
• Speedup is “N” (Superscalarity)
[Figure: three instructions pass through Fetch, Decode, Inst in the same cycles]

§ What if all instructions are dependent?
=> No speedup, super-scalar brings nothing
=> (Just similar to pipelining)
[Figure: the three instructions pass through Fetch, Decode, Inst one cycle apart, as in a plain pipeline]

Pipelining vs. Super-scalar
ØSuperscalarity ex.: sum of array elements
(assume for the sake of brevity that a load-to-use dependence takes only one cycle)

C program:
sum += a[i--]

Assembly code (after compilation):
loop: ld  $r2, 10($r1)
      add $r3, $r3, $r2
      sub $r1, $r1, 1
      bne $r1, $r0, loop

• ld and add cannot execute in parallel (add needs $r2)
• add and sub can execute in parallel
• sub and bne cannot execute in parallel (bne needs $r1)
[Figure: on the pipelining architecture, ld, add, sub, bne pass through Fetch and Decode one cycle apart; on the superscalar architecture, add and sub pass through Fetch and Decode together]
We should find independent instructions!

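The pairing decision for the loop above can be sketched as a register-set comparison. This is a simplified illustration, not how real issue logic is built: it only checks read-after-write between adjacent instructions, and the tuple encoding and the name `can_pair` are assumptions made for this sketch.

```python
# Each instruction: (name, registers written, registers read).
loop_body = [
    ("ld",  {"r2"}, {"r1"}),
    ("add", {"r3"}, {"r3", "r2"}),
    ("sub", {"r1"}, {"r1"}),
    ("bne", set(),  {"r1", "r0"}),
]

def can_pair(first, second):
    """True if `second` reads nothing that `first` writes (no RAW between them)."""
    _, first_writes, _ = first
    _, _, second_reads = second
    return not (first_writes & second_reads)

# Check every adjacent pair of the loop body.
pairs = {(a[0], b[0]): can_pair(a, b) for a, b in zip(loop_body, loop_body[1:])}
for (a, b), ok in pairs.items():
    print(f"{a} + {b}: {'parallel' if ok else 'cannot execute in parallel'}")
```

Only the add/sub pair passes the check, which matches the annotation on the slide.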
From Sequential Instructions to Parallel Execution
Q1. What determines the ability of a superscalar processor to execute instructions in parallel?
A. The number and nature of parallel pipelines
B. The mechanism that the processor uses to find independent instructions (which can be executed in parallel)

The policies used for instruction execution are characterized by the following factors
=> The order in which instructions are issued for execution (In-Order/Out-of-Order)
=> The order in which instructions are completed (In-Order/Out-of-Order)

Q. BTW, what’s the meaning of controlling instruction completion?
A. Determining when an instruction writes its results into registers and memory locations

Execute in Parallel
But Make Sure Sequential Order

Ø Executing and completing instructions strictly in their sequential order leaves little chance to execute in parallel
Ø To improve parallelism, a superscalar processor has to look ahead and try to find independent instructions to execute in parallel

Instructions will be executed in an order different from the strictly sequential one, with the restriction that the result must be correct
=> Ideas for the execution policies; we will learn more details with specific techniques for this, partially today and more in the future

Example Scenario of Parallel Execution Policy
Ø We consider the following instruction sequence:
I1: ADDF R12,R13,R14 R12 ← R13 + R14 (float. pnt.)
I2: ADD R1,R8,R9 R1 ← R8 + R9
I3: MUL R4,R2,R3 R4 ← R2 * R3
I4: MUL R5,R6,R7 R5 ← R6 * R7
I5: ADD R10,R5,R7 R10 ← R5 + R7
I6: ADD R11,R2,R3 R11 ← R2 + R3
Ø Assumption:
• I1 requires two cycles to execute
• I3 and I4 are in conflict for the same functional unit
• I5 depends on the value produced by I4 (we have a true data dependency between I4
and I5);
• I2, I5 and I6 are in conflict for the same functional unit

Example Scenario of Parallel Execution Policy
Ø Parallel execution policy 1
• Issue: In-Order & Completion: In-Order
• Instructions are issued in the exact order that would correspond to sequential execution; results are written (completion) in the same order
ØParallel execution policy 2
• Issue: Out-of-Order & Completion: Out-of-Order
• From the set of decoded instructions, the processor looks ahead and issues any instruction, in any order, as long as the program execution is correct

Parallel Execution Policy 1
In-Order Issue with In-Order Completion
I1: ADDF $R12, $R13, $R14
I2: ADD $R1, $R8, $R9
I3: MUL $R4, $R2, $R3
I4: MUL $R5, $R6, $R7
I5: ADD $R10, $R5, $R7
I6: ADD $R11, $R2, $R3

Considerations:
• I1: two cycles to execute
• I2, I5, I6: same functional unit
• I3, I4: same functional unit
• I4 -> I5: true data dependency
• An instruction cannot be issued before the previous one has been issued
• An instruction completes only after the previous one has completed

Cycle   Decode/Issue   Execute (ADDF, ADD, MUL units)   Writeback/Complete
1       I1 I2
2       I3 I4          I1 I2
3       I5 I6          I1
4                      I3                               I1 I2
5                      I4                               I3
6                      I5                               I4
7                      I6                               I5
8                                                       I6

Parallelism Depends on the Program
In-Order Issue with In-Order Completion

Ø The processor detects and handles (by stalling) true data dependencies and resource conflicts.
Ø As instructions are issued and completed in their strict order, the resulting parallelism depends very much on the way the program is written/compiled.

Original order:                 After switching the positions of I3 and I6:
I1: ADDF $R12, $R13, $R14       I1: ADDF $R12, $R13, $R14
I2: ADD $R1, $R8, $R9           I2: ADD $R1, $R8, $R9
I3: MUL $R4, $R2, $R3           I6: ADD $R11, $R2, $R3
I4: MUL $R5, $R6, $R7           I4: MUL $R5, $R6, $R7
I5: ADD $R10, $R5, $R7          I5: ADD $R10, $R5, $R7
I6: ADD $R11, $R2, $R3          I3: MUL $R4, $R2, $R3

The reordered sequence allows parallel execution.

Rewrite Code for Better Parallelism
In-Order Issue with In-Order Completion
I1: ADDF $R12, $R13, $R14
I2: ADD $R1, $R8, $R9
I6: ADD $R11, $R2, $R3
I4: MUL $R5, $R6, $R7
I5: ADD $R10, $R5, $R7
I3: MUL $R4, $R2, $R3

Considerations:
• I1: two cycles to execute
• I2, I5, I6: same functional unit
• I3, I4: same functional unit
• I4 -> I5: true data dependency

Cycle   Decode/Issue   Execute (ADDF, ADD, MUL units)   Writeback/Complete
1       I1 I2
2       I6 I4          I1 I2
3       I5 I3          I1 I6 I4
4                      I5 I3                            I1 I2
5                                                       I6 I4
6                                                       I5 I3
7
8

Parallel Execution Policy 2
Out-of-Order Issue with Out-of-Order Completion
I1: ADDF $R12, $R13, $R14
I2: ADD $R1, $R8, $R9
I3: MUL $R4, $R2, $R3
I4: MUL $R5, $R6, $R7
I5: ADD $R10, $R5, $R7
I6: ADD $R11, $R2, $R3

Considerations:
• I1: two cycles to execute
• I2, I5, I6: same functional unit
• I3, I4: same functional unit
• I4 -> I5: true data dependency
• Out-of-order issue does not need to wait until I1 is executed
• Similar to issue, completion does not need to wait until I1 is completed
• The true dependency still has to be resolved before I5 can issue

Cycle   Decode/Issue   Execute (ADDF, ADD, MUL units)   Writeback/Complete
1       I1 I2
2       I3 I4          I1 I2
3       I5 I6          I1 I3                            I2
4       (I5 waits)     I6 I4                            I1 I3
5                      I5                               I4 I6
6                                                       I5
7
8

Challenges and Considerations of Superscalar
ØNo free lunch! There are dependencies!
ØMust check dependencies for all instructions, which are
ØSimultaneously decoded
ØIn-progress in the pipeline (e.g., previously issued)

Dependences (constraints)
• Data dependences: arise from precedence requirements concerning referenced data (RAW, WAR, and WAW dependencies)
• Control dependences: arise from conditional statements
• Resource dependences: arise from limited resources

Naïve solution (stalls at decode by comparing)
1) Rs vs. outstanding targets
2) Rt vs. outstanding targets + sources
3) No hardware resources
Stalling everything resolves these dependencies as well, but it would introduce longer latency. Do we have something better?

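The three kinds of data dependence can be told apart by comparing destination and source registers of two instructions. The following is a minimal sketch; the dictionary encoding and the function name `classify` are assumptions made for illustration, and the example instructions are taken from elsewhere in these slides.

```python
def classify(first, second):
    """Data dependences of `second` on `first`, where `first` comes earlier."""
    deps = []
    if first["dst"] in second["src"]:
        deps.append("RAW")   # second reads what first writes
    if second["dst"] in first["src"]:
        deps.append("WAR")   # second overwrites what first still has to read
    if first["dst"] == second["dst"]:
        deps.append("WAW")   # both write the same register
    return deps

# I4: MUL R5,R6,R7 and I5: ADD R10,R5,R7 from the example scenario.
i4 = {"dst": "R5", "src": {"R6", "R7"}}
i5 = {"dst": "R10", "src": {"R5", "R7"}}

# A: R1 = R3 / R4 followed by B: R3 = R2 * R4.
war_a = {"dst": "R1", "src": {"R3", "R4"}}
war_b = {"dst": "R3", "src": {"R2", "R4"}}

# A: R1 = R2 + R3 followed by B: R1 = R3 * R4.
waw_a = {"dst": "R1", "src": {"R2", "R3"}}
waw_b = {"dst": "R1", "src": {"R3", "R4"}}

print(classify(i4, i5), classify(war_a, war_b), classify(waw_a, waw_b))
```

A decode stage that stalls on any non-empty result corresponds to the naïve solution; the techniques listed next try to avoid stalling on WAR and WAW.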
A Set of Better Solutions for Superscalar
ØTo address the dependency issues w/ in-flight instructions
• Control dependency
• Loop unrolling of static scheduling
• Branch prediction (learned)
• Dependencies
• Scoreboard
• Static scheduling
• Hardware register renaming of Tomasulo
• Regulation of instruction ordering
• Reservation station
• Reorder buffer/ROB
• Exception handling

Superscalar Architecture
ØPutting it all together: a more detailed superscalar architecture
[Figure: block diagram with the following components: fetch & address calculation & branch prediction, instruction cache (Instr$), instruction buffer, decode & rename & dispatch, instruction window (queues, reservation stations, etc.), instruction issuing, floating point unit, two integer units, memory, register files, and commit]

Ancient Superscalar Architecture
Ø PowerPC 6XX
• Six independent execution units:
• Branch execution unit
• Load/store unit
• Three integer units
• Floating point unit
• Out-of-order issue
ØPentium I/II
• P-I: Three independent units
• P-II: out-of-order, five instructions can be issued in a cycle

VLIW
Static multiple-issue processors (decision making at compile time by the compiler)

Intel Itanium series for the IA-64 ISA
– EPIC (Explicitly Parallel Instruction Computing)

VLIW: Very Long Instruction Word
ØKey Idea: Replace a traditional sequential ISA with a new ISA that enables
the compiler to encode instruction-level parallelism (ILP) directly in the
hardware/software interface
[Figure: the VLIW compiler takes a C program, finds independent operations, and schedules them; the VLIW processor executes the result directly. The VLIW ISA is the interface between the two.]

C program:
for(i=0;i<n;i++)
    dest[i]=src[i]*coeff;

ØMultiple “sub-instructions” can be packed into one long instruction
ØSub-instructions within a long instruction must be independent
ØEach “slot” in a VLIW instruction is for a specific functional unit

VLIW Hardware (TinyRV1 VLIW Processor)
One VLIW instruction carries one operation per pipe:

Y-pipe         X-pipe           L-pipe         S-pipe
mul x1,x2,x3   add x4, x1, x5   lw x6, 0(x7)   sw x6, 0(x8)

[Figure: after the shared F and D stages, the mul flows through Y0 Y1 Y2 Y3, the add through X0, the lw through L0 L1, and the sw through S0 S1, followed by the shared W stage]

Ø TinyRV1 VLIW ISA
• 4-cycle Y-pipe, 1-cycle X-pipe, 2-cycle L-pipe, 2-cycle S-pipe
• No hazard checking, assume ISA enables setting bypass muxing

VLIW Software (Compilation Techniques)
Ø Key Questions:
• How do we find independent instructions to fetch/execute?
• How to enable more compiler optimizations?

Ø Key Ideas:
• Get rid of control flow
• Predicated execution, loop unrolling
• Optimize frequently executed code-paths
• Trace scheduling
• Others: Software pipelining

Compile Technique1: Loop Unrolling
ØKey idea: Unroll loop to perform M iterations at once
ØLimitations: Code growth, does not handle inter-iteration dependences

Original C code:
for (i=0; i<N; i++)
    B[i] = A[i] + C;

Optimized C code (after loop unrolling by the VLIW compiler):
for (i=0; i<N; i+=4) {
    B[i]   = A[i]   + C;
    B[i+1] = A[i+1] + C;
    B[i+2] = A[i+2] + C;
    B[i+3] = A[i+3] + C;
}

Expose more Instruction-level Parallelism!

Compile Technique2: Predicated Execution
ØKey idea: Eliminate hard-to-predict branches by converting
control dependence to data dependence
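Original C code:
if (cond) {
    b = 0;
}
else {
    b = 1;
}

(normal branch code)            (predicated code)
A: p1 = (cond)                  A: p1 = (cond)
   branch p1, TARGET
B: mov b, 1                     B: (!p1) mov b,1
   jmp JOIN
C: TARGET:                      C: (p1) mov b,0
   mov b, 0

[Figure: in the normal branch code, block A branches to C when taken (T) or falls through to B when not taken (N), and both paths join at D; in the predicated code, A, B, C, D form one straight-line sequence]
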
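The effect of the conversion can be imitated in Python. Python has no predicated instructions, so this is only a model: the hypothetical helper `guarded_mov` stands in for a guarded move such as `(p1) mov b,0`, which always flows through the pipeline but only takes effect when its predicate is true.

```python
def branch_code(cond):
    # Control dependence: which assignment executes depends on the branch.
    if cond:
        b = 0
    else:
        b = 1
    return b

def guarded_mov(pred, old, new):
    # Models "(pred) mov dst, new": dst keeps its old value when pred is false.
    return new if pred else old

def predicated_code(cond):
    p1 = bool(cond)                  # A: p1 = (cond)
    b = None
    b = guarded_mov(not p1, b, 1)    # B: (!p1) mov b,1
    b = guarded_mov(p1, b, 0)        # C: (p1)  mov b,0
    return b                         # both moves depend on p1 through data only

for cond in (True, False):
    print(cond, branch_code(cond), predicated_code(cond))
```

Both versions compute the same `b`; the predicated one always executes both moves, which is the price paid for removing the branch.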
Compile Technique2: Predicated Execution
ØKey idea: Eliminate hard-to-predict branches by converting
control dependence to data dependence
ØLimitations: Reduces performance if the misprediction cost is smaller than the cost of the extra predicated instructions
[Figure: animation of blocks A to F flowing through a pipeline with the stages Fetch, Decode, Rename, Schedule, RegisterRead, Execute. With predicated execution, the blocks of both paths flow through without interruption and the unneeded instructions become nops. With branch prediction, a misprediction triggers a pipeline flush.]

Credits: MICRO’06 Diverge-Merge Processor (DMP) UT-Austin

Advantages & Disadvantages of VLIW
Ø Advantages
• Simpler hardware (potentially less power hungry)
• Potentially more scalable
• Allow more instr’s per VLIW bundle and add more FUs
Ø Disadvantages
• Programmer/compiler complexity and longer compilation times
• Deep pipelines and long latencies can be confusing (making peak performance elusive)
• Lock step operation, i.e., on hazard all future issues stall until hazard is resolved
(hence need for predication)
• Object (binary) code incompatibility
• Needs lots of program memory bandwidth
• Code bloat
• Noops are a waste of program memory space
• Loop unrolling to expose more ILP uses more program memory space

CISC vs RISC vs SuperScalar vs VLIW
                   CISC                       RISC                       Superscalar                      VLIW
Instr size         variable size              fixed size                 fixed size                       fixed size (but large)
Instr format       variable format            fixed format               fixed format                     fixed format
Registers          few, some special          many GP                    GP and rename (RUU)              many, many GP
Memory reference   embedded in many instr’s   load/store                 load/store                       load/store
Key Issues         decode complexity          data forwarding, hazards   hardware dependency resolution   (compiler) code scheduling
Instruction flow   [Figure: overlapped IF ID EX M WB sequences for each design; the VLIW column shows additional EX M WB lanes behind a shared IF ID]

Static Scheduling

Recall: Data-Dependence Stalls
ØPreviously, we tried to reduce the program execution time with pipelining and multiple issue
ØBut, there are limits due to data dependency

Single-issue pipeline (Lecture 3)
• Load-Use (RAW)
• What else?
ØWhen no bypassing exists
ØLong-latency instructions

Multiple-issue; Superscalar (Lecture 4)
• Limits on WAW
• What else?
ØInstructions executing in the same cycle shouldn't have RAW/WAR

Solution: Instruction Scheduling
Static scheduling (compile time): unscheduled program -> static scheduler -> functional units
• Conservative schedule
• Low-complexity implementation

Dynamic scheduling (run time): unscheduled program -> dynamic scheduler -> functional units
• High-quality schedule
• High-complexity implementation

Static Scheduling
ØStatically schedule inst. from the compiler angle! (for data-dep.)
What is a Compiler?
ØA compiler translates a program written in a high-level language into an equivalent program in a target language

Source file (ChaseLab.c) -> Compiler -> Machine code (ChaseLab.exe)

Source file:
void main()
{
    printf("Yay!");
}

Machine code:
11011001
01000100
00010000
10101011

Performance Impacts of Compiler
ØCompiler optimizations may improve performance significantly

Performance can be improved 10x by advanced compiler optimization!
[Figure: bar chart of relative performance for increasing levels of compiler optimization. Labels: legacy system; original source, default options; O3 (enable advanced optimizations); adding OpenMP compiler directives, minor source changes; (8 CPUs). Bar values: 1x, 1.2x, 2.5x, 4x, 10x.]

Source: Intel Corp. 2006 (Itanium® 2 4 CPU)

What is Compiler Optimization?
ØThe most common requirement is to minimize the time taken to execute a program with the restriction that the result must be correct

Execution Time = IC * CPI * CCT
IC: Instruction Count (would be improved with conventional optimization techniques)
CPI: Cycles per Instruction (would be improved with optimization techniques for ILP)
CCT: Clock Cycle Time

Compiler Optimization = Graph Problem!
ØThe input of the optimization process is a control flow graph (CFG)
Ø A directed graph where
Ø Each node represents a statement
Ø Edges represent control flow

x:= a+b;
y:= a*b;
while (y>a) {
    a:=a+1;
    x:=a+b;
}

[Figure: CFG generation. In the statement-level CFG, the nodes x:=a+b, y:=a*b, y>a, a:=a+1, x:=a+b follow each other, with a loop edge from the last x:=a+b back to y>a. In the variation with basic blocks, {x:=a+b; y:=a*b}, {y>a}, and {a:=a+1; x:=a+b} are one node each.]
Basic blocks: a sequence of instructions w/ no branches into or out of the block

Simple Loop Example
Simple loop:
for(i=1; i<=1000; i++)
    x[i]=x[i] + s;

Compilation w/ a vanilla compiler:
Loop: L.D   F0, 0(R1)     ;F0=array el.
      ADD.D F4,F0,F2      ;add scalar in F2
      S.D   F4, 0(R1)     ;store result
      SUBI  R1,R1,#8      ;decrement pointer
      BNEZ  R1,R2,Loop    ;branch

Our machine specification:
Instruction producing result   Instruction using the result   Delay in clock cycles
FP ALU op                      Another FP ALU op              3
FP ALU op                      Store double                   2
Load double                    FP ALU op                      1
Load double                    Store double                   0
Integer op                     Integer op                     0
FP ALU op                      Branch                         1
Branch                                                        1

Execution in the machine:
1. Loop: LD F0, 0(R1)
2. Stall
3. ADDD F4,F0,F2
4. Stall
5. Stall
6. SD F4,0(R1)
7. SUBI R1,R1,#8
8. Stall
9. BNEZ R1,R2,Loop
10. Stall

10 clocks per iteration (5 stalls)
=> Can we rewrite the code to minimize stalls?

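The stall count can be reproduced with a small in-order issue model. This is a sketch under stated assumptions: the one-cycle delay between an integer op and a dependent branch and the single branch delay slot are inferred from the 10-cycle trace above, the function name `cycles_per_iteration` is made up, and the reordered schedule is the classic textbook one with the store moved into the delay slot, not necessarily the exact schedule drawn in the lecture.

```python
# Extra delay cycles between a producer and a dependent consumer.
DELAY = {("load", "fp"): 1, ("fp", "store"): 2, ("fp", "fp"): 3,
         ("load", "store"): 0, ("int", "int"): 0,
         ("int", "branch"): 1}          # assumption inferred from the trace
BRANCH_DELAY = 1                         # one delay slot after a branch

def cycles_per_iteration(schedule):
    """schedule: list of (name, kind, producers). Returns cycles per iteration."""
    issue, kind = {}, {}
    cycle = 0
    for name, k, producers in schedule:
        earliest = cycle + 1             # in-order, one instruction per cycle
        for p in producers:              # wait for every producer's delay
            earliest = max(earliest, issue[p] + 1 + DELAY[(kind[p], k)])
        cycle = earliest
        issue[name], kind[name] = cycle, k
    if schedule[-1][1] == "branch":      # an unfilled delay slot costs a stall
        cycle += BRANCH_DELAY
    return cycle

unscheduled = [("LD", "load", []), ("ADDD", "fp", ["LD"]),
               ("SD", "store", ["ADDD"]), ("SUBI", "int", []),
               ("BNEZ", "branch", ["SUBI"])]
scheduled = [("LD", "load", []), ("SUBI", "int", []),
             ("ADDD", "fp", ["LD"]), ("BNEZ", "branch", ["SUBI"]),
             ("SD", "store", ["ADDD"])]  # SD fills the branch delay slot

for label, s in (("unscheduled", unscheduled), ("scheduled", scheduled)):
    total = cycles_per_iteration(s)
    print(label, total, "cycles,", total - len(s), "stalls")
```

In the reordered version the store has to use an adjusted offset, because SUBI now runs before it; the model above only counts cycles and does not track addresses.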
Scheduled Loop Body (with CFG)

[Figure: the loop body LD F0,0(R1); ADDD F4,F0,F2; SD F4,0(R1); SUBI R1,R1,#8; BNEZ R1,R2,Loop laid out against clock cycles, before and after compiler optimization. The optimized schedule reorders the instructions and utilizes the delayed branch slot.]
5 stalls -> 1 stall!

Goal of Multi-Issue Scheduling
ØPlace as many independent instructions in sequence as possible
• “as many” -> up to execution bandwidth
• Don’t need 7 independent instructions on a 3-wide machine
• Avoid pipeline stalls

ØIf the compiler is really good, we should be able to get high performance on an in-order superscalar processor
• In-order superscalar provides execution bandwidth, compiler provides dependence scheduling

Why this Should work?
ØCompiler has “all the time in the world” to analyze instructions
Ø Hardware must do it in < 1ns
ØCompiler can “see” a lot more
Ø Compiler can do complex inter-procedural analysis, understand high-level behavior of code
and programming language
Ø Hardware can only see a small number of instructions at a time
[Figure: static detection & resolution of dependencies (ILP compiler): a software window moves over the input code. Dynamic detection & resolution of dependencies (HW): the instruction issue unit of the processor sees only an issue window and an execution window of the input code.]

Why might this not work?
Ø Can’t always schedule around branches
• limited access to dynamic information (profile-based info)
• Perhaps none at all, or not representative
• Ex. Branch T in 1st ½ of program, NT in 2nd ½, looks like 50-50 branch in profile
Ø Not all stalls are predictable
• Cannot react to dynamic events like data cache misses

Although static scheduling (done by the compiler) has limits, there is still room to reap the benefits.
Let’s check the detailed techniques!

Conventional Optimization Techniques
Execution Time = IC * CPI * CCT
IC: Instruction Count
CPI: Cycles per Instruction
CCT: Clock Cycle Time
We mainly focus on Instruction Count!

Technique1: Register Renaming
ØObservation1: weird register allocation
• Largely limited by architected registers
• Could possibly cause more spills/fills
ØObservation2: Dynamic Dead Code (branches)
• Code motion may be limited
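[Figure: a CFG in which R1=R2+R3 and BEQZ R9 are followed on one path by R1=LOAD 0[R6] and R5=R1-R4. Moving the LOAD above the branch requires allocating registers differently (the LOAD writes R8 instead of R1) and causes unnecessary execution of the LOAD when the branch goes left.]
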
Register Renaming & Scheduling
ØSame functionality, no stalls

Original              After renaming          After scheduling (to remove stall)
A: R1 = R2 + R3       A:  R1 = R2 + R3        A:  R1 = R2 + R3
B: R4 = R1 - R5       B:  R4 = R1 - R5        C’: R8 = LOAD 0[R7]
C: R1 = LOAD 0[R7]    C’: R8 = LOAD 0[R7]     B:  R4 = R1 - R5
D: R2 = R1 + R6       D’: R2 = R8 + R6        E’: R9 = R3 + R5
E: R6 = R3 + R5       E’: R9 = R3 + R5        D’: R2 = R8 + R6
F: R5 = R6 - R4       F’: R5 = R9 - R4        F’: R5 = R9 - R4

• The registers written by C and E do not need to be the same as before (R1 -> R8, R6 -> R9)
[Figure: CFG view of A to F before renaming, after renaming, and after scheduling, where the instructions pair up as (A, C’), (B, E’), (D’, F’)]

Technique2: Loop Unrolling
ØTransforms an M-iterations loop into a loop with M/N iterations
ØWe say that the loop has been unrolled N times
ØSome compilers can do this (gcc -funroll-loops) or you can do it manually (below)

Original:
for(i=0;i<100;i++)
    a[i]*=2;

Unrolled 4 times:
for(i=0;i<100;i+=4){
    a[i]*=2;
    a[i+1]*=2;
    a[i+2]*=2;
    a[i+3]*=2;
}

Why Loop Unrolling? (1)
ØGet rid of small loops
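Original:
for(i=0;i<4;i++)
    a[i]*=2;

Fully unrolled:
a[0]*=2;
a[1]*=2;
a[2]*=2;
a[3]*=2;

[Figure: the loop executes as four blocks for(0), for(1), for(2), for(3) separated by branches; the unrolled code is a single block]
• With the loop: difficult to schedule/hoist insts from bottom block to top block due to branches
• Unrolled: easier, no branches in the way
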
Why Loop Unrolling? (2)
ØLess loop overhead
ØAllow better scheduling of instructions

Original loop body (executed four times, one branch each):
Loop: L.D    F0,0(R1)
      ADD.D  F0,F0,F2
      S.D    F0,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1, R2, Loop

Unrolled 4 times:
Loop: L.D    F0,0(R1)
      ADD.D  F0,F0,F2
      S.D    F0,0(R1)
      L.D    F0,-8(R1)
      ADD.D  F0,F0,F2
      S.D    F0,-8(R1)
      L.D    F0,-16(R1)
      ADD.D  F0,F0,F2
      S.D    F0,-16(R1)
      L.D    F0,-24(R1)
      ADD.D  F0,F0,F2
      S.D    F0,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1, R2, Loop

4 branches -> 1 branch

Loop Unrolling: Problems
ØProgram size becomes larger (code bloat)
Q1. What if M is not a multiple of N?
Q2. Or what if M is unknown at compile time?
Q3. Or what if it is a while loop?

Original:
for(i=0;i<j;i++)
    a[i]*=2;

Unrolled:
j1=j-j%4;
for(i=0;i<j1;i+=4)      // unroll up to the largest multiple of 4
{
    a[i]*=2;
    a[i+1]*=2;
    a[i+2]*=2;
    a[i+3]*=2;
}
for(i=j1;i<j;i++)       // the remainder is left for another for loop
    a[i]*=2;

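The same split into an unrolled main loop and a remainder loop can be written as a runnable Python sketch. The function name `double_all_unrolled` is made up for illustration; Python gains nothing from unrolling, so the point here is only the index arithmetic.

```python
def double_all_unrolled(a):
    """Double every element of `a` in place, with the loop unrolled 4 times."""
    j = len(a)
    j1 = j - j % 4              # largest multiple of 4 that is <= j
    i = 0
    while i < j1:               # main loop: four elements per iteration
        a[i] *= 2
        a[i + 1] *= 2
        a[i + 2] *= 2
        a[i + 3] *= 2
        i += 4
    for i in range(j1, j):      # remainder loop: at most three elements
        a[i] *= 2
    return a

print(double_all_unrolled(list(range(10))))
```

Because `j1` is computed at run time, the same code works when the trip count is unknown at compile time, at the cost of the extra remainder loop.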
Technique3: Function Inlining
ØGoal: sort of like “unrolling” a function
• Remove function call overhead
• CALL/RETN (and possible branch mispreds)
• Argument/ret-val passing, stack allocation, and associated spills/fills of caller/callee-save regs
• Larger block of instructions for scheduling
ØProblems: primarily code bloat

[Figure: Normal function: Start -> main function body { myfunc(); } -> call myfunc() { // body } -> return -> Stop. Inline function: Start -> main function body with the body of myfunc() in place of the call -> Stop.]

Technique4: Tree Height Reduction
ØGoal: shorten critical path(s) using associativity law
ØLimitations: not all math operations are associative!
ØC defines L-to-R semantics for most arithmetic

Before:                          After applying associativity:
R8=((R2+R3)+R4)+R5               R8=(R2+R3)+(R4+R5)
I1: ADD R6,R2,R3                 I1: ADD R6,R2,R3
I2: ADD R7,R6,R4                 I2: ADD R7,R4,R5
I3: ADD R8,R7,R5                 I3: ADD R8,R7,R6
Chain: I1 -> I2 -> I3            I1 and I2 are independent; both feed I3

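The benefit can be measured as the length of the longest dependence chain. Below is a minimal sketch assuming one cycle per ADD and unlimited adders; the tuple encoding `(dst, src1, src2)` and the name `critical_path` are assumptions made for this illustration.

```python
def critical_path(instrs):
    """Length of the longest chain of dependent instructions."""
    depth = {}                  # register -> depth of the chain producing it
    longest = 0
    for dst, src1, src2 in instrs:
        # Registers never written in this sequence count as depth 0 inputs.
        d = 1 + max(depth.get(src1, 0), depth.get(src2, 0))
        depth[dst] = d
        longest = max(longest, d)
    return longest

# R8 = ((R2+R3)+R4)+R5
left_to_right = [("R6", "R2", "R3"), ("R7", "R6", "R4"), ("R8", "R7", "R5")]
# R8 = (R2+R3)+(R4+R5)
balanced = [("R6", "R2", "R3"), ("R7", "R4", "R5"), ("R8", "R7", "R6")]

print(critical_path(left_to_right), critical_path(balanced))
```

Both sequences contain three ADDs, so the instruction count is unchanged; only the height of the dependence tree shrinks.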
Optimization Techniques for ILP
Execution Time = IC * CPI * CCT
IC: Instruction Count
CPI: Cycles per Instruction
CCT: Clock Cycle Time
We mainly focus on Cycles/Instruction!

Scheduling for Inst-Level Parallelism
ØGoal: Schedule instructions to finish whole program as soon as possible
ØTwo types of static ILP scheduling

Static ILP Scheduling
• Global Scheduling. Target: DAG (Directed Acyclic Graph) of a general purpose integer program with many conditional branches
• Software Pipelining. Target: loop in any code

Technique1: Global Scheduling
Q. What is the general size of a basic block?
A. In general, the basic block size of a non-numeric computation program is 5~20 instructions
=> Not enough instructions that can be processed in parallel
[REMIND] Basic blocks: a sequence of instructions w/ no branches into or out of the block

Q. Why global scheduling?
A. As a basic block is too small to find enough independent instructions, let’s schedule instructions across basic blocks (i.e., globally)
[Figure: CFG with the basic blocks {x:=a+b; y:=a*b}, {y>a}, and {a:=a+1; x:=a+b}]

Global Scheduling
• Trace-based scheduling (will be covered in this lecture)
• DAG-based scheduling

Trace-based Scheduling
Ø This is one technique for global scheduling
• Works on all code, not just loops
• Take an execution trace of the common case
• Schedule code as if it had no branches
• Check branch condition when convenient
• If mispredicted, clean up the mess

Q. How do we find the “common case”?
A. Program analysis or profiling

Example of Trace Scheduling

Original code:
a=log(x);
if(b>0.01)
{
    c=a/b;      // 90%
}else{
    c=0;        // 10%
}
y=sin(c);

Suppose profile says that b>0.01 90% of the time

Trace-scheduled code:
a=log(x);
c=a/b;          // 90%
y=sin(c);
if(b<=0.01)     // 10%
    goto fixit;

fixit:
c=0;
y=0; // sin(0)

Now, we have a larger basic block for the trace scheduling & optimizations

Pay Attention to Cost of Fixing
Ø[REMIND] Amdahl’s law

$$\text{System performance} = \frac{1}{(1 - P) + \frac{P}{S}}$$

P: Fraction of enhanced component
S: Speedup of enhanced component

• Assume the code for b > 0.01 accounts for 80% of the time
• Optimized trace runs 15% faster

$$\frac{1}{(1 - 0.8) + \frac{0.8}{1.15}} = 1.117 \quad (11.7\%)$$

• But, fix-up code may cause the remaining 20% of the time to be even slower!
• Assume fixup code is 30% slower

$$\frac{1}{(1 - 0.8) \times 1.3 + \frac{0.8}{1.15}} = 1.046 \quad (4.6\%)$$

Around 2/3 of the benefit removed

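Both numbers follow directly from Amdahl's law and can be checked with a short script. The function names `amdahl` and `amdahl_with_fixup` are made up for illustration; the second one simply scales the unenhanced fraction by the fix-up slowdown, as the example does.

```python
def amdahl(p, s):
    """Overall speedup when a fraction p of the time is sped up by s."""
    return 1.0 / ((1.0 - p) + p / s)

def amdahl_with_fixup(p, s, slowdown):
    """Same, but the remaining (1 - p) of the time becomes `slowdown` times slower."""
    return 1.0 / ((1.0 - p) * slowdown + p / s)

trace_only = amdahl(0.8, 1.15)                    # optimized trace, no penalty
with_fixup = amdahl_with_fixup(0.8, 1.15, 1.3)    # fix-up code is 30% slower
lost = 1.0 - (with_fixup - 1.0) / (trace_only - 1.0)
print(f"{trace_only:.3f} {with_fixup:.3f} lost benefit: {lost:.0%}")
```

The fraction of the benefit that is lost comes out at roughly 60%, which the slide rounds to around 2/3.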
Technique2: Software Pipelining
ØDesign: overlapping different iterations by starting the next iteration before the current iteration in the loop ends
Ø Consider a graphical view of the overlay of iterations:
[Figure: iterations 1 to 6, each consisting of Inst1 Inst2 Inst3 Inst4, start one step apart and overlap. The ramp-up region is the prolog, the fully overlapped (shaded) region is the kernel, and the ramp-down region is the epilog.]
Ø Only the shaded part, the loop kernel, involves executing the full width of the VLIW instruction.
– The loop prolog and epilog contain only a subset of the instructions.
• “ramp up” and “ramp down” of the parallelism.

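The prolog/kernel/epilog structure can be shown on a three-step loop body (load, add, store). This is an illustrative sketch in Python, where nothing actually runs in parallel; the function names are made up, and the point is only that each kernel iteration mixes steps from three different original iterations.

```python
def plain_loop(src, s):
    out = [0] * len(src)
    for i in range(len(src)):
        x = src[i]          # load(i)
        y = x + s           # add(i)
        out[i] = y          # store(i)
    return out

def software_pipelined(src, s):
    n = len(src)
    if n < 2:               # too short to overlap anything
        return plain_loop(src, s)
    out = [0] * n
    # Prolog: ramp up, only part of the steps are active.
    x = src[0]              # load(0)
    y = x + s               # add(0)
    x = src[1]              # load(1)
    # Kernel: store(i), add(i+1), load(i+2) belong to three different
    # iterations and are independent of each other within one pass.
    for i in range(n - 2):
        out[i] = y
        y = x + s
        x = src[i + 2]
    # Epilog: ramp down, drain the iterations still in flight.
    out[n - 2] = y
    y = x + s
    out[n - 1] = y
    return out

print(software_pipelined([1, 2, 3, 4, 5], 10))
```

On a VLIW machine the three statements in the kernel would occupy different slots of one long instruction, since none of them depends on a result produced in the same pass.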
Scheduling Loop Unrolled Code
Unroll 4 ways:
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      sd   f8, 24(r2)
      add  r2, 32
      bne  r1, r3, loop

Assumption: ld: 2 cycles, fadd: 3 cycles

Schedule (one row per cycle):
        Int1     Int 2    M1       M2       FP+       FPx
loop:                     ld f1
                          ld f2
                          ld f3
        add r1            ld f4             fadd f5
                                            fadd f6
                                            fadd f7
                                            fadd f8
                                   sd f5
                                   sd f6
                                   sd f7
        add r2   bne               sd f8

Software Pipelining
Unroll 4 ways first:
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      sd   f8, 24(r2)
      add  r2, 32
      bne  r1, r3, loop

Schedule (one row per cycle):
        Int1     Int 2    M1       M2       FP+       FPx
loop:                     ld f1
                          ld f2
                          ld f3
        add r1            ld f4
                          ld f1             fadd f5
                          ld f2             fadd f6
                          ld f3             fadd f7
        add r1            ld f4             fadd f8
                          ld f1    sd f5    fadd f5
                          ld f2    sd f6    fadd f6
        add r2            ld f3    sd f7    fadd f7
                 bne      ld f4    sd f8    fadd f8
                                   sd f5    fadd f5
                                   sd f6    fadd f6
        add r2                     sd f7    fadd f7
                 bne               sd f8    fadd f8
                                   sd f5

Loop Unrolling vs. Software Pipelining
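[Figure: performance over time across loop iterations. With loop unrolling, performance ramps up (startup overhead) and back down (wind-down overhead) in every iteration of the unrolled loop. With software pipelining, performance ramps up once at the start of the loop and ramps down once at the end.]
Software pipelining pays startup/wind-down costs only once per loop, not once per iteration
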
Recall: Why Compiler Might not Work
ØCan’t always schedule around branches
• limited access to dynamic information (profile-based info)
• Perhaps none at all, or not representative
• Ex. Branch T in 1st ½ of program, NT in 2nd ½, looks like 50-50 branch in profile

ØNot all stalls are predictable
• Cannot react to dynamic events like data cache misses

Dynamic Scheduling means Out-of-Order execution (OoO; O3)
Let’s think about the following questions

Q1) How can O3 achieve performance benefits?
Q2) Is everything okay? Any problems of O3?
Q3) How does O3 work?

Q1) How can O3 achieve perf. benefits?
ØHardware rearranges the instruction stream to reduce stalls

Instruction stream:
A: R1 = Load 16[R2]
B: R3 = R1 + R4
C: R6 = Load 8[R9]
D: R5 = R2 - 4
E: R7 = Load 20[R5]
F: R4 = R4 - 1
G: BEQ R4, #0

Dependency graph: A -> B, D -> E, F -> G; C is independent

[Figure: execution timelines of A to G for each design, each including a cache miss]
Design                 Execution time
1-wide In-Order        10 cycles
2-wide In-Order        8 cycles
1-wide Out-of-Order    7 cycles
2-wide Out-of-Order    5 cycles

Q2) Any problems of O3?
ØRecall: Hazards! Especially for register dependencies
        True dependency      Anti-dependency      Output dependency
        Read-After-Write     Write-After-Read     Write-After-Write
        A: R1 = R2 + R3      A: R1 = R3 / R4      A: R1 = R2 + R3
        B: R4 = R1 * R4      B: R3 = R2 * R4      B: R1 = R3 * R4

In-Order (A, then B); columns per cycle: initial, after A, after B
R1       5    7    7          5    3    3          5    7   27
R2      -2   -2   -2         -2   -2   -2         -2   -2   -2
R3       9    9    9          9    9   -6          9    9    9
R4       3    3   21          3    3    3          3    3    3

OoO (B, then A); columns per cycle: initial, after B, after A
R1       5    5    7          5    5   -2          5   27    7
R2      -2   -2   -2         -2   -2   -2         -2   -2   -2
R3       9    9    9          9   -6   -6          9    9    9
R4       3   15   15          3    3    3          3    3    3

Problem     Read old data    Read future data     Will be overwritten by legacy data
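The three hazard cases can be checked by simply executing A and B in both orders. The following is a minimal sketch, not part of the original slides; the helper `run` and the tuple encoding `(dst, op, src1, src2)` are hypothetical, and integer division is used because the slide values divide evenly.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.floordiv}
INIT = {"R1": 5, "R2": -2, "R3": 9, "R4": 3}

# Each instruction is (dst, op, src1, src2), taken from the slide.
RAW = {"A": ("R1", "+", "R2", "R3"), "B": ("R4", "*", "R1", "R4")}
WAR = {"A": ("R1", "/", "R3", "R4"), "B": ("R3", "*", "R2", "R4")}
WAW = {"A": ("R1", "+", "R2", "R3"), "B": ("R1", "*", "R3", "R4")}


def run(program, order, regs):
    """Execute the named instructions in the given order on a copy of regs."""
    regs = dict(regs)
    for name in order:
        dst, op, src1, src2 = program[name]
        regs[dst] = OPS[op](regs[src1], regs[src2])
    return regs


for label, program in [("RAW", RAW), ("WAR", WAR), ("WAW", WAW)]:
    print(label, "in-order:", run(program, "AB", INIT))
    print(label, "OoO:     ", run(program, "BA", INIT))
```

In every case the final register file differs between the two orders, which is exactly why a naive reordering is not allowed.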
Peking University
Solution: Register Renaming
ØSolution for WAR, WAW hazards; done in hardware
Ø Dependency is on name/location, not on data
Ø Renaming removes WAR/WAW, but leaves RAW intact

Q) How does the register renaming work?


A) Allocate a new location by noting it in the map-table

        MapTable        FreeList        Original insns.    Renamed insns.
        r1  r2  r3      p1,p2,p3,
Cycle   p1  p2  p3      p4,p5,p6,p7     add r1,r2,r3       add p1,p2,p3
  |     p1  p2  p4      p5,p6,p7        sub r3,r2,r1       sub p4,p2,p1
  |     p1  p2  p5      p6,p7           mul r3,r2,r3       mul p5,p2,p4
  v     p6  p2  p5      p7              div r1,r1,4        div p6,p1,4

Register renaming: the WAW and WAR dependencies are removed!
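The map-table bookkeeping can be sketched in a few lines. This is an illustrative sketch only, not the hardware algorithm from the lecture: the function `rename` and the tuple encoding `(op, dst, src1, src2)` are hypothetical. It starts from the mapping on the first table row (r1→p1, r2→p2, r3→p3, free list p4..p7) and renames the three instructions that follow the `add`.

```python
def rename(insns, map_table, free_list):
    """Rename (op, dst, src1, src2) tuples; returns renamed insns and final state."""
    map_table = dict(map_table)
    free = list(free_list)
    renamed = []
    for op, dst, src1, src2 in insns:
        # Sources are looked up BEFORE the destination is remapped, so a RAW
        # dependency keeps pointing at the producer's physical register.
        # Anything not in the map (e.g. the immediate "4") passes through.
        p_src1 = map_table.get(src1, src1)
        p_src2 = map_table.get(src2, src2)
        # Every destination gets a fresh physical register from the free list,
        # which is what removes WAR and WAW dependencies.
        p_dst = free.pop(0)
        map_table[dst] = p_dst
        renamed.append((op, p_dst, p_src1, p_src2))
    return renamed, map_table, free


insns = [
    ("sub", "r3", "r2", "r1"),
    ("mul", "r3", "r2", "r3"),
    ("div", "r1", "r1", "4"),
]
renamed, final_map, final_free = rename(
    insns, {"r1": "p1", "r2": "p2", "r3": "p3"}, ["p4", "p5", "p6", "p7"]
)
for insn in renamed:
    print(insn)
print(final_map, final_free)
```

The renamed stream and the final state (r1→p6, r2→p2, r3→p5, free list p7) agree with the last three rows of the table. A real renamer also returns physical registers to the free list when they are no longer needed, which this sketch leaves out.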

Peking University
Q3) How does the O3 work?
ØStep1: Fetch many instructions into an instruction window

[Figure: Static Program → Fetch → Dynamic Instruction Stream]

• Today’s machines: 100+ instructions scheduling window
• Use branch prediction to speculate past branches

Peking University
Q3) How does the O3 work?
ØStep2: Rename regs. to avoid false deps. (WAW and WAR)

[Figure: Static Program → Fetch → Dynamic Instruction Stream → Rename → Renamed Instruction Stream]

Peking University
Q3) How does the O3 work?
ØStep3: Execute instructions as soon as dependencies (registers and memory) are known

[Figure: Static Program → Fetch → Dynamic Instruction Stream → Rename → Renamed Instruction Stream → Schedule → Dynamically Scheduled Instructions]

Out-of-order = out of the original sequential order

Peking University
Dynamic Scheduling I: Scoreboard
ØLet’s track the flow of the instructions, registers, and functional units
Ø to check which datapath components are in use / can be used
Ø to find out which instructions can be executed without hazards

Peking University
The CDC 6600 Project [1964]
First implementation of Scoreboard
• 16 separate non-pipelined functional units (7 int, 4 FP, 5 memory)
• No register bypassing

Peking University
Dynamic Scheduling: The Big Picture
ØInstructions fetch/decoded/renamed into Instruction Buffer
ØInstructions (conceptually) check ready bits every cycle
Original insns.     Renamed insns.
add r4,r2,r3        Inst1: add p4,p2,p3
sub r3,r2,r1        Inst2: sub p5,p2,p4
mul r3,r2,r3        Inst3: mul p6,p2,p5
div r1,4,r1         Inst4: div p7,4,p4

Dependency graph: I1 → I2, I1 → I4, I2 → I3

[Figure: pipeline with BP, I$, insn buffer, regfile, D$]

Ready Table and issue order
Time    P2    P3    P4    P5    P6    P7     Issue order
 |      Yes   Yes                            I1
 |      Yes   Yes   Yes                      I2, I4
 |      Yes   Yes   Yes   Yes         Yes    I3
 v      Yes   Yes   Yes   Yes   Yes   Yes
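The ready-bit check can be sketched as a loop that, each cycle, issues every buffered instruction whose source registers are all marked ready and then marks its destination ready. This is a simplified illustration rather than the lecture's design: it assumes single-cycle execution and unlimited issue width, and the names `INSNS` and `schedule` are hypothetical. The immediate operand of `div` is omitted because it is always ready.

```python
# name -> (destination, source physical registers), from the renamed stream
INSNS = {
    "I1": ("p4", ["p2", "p3"]),
    "I2": ("p5", ["p2", "p4"]),
    "I3": ("p6", ["p2", "p5"]),
    "I4": ("p7", ["p4"]),
}


def schedule(insns, initially_ready):
    """Return the list of instruction groups issued in each cycle."""
    ready = set(initially_ready)
    pending = dict(insns)
    issue_order = []
    while pending:
        # All instructions inspect the ready table at the start of the cycle.
        group = [
            name for name, (_, srcs) in pending.items()
            if all(src in ready for src in srcs)
        ]
        if not group:
            raise ValueError("no instruction can issue: unresolved dependency")
        for name in group:
            dst, _ = pending.pop(name)
            ready.add(dst)          # result becomes visible next cycle
        issue_order.append(group)
    return issue_order


print(schedule(INSNS, ["p2", "p3"]))
```

With P2 and P3 initially ready, the groups come out as I1, then I2 and I4 together, then I3, matching the dependency graph.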
Peking University
Stage Splitting w/ Instruction Buffer
• The key idea of Scoreboard
  • Fetch: accumulate decoded instrs in the insn buffer, in-order
  • Issue & Read Operands: the buffer sends instrs down the rest of the pipeline out-of-order

[Figure: pipeline with BP, I$, insn buffer, regfile, D$; stages Ifetch, Decode, Execute, Mem, WB]

• Conventional 5 pipeline stages
  Ifetch → Decode → Execute → Memory → Writeback
Ø Let’s split the Decode stage into two sub-stages:
  Issue (S) & Read Operand (R)

Peking University
Scoreboard Pipeline
A new kind of structural hazard: the instruction buffer is full

[Figure: pipeline with BP, I$, insn buffer, regfile, D$]

Stage           What happens
Dispatch (D)    Allocate a slot in the instruction buffer and dispatch an instruction in-order
Issue (S)       Solve structural hazard & WAW! Check the FU and the destination reg
Read Op (R)     Wait until there are no data hazards, then read operands
Execute (Ex)    Multiple functional units
WB              Solve WAR! Stall until the earlier instruction reads the destination register
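The per-stage checks can be written down as three small predicates. The sketch below follows the classic textbook formulation of a scoreboard and is not taken from the slides: the names `FU`, `can_issue`, `issue`, `can_read_operands`, and `can_write_back` are hypothetical, and it assumes the ready flags `r1`/`r2` are cleared once an FU has read its operands, so that a set flag means "ready but not yet read".

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FU:
    busy: bool = False
    op: str = ""
    dst: str = ""
    src1: str = ""
    src2: str = ""
    q1: Optional[str] = None   # FU that will produce src1, if any
    q2: Optional[str] = None   # FU that will produce src2, if any
    r1: bool = False           # src1 ready and not yet read
    r2: bool = False           # src2 ready and not yet read


def can_issue(fu: str, dst: str, fus: Dict[str, FU], reg_status: Dict[str, str]) -> bool:
    # Structural hazard: the FU must be free. WAW: no pending writer of dst.
    return not fus[fu].busy and reg_status.get(dst) is None


def issue(fu, op, dst, src1, src2, fus, reg_status):
    q1, q2 = reg_status.get(src1), reg_status.get(src2)
    fus[fu] = FU(True, op, dst, src1, src2, q1, q2, q1 is None, q2 is None)
    reg_status[dst] = fu


def can_read_operands(fu: FU) -> bool:
    # RAW: wait until both sources have been produced.
    return fu.r1 and fu.r2


def can_write_back(fu: str, fus: Dict[str, FU]) -> bool:
    # WAR: stall while another busy FU still has to read our destination.
    me = fus[fu]
    for name, other in fus.items():
        if name == fu or not other.busy:
            continue
        if (other.src1 == me.dst and other.r1) or (other.src2 == me.dst and other.r2):
            return False
    return True


# Replay of two issue steps from the example: LD F2 45+R3, then mul F0 F2 F4.
fus = {name: FU() for name in ["Int", "Mult1", "Mult2", "Add", "Div"]}
reg_status: Dict[str, str] = {}
issue("Int", "LD", "F2", "", "R3", fus, reg_status)
second_load_ok = can_issue("Int", "F6", fus, reg_status)   # Int is busy
issue("Mult1", "MUL", "F0", "F2", "F4", fus, reg_status)
print(second_load_ok, fus["Mult1"].q1, can_read_operands(fus["Mult1"]))
```

The replay shows the structural stall on the Integer unit and the multiply waiting on F2 from `Int`. Updating the flags on read and on write-back is left out to keep the sketch short.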

Peking University
Scoreboard Architecture (CDC 6600)
[Figure: scoreboard datapath]
Registers: F0, F2, F4, F6, F8, F10
Functional units (latency in cycles): Integer (1), Mult1 (10), Mult2 (10), Add (2), Divide (40)
Inst $ (program order):
  LD  F6  34+ R2
  LD  F2  45+ R3
  mul F0  F2  F4
  sub F8  F6  F2
  div F10 F0  F6
  add F6  F8  F2

Scoreboard tables
  Insn Status:  Inst | ID1 | ID2 | EX | WB              (rows: LD#1, LD#2, MULTD, SUBD, DIVD, ADDD)
  FU Status:    FU | B | Op | dst | src1 | src2 | Q1 | Q2 | R1 | R2   (rows: Int, Mult1, Mult2, Add, Div)
  Reg Status:   Reg | FU                                (rows: F0, F2, F4, F6, F8, F10)
Peking University
Stage #1: Issue (ID1)
Ø At stage #1, the Scoreboard (SB) issues instructions in program order (and checks hazards)
Ø But the SB does not issue if there is a structural hazard!

Peking University
Stage #2: Read Operands (ID2)
Ø At stage #2, the SB reads operands when they are available (i.e., no data hazard)

Peking University
Stage #3: Execution (EX)
Ø Let’s assume each functional unit takes (m) CPU cycles, where m is specified in the figure
Ø NOTE) The number shown next to a functional unit is its remaining time
Ø At stage #3, the FU begins execution upon receiving operands
Ø At stage #3, when the result is ready, the FU informs the SB that it has completed the execution

Peking University
Stage #4: Write Back (WB)
Ø At stage #4, the FU stalls until there is no WAR hazard with any previously issued instruction
Ø If there is no WAR hazard, then the FU will update the register with the output of the instruction

Peking University
Three Parts of Scoreboard
Ø Again, the key behind the runtime out-of-order executions of Scoreboard is a bookkeeping technique
Ø Three main components
  • Instruction status
  • Functional unit status
  • Register result status

Peking University
Part #1: Instruction Status
• Which of 4 steps the instruction is in
  • ID1: Issue
  • ID2: Read operands
  • EX: Execute stage completion
  • WB: Write result to registers

Peking University
Part #2: Functional Unit Status
• Indicates the state of the functional unit (FU)
• 9 fields for each FU
  • B: Indicates whether the unit is busy or not
  • Op: Operation to perform in the unit (e.g., + or -)
  • dst: Destination register
  • src1, src2: Source-register numbers
  • Q1, Q2: Functional units producing source registers src1, src2
  • R1, R2: Flags being set when src1/src2 is ready

Peking University
Part #3: Register Result Status
• Indicates which functional unit will write each register, if one exists.
• Blank when no pending instructions will write that register

Peking University
Our Example: “Simple Scoreboard”
[Figure: scoreboard datapath and status tables in the initial state: no instruction has been issued, every FU (Int, Mult1, Mult2, Add, Div) has B = No, and Reg Status is empty]

Peking University
Scoreboard Example: Cycle 1
Ø Note that, as this is a pipelined architecture, multiple instructions can be handled in the same cycle.
Ø Let’s start from the first instruction, “Load”
Ø Check struct hazard of LD#1: the FU of “Load” (Int) is empty
Ø Issue LD #1
  • Insn Status shows in which cycle the operation occurred
  • FU Status shows in which FU the instruction is activated
  • Reg Status shows in which reg. the FU result will be updated

Insn Status   ID1  ID2  EX   WB
LD#1          1

FU Status   B    Op   dst  src1  src2  Q1     Q2     R1   R2
Int         Yes  LD   F6         R2                       Yes
(Mult1, Mult2, Add, Div: B = No)

Reg Status: F6 ← Int
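One way to picture the three tables is as plain dictionaries that the scoreboard updates at each step. The following sketch is only an illustration with hypothetical names (`insn_status`, `fu_status`, `reg_status`); it builds the empty tables and then applies the Cycle 1 update for LD#1 as recorded in the slides.

```python
INSNS = ["LD#1", "LD#2", "MULTD", "SUBD", "DIVD", "ADDD"]
FUS = ["Int", "Mult1", "Mult2", "Add", "Div"]
REGS = ["F0", "F2", "F4", "F6", "F8", "F10"]
FU_FIELDS = ["Op", "dst", "src1", "src2", "Q1", "Q2", "R1", "R2"]

# Part #1: the cycle in which each instruction passed each step (None = not yet).
insn_status = {i: {"ID1": None, "ID2": None, "EX": None, "WB": None} for i in INSNS}

# Part #2: nine fields per functional unit (B plus the eight fields above).
fu_status = {f: {"B": False, **{k: None for k in FU_FIELDS}} for f in FUS}

# Part #3: which FU will write each register (None = no pending writer).
reg_status = {r: None for r in REGS}

# Cycle 1: LD#1 (LD F6 34+R2) issues to the Integer unit.
insn_status["LD#1"]["ID1"] = 1
fu_status["Int"].update(B=True, Op="LD", dst="F6", src2="R2", R2=True)
reg_status["F6"] = "Int"

print(insn_status["LD#1"])
print(fu_status["Int"])
print({r: fu for r, fu in reg_status.items() if fu is not None})
```

Keeping the tables as separate structures mirrors the slides: the issue check looks at `fu_status` and `reg_status`, while `insn_status` is only a record of when each step happened.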

Peking University
Scoreboard Example: Cycle 2
Ø Check hazard of LD#1’s operands: there’s no RAW hazard
Ø Read operands of LD #1
Ø How about issuing the 2nd LD? Could we utilize the pipelined system?
  • No: the FU of “Load” (Int) is busy (structural hazard)

Insn Status   ID1  ID2  EX   WB
LD#1          1    2

FU Status   B    Op   dst  src1  src2  Q1     Q2     R1   R2
Int         Yes  LD   F6         R2                       Yes
(Mult1, Mult2, Add, Div: B = No)

Reg Status: F6 ← Int

Peking University
Scoreboard Example: Cycle 3
Ø Execute LD#1 (remaining time of the Integer unit: 1 → 0)
Ø LD#1 completes execution
Ø LD#2 is still stalled by the structural hazard on the Integer unit

Insn Status   ID1  ID2  EX   WB
LD#1          1    2    3

FU Status   B    Op   dst  src1  src2  Q1     Q2     R1   R2
Int         Yes  LD   F6         R2                       Yes
(Mult1, Mult2, Add, Div: B = No)

Reg Status: F6 ← Int

Peking University
Scoreboard Example: Cycle 4
Ø Writeback LD#1 (F6 now holds the result of LD#1)
Ø Free FU & Reg status of LD#1
Ø Integer unit free → No Hazard!

Insn Status   ID1  ID2  EX   WB
LD#1          1    2    3    4

FU Status: all FUs (Int, Mult1, Mult2, Add, Div) have B = No
Reg Status: empty

Peking University
Scoreboard Example: Cycle 5
Ø Issue LD #2

Insn Status   ID1  ID2  EX   WB
LD#1          1    2    3    4
LD#2          5

FU Status   B    Op   dst  src1  src2  Q1     Q2     R1   R2
Int         Yes  LD   F2         R3                       Yes
(Mult1, Mult2, Add, Div: B = No)

Reg Status: F2 ← Int

Peking University
Scoreboard Example: Cycle 6
Ø Check hazard of LD#2’s operands: there’s no RAW hazard
Ø Read operands of LD #2
Ø Check struct hazard of MULTD: the FU of “MULTD” (Mult1) is empty
Ø Issue MULTD

Insn Status   ID1  ID2  EX   WB
LD#1          1    2    3    4
LD#2          5    6
MULTD         6

FU Status   B    Op   dst  src1  src2  Q1     Q2     R1   R2
Int         Yes  LD   F2         R3                       Yes
Mult1       Yes  MUL  F0   F2    F4    Int           No   Yes
(Mult2, Add, Div: B = No)

Reg Status: F0 ← Mult1, F2 ← Int

Peking University
Scoreboard Example: Cycle 7
Ø Execute LD#2 (remaining time of the Integer unit: 1 → 0); LD#2 completes execution
Ø How about reading operands of MULTD? Could we utilize the pipelined system?
  • Check hazard of MULTD operands: MULTD can’t read its operands (F2) because LD #2 hasn’t finished
  • NOTE) There is a RAW hazard due to LD#2
Ø Check struct hazard of SUB: the FU of “SUB” (Add) is empty
Ø Issue SUB
  • NOTE) There is a RAW hazard due to LD#2 (F2)

Insn Status   ID1  ID2  EX   WB
LD#1          1    2    3    4
LD#2          5    6    7
MULTD         6
SUBD          7

FU Status   B    Op   dst  src1  src2  Q1     Q2     R1   R2
Int         Yes  LD   F2         R3                       Yes
Mult1       Yes  MUL  F0   F2    F4    Int           No   Yes
Add         Yes  SUB  F8   F6    F2           Int    Yes  No
(Mult2, Div: B = No)

Reg Status: F0 ← Mult1, F2 ← Int, F8 ← Add

Peking University
Scoreboard Example: Cycle 8a
First half of clock cycle
Ø Check struct hazard of DIVD: the FU of “DIVD” (Div) is empty
Ø Issue DIVD
  • NOTE) There is a RAW hazard due to Mult1 (F0)

Insn Status   ID1  ID2  EX   WB
LD#1          1    2    3    4
LD#2          5    6    7
MULTD         6
SUBD          7
DIVD          8

FU Status   B    Op   dst  src1  src2  Q1     Q2     R1   R2
Int         Yes  LD   F2         R3                       Yes
Mult1       Yes  MUL  F0   F2    F4    Int           No   Yes
Add         Yes  SUB  F8   F6    F2           Int    Yes  No
Div         Yes  DIV  F10  F0    F6    Mult1         No   Yes
(Mult2: B = No)

Reg Status: F0 ← Mult1, F2 ← Int, F8 ← Add, F10 ← Div

Peking University
Scoreboard Example: Cycle 8b
Second half of clock cycle
Ø Writeback LD#2 (F2 now holds the result of LD#2)
Ø Free FU & Reg status of LD#2
Ø Now, the F2 register is ready: the RAW hazards of MULTD (R1) and SUBD (R2) are resolved

Insn Status   ID1  ID2  EX   WB
LD#1          1    2    3    4
LD#2          5    6    7    8
MULTD         6
SUBD          7
DIVD          8

FU Status   B    Op   dst  src1  src2  Q1     Q2     R1   R2
Mult1       Yes  MUL  F0   F2    F4                  Yes  Yes
Add         Yes  SUB  F8   F6    F2                  Yes  Yes
Div         Yes  DIV  F10  F0    F6    Mult1         No   Yes
(Int, Mult2: B = No)

Reg Status: F0 ← Mult1, F8 ← Add, F10 ← Div

Peking University
Scoreboard Example: Cycle 9
Ø Check hazard of MULTD operands: there’s no RAW hazard
Ø Read operands of MULTD (remaining time of Mult1: 10)
Ø Check hazard of SUB’s operands: there’s no RAW hazard
Ø Read operands of SUBD (remaining time of Add: 2)
Ø Could we utilize the pipelined system?

Insn Status   ID1  ID2  EX   WB
LD#1          1    2    3    4
LD#2          5    6    7    8
MULTD         6    9
SUBD          7    9
DIVD          8

FU Status   B    Op   dst  src1  src2  Q1     Q2     R1   R2
Mult1       Yes  MUL  F0   F2    F4                  Yes  Yes
Add         Yes  SUB  F8   F6    F2                  Yes  Yes
Div         Yes  DIV  F10  F0    F6    Mult1         No   Yes
(Int, Mult2: B = No)

Reg Status: F0 ← Mult1, F8 ← Add, F10 ← Div
Peking University
Scoreboard Example: Cycle 9
DIVD cannot be
F0 Integer (1)
issued ∵ RAW
F2 LD#2 Inst $ hazard
10
Mult1 (10) mul F0 F2 F4
F4
Registers

Mult2 (10)
F6 LD#1
2
F8 Add (2) sub F8 F6 F2
RAW
F10 add F6 F8 F2
Divide (40) div F10 F0 F6

Insn Status FU Status Reg Status


Inst ID1 ID EX W FU B Op dst src1 src2 Q1 Q2 R1 R2 FU
2 B Int No F0 Mult1
LD#1 1 2 3 4
Mult Yes MUL F0 F2 F4 Yes Yes F2
LD#2 5 6 7 8
1 F4
MULTD 6 9 Mult No F6
SUBD 7 9 2 F8 Add
9
DIVD 8
Scoreboard
9 Add Yes SUB F8 F6 F2 Yes Yes F10 Div
ADDD Div Yes DIV F10 F0 F6 Mul1 No Yes

Peking University
Scoreboard Example: Cycle 9
DIVD cannot be
F0 Integer (1)
issued ∵ RAW
F2 LD#2 Inst $ hazard
10
Mult1 (10) mul F0 F2 F4
F4
Registers

Mult2 (10)
F6 LD#1 Struct
2
F8 Add (2) sub F8 F6 F2

F10 add F6 F8 F2
Divide (40) div F10 F0 F6
… FU of “ADDD” is busy
(Structural hazard)
Insn Status FU Status Reg Status
Inst ID1 ID EX W FU B Op dst src1 src2 Q1 Q2 R1 R2 FU
2 B Int No F0 Mult1
LD#1 1 2 3 4
Mult Yes MUL F0 F2 F4 Yes Yes F2
LD#2 5 6 7 8
1 F4
MULTD 6 9 Mult No F6
SUBD 7 9 2 F8 Add
DIVD 8
Scoreboard
9 Add Yes SUB F8 F6 F2 Yes Yes F10 Div
ADDD Div Yes DIV F10 F0 F6 Mul1 No Yes

Peking University
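The issue-stage check that stalls ADDD in cycle 9 can be sketched as below. This is a minimal illustration, not the deck's own code: the names `FuEntry` and `issue_check` are hypothetical, and the state is transcribed from the Cycle 9 tables. It assumes the issue stage tests only for a busy FU (structural hazard) and a pending write to the same destination (WAW hazard).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FuEntry:
    """One row of the FU Status table (hypothetical layout)."""
    busy: bool = False
    op: Optional[str] = None
    dst: Optional[str] = None


def issue_check(fu_status, reg_status, fu, dst):
    """Return None if the instruction may issue, else the blocking hazard."""
    if fu_status[fu].busy:
        return "structural"   # the required functional unit is occupied
    if reg_status.get(dst) is not None:
        return "WAW"          # an in-flight instruction will still write dst
    return None


# Scoreboard state in cycle 9, transcribed from the slide.
fu_status = {
    "Int": FuEntry(),
    "Mult1": FuEntry(True, "MUL", "F0"),
    "Add": FuEntry(True, "SUB", "F8"),
    "Div": FuEntry(True, "DIV", "F10"),
}
reg_status = {"F0": "Mult1", "F8": "Add", "F10": "Div"}

# ADDD (add F6 F8 F2) needs the Add unit, which SUBD still occupies.
print(issue_check(fu_status, reg_status, "Add", "F6"))  # prints: structural
```

Once SUBD writes back and releases the Add unit, the same call returns None and ADDD can issue.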
Scoreboard Example: Cycle 10
ØCalculating: Mult1 counter 10 → 9, Add counter 2 → 1
ØADDD is still stalled by the structural hazard on the Add unit
Register values: F2 from LD#2, F6 from LD#1
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 | SUBD 7 9 | DIVD 8 | ADDD not issued
FU Status (busy): Mult1 MUL F0 F2 F4 | Add SUB F8 F6 F2 | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F8 Add | F10 Div
Scoreboard Example: Cycle 11
ØCalculating: Mult1 counter 9 → 8, Add counter 1 → 0
ØSUBD completes execution (EX = 11)
ØADDD is still stalled by the structural hazard on the Add unit
Register values: F2 from LD#2, F6 from LD#1
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 | SUBD 7 9 11 | DIVD 8 | ADDD not issued
FU Status (busy): Mult1 MUL F0 F2 F4 | Add SUB F8 F6 F2 | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F8 Add | F10 Div
Scoreboard Example: Cycle 12
ØCalculating: Mult1 counter 8 → 7
ØWrite back SUBD (W = 12), then free the FU and Reg status of SUBD
ØThe Add unit is free now, so ADDD no longer sees a structural hazard
Register values: F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 | SUBD 7 9 11 12 | DIVD 8 | ADDD not issued
FU Status (busy): Mult1 MUL F0 F2 F4 | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F10 Div
Scoreboard Example: Cycle 13
ØCalculating: Mult1 counter 7 → 6
ØIssue ADDD (ID1 = 13) to the Add unit
Register values: F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 | SUBD 7 9 11 12 | DIVD 8 | ADDD 13
FU Status (busy): Mult1 MUL F0 F2 F4 | Add ADD F6 F8 F2 (R1 = R2 = Yes) | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F6 Add | F10 Div
Scoreboard Example: Cycle 14
ØCalculating: Mult1 counter 6 → 5
ØRead the operands of ADDD (ID2 = 14, Add counter = 2)
Register values: F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 | SUBD 7 9 11 12 | DIVD 8 | ADDD 13 14
FU Status (busy): Mult1 MUL F0 F2 F4 | Add ADD F6 F8 F2 | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F6 Add | F10 Div
Scoreboard Example: Cycle 15
ØCalculating: Mult1 counter 5 → 4, Add counter 2 → 1
Register values: F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 | SUBD 7 9 11 12 | DIVD 8 | ADDD 13 14
FU Status (busy): Mult1 MUL F0 F2 F4 | Add ADD F6 F8 F2 | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F6 Add | F10 Div
Scoreboard Example: Cycle 16
ØCalculating: Mult1 counter 4 → 3, Add counter 1 → 0
ØADDD completes execution (EX = 16)
Register values: F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 | SUBD 7 9 11 12 | DIVD 8 | ADDD 13 14 16
FU Status (busy): Mult1 MUL F0 F2 F4 | Add ADD F6 F8 F2 | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F6 Add | F10 Div
Scoreboard Example: Cycle 17
ØCalculating: Mult1 counter 3 → 2
ØOops! ADDD can't write its result because of DIVD: WAR hazard on F6
• DIVD has not read F6 yet, so ADDD must not overwrite it
Register values: F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 | SUBD 7 9 11 12 | DIVD 8 | ADDD 13 14 16
FU Status (busy): Mult1 MUL F0 F2 F4 | Add ADD F6 F8 F2 | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F6 Add | F10 Div
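The write-back check that holds ADDD in cycle 17 can be sketched as follows. The function name `can_write_back` and the dictionary layout are hypothetical and used only for illustration. The assumed rule is the one the example follows: a result may be written only after every older instruction that reads the destination register has read its operands.

```python
def can_write_back(dst, older_insns):
    """WAR check: all older readers of dst must already have read it."""
    return all(
        insn["read_cycle"] is not None
        for insn in older_insns
        if dst in insn["srcs"]
    )


# Instructions older than ADDD, with the cycle in which each one read its
# operands (None = not read yet). State transcribed from cycle 17.
older_than_addd = [
    {"name": "LD#1", "srcs": ("R2",), "read_cycle": 2},
    {"name": "LD#2", "srcs": ("R3",), "read_cycle": 6},
    {"name": "MULTD", "srcs": ("F2", "F4"), "read_cycle": 9},
    {"name": "SUBD", "srcs": ("F6", "F2"), "read_cycle": 9},
    {"name": "DIVD", "srcs": ("F0", "F6"), "read_cycle": None},
]

# ADDD writes F6, but DIVD has not read F6 yet.
print(can_write_back("F6", older_than_addd))  # prints: False
```

After DIVD reads its operands in cycle 21, the same check succeeds and ADDD writes back in cycle 22.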
Scoreboard Example: Cycle 18
ØCalculating: Mult1 counter 2 → 1
ØADDD still waits because of the WAR hazard on F6
Register values: F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 | SUBD 7 9 11 12 | DIVD 8 | ADDD 13 14 16
FU Status (busy): Mult1 MUL F0 F2 F4 | Add ADD F6 F8 F2 | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F6 Add | F10 Div
Scoreboard Example: Cycle 19
ØCalculating: Mult1 counter 1 → 0
ØMULTD completes execution (EX = 19)
ØADDD still waits because of the WAR hazard on F6
Register values: F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 19 | SUBD 7 9 11 12 | DIVD 8 | ADDD 13 14 16
FU Status (busy): Mult1 MUL F0 F2 F4 | Add ADD F6 F8 F2 | Div DIV F10 F0 F6 (Q1 = Mult1, R1 = No, R2 = Yes)
Reg Status: F0 Mult1 | F6 Add | F10 Div
Scoreboard Example: Cycle 20
ØWrite back MULTD (W = 20), then free the FU and Reg status of MULTD
ØNow the F0 register is ready, which clears the RAW hazard of DIVD (Div R1: No → Yes)
ØADDD still waits because of the WAR hazard on F6
Register values: F0 from MULTD, F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 19 20 | SUBD 7 9 11 12 | DIVD 8 | ADDD 13 14 16
FU Status (busy): Add ADD F6 F8 F2 | Div DIV F10 F0 F6 (R1 = R2 = Yes)
Reg Status: F6 Add | F10 Div
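How the write-back of MULTD wakes up DIVD can be sketched with the Q1/Q2 fields of the FU Status table. This is a simplified illustration with hypothetical names (`FuEntry`, `write_back`). It assumes a Q field names the functional unit that will produce an operand, and that None means the operand is ready.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FuEntry:
    op: str
    dst: str
    src1: str
    src2: str
    q1: Optional[str] = None  # FU that will produce src1 (None = ready)
    q2: Optional[str] = None  # FU that will produce src2 (None = ready)

    def ready(self):
        """Operands can be read only when no producer is pending."""
        return self.q1 is None and self.q2 is None


def write_back(entries, producer):
    """Free the producer's entry and clear every Q field waiting on it."""
    del entries[producer]
    for entry in entries.values():
        if entry.q1 == producer:
            entry.q1 = None
        if entry.q2 == producer:
            entry.q2 = None


# Busy functional units just before MULTD writes back in cycle 20.
entries = {
    "Mult1": FuEntry("MUL", "F0", "F2", "F4"),
    "Add": FuEntry("ADD", "F6", "F8", "F2"),
    "Div": FuEntry("DIV", "F10", "F0", "F6", q1="Mult1"),
}

print(entries["Div"].ready())  # prints: False (RAW hazard on F0)
write_back(entries, "Mult1")
print(entries["Div"].ready())  # prints: True (DIVD can read in cycle 21)
```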
Scoreboard Example: Cycle 21
ØRead the operands of DIVD (ID2 = 21, Divide counter = 40)
ØThe WAR hazard is removed: DIVD has read F6, so ADDD may now write it
Register values: F0 from MULTD, F2 from LD#2, F6 from LD#1, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 19 20 | SUBD 7 9 11 12 | DIVD 8 21 | ADDD 13 14 16
FU Status (busy): Add ADD F6 F8 F2 | Div DIV F10 F0 F6 (R1 = R2 = Yes)
Reg Status: F6 Add | F10 Div
Scoreboard Example: Cycle 22
ØCalculating: Divide counter 40 → 39
ØWrite back ADDD (W = 22), then free the FU and Reg status of ADDD
Register values: F0 from MULTD, F2 from LD#2, F6 from ADDD, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 19 20 | SUBD 7 9 11 12 | DIVD 8 21 | ADDD 13 14 16 22
FU Status (busy): Div DIV F10 F0 F6 (R1 = R2 = Yes)
Reg Status: F10 Div
Faster than light computation
(skip a couple of cycles)

Scoreboard Example: Cycle 61
ØCalculating: Divide counter 1 → 0
ØDIVD completes execution (EX = 61)
Register values: F0 from MULTD, F2 from LD#2, F6 from ADDD, F8 from SUBD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 19 20 | SUBD 7 9 11 12 | DIVD 8 21 61 | ADDD 13 14 16 22
FU Status (busy): Div DIV F10 F0 F6 (R1 = R2 = Yes)
Reg Status: F10 Div
Scoreboard Example: Cycle 62
ØDONE! DIVD writes back (W = 62) and the Divide unit is freed
Register values: F0 from MULTD, F2 from LD#2, F6 from ADDD, F8 from SUBD, F10 from DIVD
Insn Status (ID1 ID2 EX W): LD#1 1 2 3 4 | LD#2 5 6 7 8 | MULTD 6 9 19 20 | SUBD 7 9 11 12 | DIVD 8 21 61 62 | ADDD 13 14 16 22
FU Status (busy): none
Review of Cycle 62
ØIn-order issue
ØOut-of-order execution & out-of-order commit
Functional units (latency): Integer (1), Mult1 (10), Mult2 (10), Add (2), Divide (40)
Register values: F0 from MULTD, F2 from LD#2, F6 from ADDD, F8 from SUBD, F10 from DIVD
Insn Status
Inst    ID1  ID2  EX   W
LD#1    1    2    3    4
LD#2    5    6    7    8
MULTD   6    9    19   20
SUBD    7    9    11   12
DIVD    8    21   61   62
ADDD    13   14   16   22
FU Status: no unit is busy
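The whole walkthrough can be replayed with a small cycle-by-cycle model. This is a sketch under stated assumptions rather than the scoreboard hardware itself. It assumes one unit per FU type, at most one issue per cycle, that a result written in cycle c becomes visible in cycle c + 1, and that execution completes `latency` cycles after the operands are read. The names `Insn`, `before` and `simulate` are hypothetical. Under these assumptions it reproduces the Insn Status table above.

```python
from dataclasses import dataclass

# Execution latencies in cycles, taken from the FU labels on the slides.
LATENCY = {"Int": 1, "Mult": 10, "Add": 2, "Div": 40}


@dataclass
class Insn:
    name: str
    fu: str
    dst: str
    srcs: tuple
    issue: int = 0  # a stamp of 0 means "has not happened yet"
    read: int = 0
    done: int = 0
    write: int = 0


def before(stamp, cycle):
    """True if the event with this stamp happened in an earlier cycle."""
    return 0 < stamp < cycle


def simulate(program):
    cycle = 0
    while any(insn.write == 0 for insn in program):
        cycle += 1
        for idx, insn in enumerate(program):
            older = program[:idx]
            if insn.issue == 0:
                # In-order issue; stall on a busy FU (structural) or WAW.
                in_order = all(before(o.issue, cycle) for o in older)
                no_conflict = all(before(o.write, cycle) for o in older
                                  if o.fu == insn.fu or o.dst == insn.dst)
                if in_order and no_conflict:
                    insn.issue = cycle
            elif insn.read == 0:
                # RAW: wait until every older producer has written back.
                if all(before(o.write, cycle) for o in older
                       if o.dst in insn.srcs):
                    insn.read = cycle
                    insn.done = cycle + LATENCY[insn.fu]
            elif insn.write == 0:
                # WAR: wait until every older reader of dst has read it.
                no_war = all(before(o.read, cycle) for o in older
                             if insn.dst in o.srcs)
                if cycle > insn.done and no_war:
                    insn.write = cycle
    return program


program = simulate([
    Insn("LD#1", "Int", "F6", ("R2",)),
    Insn("LD#2", "Int", "F2", ("R3",)),
    Insn("MULTD", "Mult", "F0", ("F2", "F4")),
    Insn("SUBD", "Add", "F8", ("F6", "F2")),
    Insn("DIVD", "Div", "F10", ("F0", "F6")),
    Insn("ADDD", "Add", "F6", ("F8", "F2")),
])
for insn in program:
    print(insn.name, insn.issue, insn.read, insn.done, insn.write)
```

Changing a latency or reordering the instructions shows how the structural, RAW and WAR stalls move; for example, dropping DIVD lets ADDD write back right after it completes.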
Scoreboard Summary
ØThe good
• + Cheap hardware
• * InsnStatus + FuStatus + RegStatus ~ 1 FP unit in area
• + Pretty good performance
• * 1.7X for FORTRAN (scientific array) programs
ØThe less good
• - No bypassing
• * Is this a fundamental problem?
• - Limited scheduling scope
• * Structural/WAW hazards delay dispatch
• - Slow issue of truly-dependent (RAW) instructions
• * WAR hazards delay writeback
• - Fix with hardware register renaming
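Why register renaming removes the WAR stall seen in cycles 17 to 21 can be sketched as below. This is only an illustration of the idea, not a description of any particular processor. The function `rename` and the physical register names P0, P1, ... are hypothetical, the free list is assumed to be unbounded, and the load address registers are treated as ordinary sources.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh physical register."""
    fresh = (f"P{n}" for n in count())
    mapping = {}  # architectural register -> current physical register
    renamed = []
    for op, dst, *srcs in program:
        new_srcs = []
        for src in srcs:
            # Sources use the latest mapping; unseen registers get one.
            if src not in mapping:
                mapping[src] = next(fresh)
            new_srcs.append(mapping[src])
        mapping[dst] = next(fresh)  # allocated after the sources are read
        renamed.append((op, mapping[dst], *new_srcs))
    return renamed


program = [
    ("LD", "F6", "R2"),
    ("LD", "F2", "R3"),
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]
for insn in rename(program):
    print(insn)
```

In the renamed program DIVD reads P1 (the F6 loaded by LD#1) while ADDD writes P8, so ADDD no longer has to wait for DIVD to read its operands before writing back.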
Backup

Note that O3 means Out-of-Order Completion
ØIn-order issue: instructions are issued in order!
• In-order instruction stream → execution begins in order → out-of-order completion
ØIssue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard
