Advanced Computer Architecture
KIIT UNIVERSITY
BHUBANESWAR
Module Objectives
Introduction
Basic block:
Introduction
cont
An n-stage pipeline can improve performance by up to n times.
It does not require much investment in hardware.
It is transparent to programmers.
It decreases the average CPI and/or the clock cycle time for instructions.
Pipeline Hazards
A situation that prevents an instruction from executing during its designated clock cycles.
Structural Hazards
Same resource is required by two (or more) concurrently executing instructions at the same time.
[Figure: pipeline timing diagram for Instruction 1 to Instruction 4 over time; each instruction passes through Mem, Reg, ALU, DM and Reg.]
How is it Resolved?
[Figure: pipeline timing diagram over time, starting with a Load; a row of bubbles (a stall) is inserted before the last instruction.]
Stalls result in deviation from 1 instruction executing per clock cycle. Let's examine by how much stalls can impact CPI.
Ignoring overhead and assuming stages are balanced:
$$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}$$
$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
Assume:
Pipelined processor.
Data references constitute 40% of an instruction mix. Ideal CPI of the pipelined machine is 1.
Unified data and instruction cache vs. separate data and instruction cache.
An Example
cont
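The effect of the stated assumptions on CPI can be sketched numerically. This is a minimal sketch, not the worked example from the slides: it assumes that with a unified cache every data reference costs one stall cycle (a structural hazard with instruction fetch), that separate caches cause no such stall, and that both machines run at the same clock rate. The names `effective_cpi` and `STALL_PER_DATA_REF` are illustrative.

```python
def effective_cpi(ideal_cpi, stall_cycles_per_instruction):
    """CPI of a pipelined machine once stall cycles are included."""
    return ideal_cpi + stall_cycles_per_instruction

IDEAL_CPI = 1.0            # from the example: ideal CPI is 1
DATA_REF_FRACTION = 0.40   # from the example: 40% data references
STALL_PER_DATA_REF = 1     # assumption: one stall cycle per data reference

# Unified cache: each data reference conflicts with an instruction fetch.
unified_cpi = effective_cpi(IDEAL_CPI, DATA_REF_FRACTION * STALL_PER_DATA_REF)
# Separate caches: no structural hazard, so no stall cycles (assumption).
separate_cpi = effective_cpi(IDEAL_CPI, 0.0)
slowdown = unified_cpi / separate_cpi

print(f"unified cache CPI:  {unified_cpi:.2f}")
print(f"separate cache CPI: {separate_cpi:.2f}")
print(f"slowdown factor:    {slowdown:.2f}")
```

If the unified-cache machine had a faster clock, the comparison would also need the clock cycle ratio from the speedup formula.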
Data Hazards
Example:
A=B+C;
D=A+E;
Read After Write (RAW)
Write After Read (WAR)
Write After Write (WAW)
RAW hazard:
A hazard between two instructions i and j may occur when j attempts to read a data object that is modified by i.
If j reads its operand before i writes it, j would incorrectly receive the old value.
Example:
i: ADD R1, R2, R3
j: SUB R4, R1, R6
[Figure: RAW hazard; the range R(I) written by instruction i overlaps the domain D(J) read by instruction j.]
Program (a):
i1: load r1, addr;
i2: add r2, r1, r1;
Program (b):
i1: mul r1, r4, r5;
i2: add r2, r1, r1;
In both cases, i2 does not get its operand until i1 has completed writing the result.
In (a) this is due to a load-use dependency.
In (b) this is due to a define-use dependency.
WAR hazard:
A hazard may occur when j attempts to modify (write) a data object that is going to be read by i.
Instruction j tries to write its destination operand before instruction i reads it.
i would incorrectly receive the new value.
Example:
i: ADD R1, R2, R3
j: SUB R2, R4, R6
[Figure: WAR hazard; the range R(J) written by instruction j overlaps the domain D(I) read by instruction i.]
WAW hazard:
Both i and j want to modify the same data object.
Instruction j tries to write an operand before instruction i writes it.
The writes are performed in the wrong order.
[Figure: WAW hazard; the range R(I) written by instruction i overlaps the range R(J) written by instruction j.]
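The three hazard definitions can be expressed as simple register-set checks. The sketch below is illustrative only: the `Instr` tuple and the `hazards` function are hypothetical names, and the check reports potential hazards (dependences) between two instructions; whether they cause a real hazard depends on the pipeline.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    op: str
    dest: str                 # register written
    srcs: Tuple[str, ...]     # registers read

def hazards(i: Instr, j: Instr) -> Set[str]:
    """Potential hazards when i precedes j in program order."""
    found = set()
    if i.dest in j.srcs:      # j reads what i writes
        found.add("RAW")
    if j.dest in i.srcs:      # j writes what i reads
        found.add("WAR")
    if i.dest == j.dest:      # both write the same register
        found.add("WAW")
    return found

add = Instr("ADD", "R1", ("R2", "R3"))
print(hazards(add, Instr("SUB", "R4", ("R1", "R6"))))  # RAW example
print(hazards(add, Instr("SUB", "R2", ("R4", "R6"))))  # WAR example
```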
Inter-Instruction Dependences
Data dependence (Read-after-Write, RAW):
r3 ← r1 op r2
r5 ← r3 op r4
Anti-dependence (Write-after-Read, WAR):
r3 ← r1 op r2
r1 ← r4 op r5
Output dependence (Write-after-Write, WAW):
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
Control dependence
False Dependency
Load-Use dependency
Define-Use dependency
True dependency
Cannot be eliminated
False dependency
Can be eliminated by register renaming
Operand forwarding
By S/W (NOP)
Reordering the instruction
Pipelining changes the order of read/write accesses to operands. Order differs from that of an unpipelined machine.
Example:
ADD R1, R2, R3
SUB R4, R1, R5
For MIPS, ADD writes the register in WB but SUB needs it in ID.
[Figure: pipeline timing diagram of the ADD instruction followed by dependent instructions.]
The ADD instruction causes a hazard in the next 3 instructions because the register is not written until after those 3 instructions read it.
Forwarding
Can we move the result from the EX/MEM register to the beginning of the ALU (where SUB needs it)?
Yes!
Generally speaking, forwarding:
occurs when a result is passed directly to the functional unit that requires it.
goes from the output of one pipeline stage to the input of another.
Forwarding Technique
[Figure: Forwarding Path]
The SUB instruction gets its operand from the EX/MEM pipeline register, the AND instruction gets its operand from the MEM/WB pipeline register, and the OR instruction gets its operand by forwarding through the register file.
[Figure: pipeline timing diagram over time for the dependent instructions, ending with OR R8, R1, R9, with forwarding lines drawn between stages.]
If a line goes forward in time, forwarding is possible. If it is drawn backward, forwarding is physically impossible.
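The choice of forwarding source can be sketched as a small selection function. This is a simplified sketch of the usual forwarding-unit logic for a 5-stage MIPS-style pipeline, not a description of any specific implementation: `forward_source` and its parameters are hypothetical names, destinations are register numbers (or `None` when the instruction writes no register), and register 0 is assumed to be hardwired to zero.

```python
def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Pick where the ALU operand for src_reg should come from.

    ex_mem_dest: destination of the instruction one ahead (in MEM).
    mem_wb_dest: destination of the instruction two ahead (in WB).
    """
    if src_reg != 0 and ex_mem_dest == src_reg:
        return "EX/MEM"      # most recent result wins
    if src_reg != 0 and mem_wb_dest == src_reg:
        return "MEM/WB"
    return "REGFILE"         # value is already in (or passing through) the register file

# ADD R1,R2,R3 followed by three instructions that read R1:
print(forward_source(1, 1, None))   # next instruction (SUB)
print(forward_source(1, 4, 1))      # second instruction after ADD (AND)
print(forward_source(1, 9, 4))      # third instruction after ADD (OR)
```

Checking EX/MEM before MEM/WB ensures that, when both stages hold a result for the same register, the younger value is used.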
The compiler introduces NOPs in between the two dependent instructions. A NOP is an instruction that does no work and only keeps a gap between two instructions.
Instruction Reordering
Before:
ADD R1, R2, R3
SUB R4, R1, R5
XOR R8, R6, R7
AND R9, R10, R11
After:
ADD R1, R2, R3
XOR R8, R6, R7
AND R9, R10, R11
SUB R4, R1, R5
Control Hazards
Result from branch and other instructions that change the flow of a program (i.e. change PC).
Example:
1: If(cond){
2: 3: s2 s1}
Statement in line 2 is control dependent on statement at line 1. Until condition evaluation completes:
5: }
Flush Pipeline:
Redo the instructions following a branch, once an instruction is detected to be branch during the ID stage.
Overall CPI=
Two approaches:
1) Move condition comparator to ID stage:
Decide branch outcome and target address in the ID stage itself:
2)Branch prediction
43
Stall until the branch direction is clear, flushing the pipe.
Predict not taken: execute successor instructions in sequence as if there is no branch, and undo the instructions in the pipeline if the branch is actually taken.
Predict taken: MIPS still incurs a 1 cycle branch penalty even with predict taken. On other machines, where the branch target is known before the branch outcome is computed, significant benefits can accrue.
length n
46
Delayed Branch
Simple idea: Put an instruction that would be executed anyway right after a branch.
[Figure: pipeline timing (IF, ID, EX, MEM, WB) of a branch, its delay slot instruction, and the branch target or successor.]
Delayed Branch
Example:
DADD R1, R2, R3
if R2 == 0 then
    [delay slot]
. . .
becomes
if R2 == 0 then
    DADD R1, R2, R3
Delayed Branch
[Figure: pipeline stages IF, ID and EX of the branch and of the instructions that follow it.]
By this time, we know whether or not to take the branch.
Delayed Branch
Example:
DSUB R4, R5, R6
...
DADD R1, R2, R3
if R1 == 0 then
    [delay slot]
The DSUB instruction can be replicated into the delay slot, and the branch target can be changed
Delayed Branch
Yet another possibility: an instruction from inside the taken path.
Example:
DADD R1, R2, R3
if R1 == 0 then
    [delay slot]
OR R7, R8, R9
DSUB R4, R5, R6
The OR instruction can be moved into the delay slot ONLY IF its execution doesn't disrupt the program execution (e.g., R7 is overwritten later).
B1:
    LD    R1,0(R2)
    DSUBU R1,R1,R3
    BEQZ  R1,L
B2 (R1 != 0):
    OR    R4,R5,R6
    DADDU R10,R4,R3
B3 (R1 == 0):
L:  DADDU R7,R8,R9
1.) BEQZ is dependent on DSUBU, and DSUBU on LD.
2.) If we knew that the branch was taken with a high probability, then DADDU could be moved into block B1, since it doesn't have any dependences with block B2.
3.) Conversely, knowing the branch was not taken, OR could be moved into block B1, since it doesn't affect anything in B3.
Delayed Branch
[Figure: places relative to the branch instruction from which the delay slot can be filled.]
Delayed Branch
cont
Delayed Branch downside: what if multiple instructions issued per clock cycle (superscalar)?
About 60% of branch delay slots are filled.
About 80% of instructions executed in branch delay slots are useful in computation.
So about 60% x 80% = 48%, i.e. roughly 50%, of slots are usefully filled.
Pipeline speed up
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
Hazards can be caused by dependences within a program. There are three main types of dependences in a program:
Data dependence
Name dependence
Control dependence
Data Dependences
Two addresses may refer to the same memory location but look different, e.g. 100(R4) and 20(R6).
Name dependence: two instructions use the same register or memory location (called a name), but there is no true flow of data between the two instructions.
Example:
A=B+C;
A=P+Q;
Anti-Dependence or (WAR)
Original ordering must be preserved to ensure that i reads the correct value.
ADD F0,F6,F8
SUB F8,F4,F5
Example:
65
Exercise
66
Hazard Resolution
Name dependences:
Once identified, they can be easily eliminated through simple compiler renaming techniques.
Memory-related dependences are difficult to identify:
Hardware techniques (scoreboarding and dynamic instruction scheduling) are being used.
More difficult to handle. Can not be eliminated; can only be overcome! Many techniques have evolved over the years.
67
Rename Registers
i1: mul r1, r2, r3;
i2: add r6, r4, r5;
Compiler can do register renaming in the register allocation process (i.e., the process that assigns registers to variables).
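How renaming removes a name dependence can be sketched in a few lines. This is an illustrative sketch, not a real compiler pass: `rename_registers` is a hypothetical helper that gives every destination a fresh name (`p0`, `p1`, ...), and the "before" program, in which `i2` writes `r2` while `i1` still reads it, is an assumed WAR case chosen for illustration.

```python
from itertools import count

def rename_registers(program):
    """program: list of (op, dest, src1, src2). Returns the renamed list."""
    fresh = (f"p{n}" for n in count())
    mapping = {}                        # architectural name -> current new name

    def current(reg):
        if reg not in mapping:          # first use: value was live on entry
            mapping[reg] = next(fresh)
        return mapping[reg]

    renamed = []
    for op, dest, src1, src2 in program:
        a, b = current(src1), current(src2)   # read the latest names first
        mapping[dest] = next(fresh)           # every write gets a new name
        renamed.append((op, mapping[dest], a, b))
    return renamed

before = [("mul", "r1", "r2", "r3"),    # i1 reads r2
          ("add", "r2", "r4", "r5")]    # i2 writes r2: WAR on r2 (assumed case)
for instr in rename_registers(before):
    print(instr)
```

Because `i2` now writes a name that `i1` never reads, the two instructions can be reordered freely; true (RAW) dependences are preserved because sources always map to the latest name of their producer.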
68
Hazards
Dependence     Hazard
True (data)    RAW
Output         WAW
Anti           WAR
Control        Control
------         Structural
Out-of-order Pipelining
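[Figure: out-of-order pipeline with stages IF, ID, RD and EX, where EX is split into the INT, Fadd1/Fadd2 and LD/ST units, followed by an out-of-order WB.]
Ib: F1 ← F4 + F5
......
Ia: F1 ← F2 x F3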
Example: an instruction that uses the result in a LOAD's destination register should not immediately follow the LOAD instruction.
Compiler-based pipeline instruction scheduling.
Simple solution: stall the pipeline.
Pipeline stall: lets some instruction(s) in the pipeline proceed, while others are made to wait for data, resources, etc.
In a pipeline,
All data hazards can be checked during the ID phase of the pipeline.
If a data hazard is detected, the next instruction should be stalled.
Whether forwarding is needed can also be determined at this stage, and the control signals set.
The control unit of the pipeline must stall the pipeline and prevent instructions in IF and ID from advancing.
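The ID-stage check for the one case that forwarding alone cannot cover, a load followed immediately by a use, can be sketched as follows. This is a simplified sketch assuming a 5-stage pipeline with full forwarding; `must_stall` and its parameter names are hypothetical.

```python
def must_stall(prev_is_load, prev_dest, current_srcs):
    """Decide in ID whether the instruction being decoded must wait one cycle.

    prev_is_load: the instruction just ahead (now in EX) is a load.
    prev_dest:    register that instruction will write.
    current_srcs: registers read by the instruction in ID.
    """
    # A loaded value only exists after MEM, so it cannot be forwarded
    # to the ALU input of the very next instruction in time.
    return prev_is_load and prev_dest in current_srcs

# load r1, addr ; add r2, r1, r1  -> load-use hazard, stall needed
print(must_stall(True, "r1", ("r1", "r1")))
# ADD R1,R2,R3 ; SUB R4,R1,R5    -> handled by forwarding, no stall
print(must_stall(False, "R1", ("R1", "R5")))
```

When the function returns true, the control unit would hold the instructions in IF and ID and insert a bubble into EX.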
73
If hazard is detected,
74
Branch Prediction
KEY IDEA: Hope that branch assumption is correct.
If yes, then we've gained a performance improvement.
cont
Use the result from the last time this instruction executed.
Problem:
Even if the branch is almost always taken, we will be wrong at least twice.
If the branch alternates between taken and not taken, we get 0% accuracy.
1-bit Predictor
Set bit to 1 or 0:
Depending
Pipeline If
Example
[Figure: 2-bit predictor state diagram with states 11, 10, 01 and 00 and transitions labelled T (taken) and NT (not taken).]
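The behaviour of the two predictors can be compared with a short simulation. This is a sketch under stated assumptions: the 2-bit predictor is modelled as a saturating counter (states 0 to 3, predict taken in states 2 and 3), which is one common reading of the state diagram, and the function names and branch traces are made up for illustration.

```python
def one_bit_accuracy(outcomes, last=False):
    """1-bit predictor: predict whatever the branch did last time."""
    correct = 0
    for taken in outcomes:
        correct += (last == taken)
        last = taken
    return correct / len(outcomes)

def two_bit_accuracy(outcomes, state=0):
    """2-bit saturating counter: predict taken when state is 2 or 3."""
    correct = 0
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

alternating = [True, False] * 10           # taken, not taken, taken, ...
loop_branch = ([True] * 9 + [False]) * 10  # loop branch: 9 taken, then exit

print("1-bit, alternating:", one_bit_accuracy(alternating, last=False))
print("1-bit, loop branch:", one_bit_accuracy(loop_branch, last=True))
print("2-bit, loop branch:", two_bit_accuracy(loop_branch, state=3))
```

On the loop trace the 1-bit predictor is wrong twice per loop execution (at the exit and at the next entry), while the 2-bit counter is wrong only at the exit.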
Program assumptions:
23% loads, and in 1/2 of cases the next instruction uses the load value
13% stores
19% conditional branches
2% unconditional branches
43% other
Example
cont
Machine Assumptions:
5-stage pipe.
Penalty of 1 cycle on use of a load value immediately after a load.
Jumps are resolved in the ID stage for a 1 cycle branch penalty.
75% branch prediction accuracy.
1 cycle delay on misprediction.
Example
cont
Loads: 0.23 x 0.5 x 1 = 0.115
Jumps: 0.02 x 1 = 0.02
Conditional Branches: 0.19 x 0.25 x 1 = 0.0475
Total Penalty: 0.115 + 0.02 + 0.0475 = 0.1825
Average CPI: 1 + 0.1825 = 1.1825
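The totals above can be checked with a few lines of arithmetic. In this sketch the one-half load-use fraction is inferred from the stated load penalty (0.115 / 0.23), and all variable names are illustrative.

```python
# Instruction mix and machine assumptions from the example.
LOAD_FRACTION = 0.23
LOAD_USE_FRACTION = 0.5        # inferred: 0.115 / 0.23
JUMP_FRACTION = 0.02
BRANCH_FRACTION = 0.19
PREDICTION_ACCURACY = 0.75
PENALTY_CYCLES = 1             # 1 cycle for each kind of stall

load_penalty = LOAD_FRACTION * LOAD_USE_FRACTION * PENALTY_CYCLES
jump_penalty = JUMP_FRACTION * PENALTY_CYCLES
branch_penalty = BRANCH_FRACTION * (1 - PREDICTION_ACCURACY) * PENALTY_CYCLES

total_penalty = load_penalty + jump_penalty + branch_penalty
average_cpi = 1 + total_penalty

print(f"loads {load_penalty:.4f}, jumps {jump_penalty:.4f}, "
      f"branches {branch_penalty:.4f}")
print(f"total penalty {total_penalty:.4f}, average CPI {average_cpi:.4f}")
```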
85
On the average the size of a basic block is 7. After every 7 instructions, a branch instruction is encountered.
Loop-level Parallelism
88
A dependence within the body of the loop itself (i.e. within one iteration).
89
Loop-level Dependence
Example:
for(i=0;i<1000;i++){
    a[i+1]=b[i]+c[i];
    b[i+1]=a[i+1]+d[i];
}
Loop-carried dependence: b[i], read in one iteration, is written as b[i+1] by the preceding iteration. Also, there is a loop-independent dependence on account of a[i+1].
Loop:   L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     F4,0(R1)      ; store result
        DADDUI  R1,R1,#-8     ; decrement ptr
        BNE     R1,R2,Loop
Loop:   L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     F4,0(R1)
        L.D     F6,-8(R1)
        ADD.D   F8,F6,F2
        S.D     F8,-8(R1)
        L.D     F10,-16(R1)
        ADD.D   F12,F10,F2
        S.D     F12,-16(R1)
        L.D     F14,-24(R1)
        ADD.D   F16,F14,F2
        S.D     F16,-24(R1)
        DADDUI  R1,R1,#-32
        BNE     R1,R2,Loop
n loop bodies for n = 4. Note the adjustments for store and load offsets.
With dependence:
for(i=1;i<1000;i++){
    a[i]=a[i]+b[i];
    b[i+1]=c[i]+d[i];
}
Without dependence:
a[1]=a[1]+b[1];
for(i=1;i<999;i++){
    b[i+1]=c[i]+d[i];
    a[i+1]=a[i+1]+b[i+1];
}
b[1000]=c[999]+d[999];
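That the restructured loop computes the same result can be checked directly. The sketch below is a Python transcription under the assumption that the restructured version peels `a[1]=a[1]+b[1]` before the loop and `b[1000]=c[999]+d[999]` after it; the function names and the random test data are illustrative.

```python
import random

N = 1000  # arrays are indexed 1..1000, so they need N + 1 slots

def with_dependence(a, b, c, d):
    a, b = a[:], b[:]
    for i in range(1, N):
        a[i] = a[i] + b[i]            # uses b[i] written by iteration i-1
        b[i + 1] = c[i] + d[i]
    return a, b

def without_dependence(a, b, c, d):
    a, b = a[:], b[:]
    a[1] = a[1] + b[1]                # peeled first statement
    for i in range(1, N - 1):
        b[i + 1] = c[i] + d[i]
        a[i + 1] = a[i + 1] + b[i + 1]  # dependence now inside one iteration
    b[N] = c[N - 1] + d[N - 1]        # peeled last statement
    return a, b

rng = random.Random(0)
a, b, c, d = ([rng.randint(-9, 9) for _ in range(N + 1)] for _ in range(4))
print(with_dependence(a, b, c, d) == without_dependence(a, b, c, d))
```

In the restructured loop no iteration reads a value written by another iteration, so the iterations can run in parallel.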
96
Software Pipelining
cont
Each iteration is made from instructions chosen from different iterations of the original loop.
[Figure: software-pipelined iterations built from instructions of original iterations i0 to i5.]
Software Pipelining
cont
In each iteration of software-pipelined code, some instruction of some iteration of the original loop is executed.
Software Pipelining
cont
1. Unroll the loop body with an unroll factor of n (we have taken n = 3 for our example).
2. Select the order of instructions from different iterations to pipeline.
3. Paste instructions from different iterations into the new pipelined loop body.
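The steps can be illustrated on a loop that adds a scalar to every array element (load, add, store). This is a sketch under assumptions: the kernel order of store for iteration i, add for iteration i+1, load for iteration i+2 follows the common textbook form, the index runs upward rather than decrementing a pointer, and the variables `f0` and `f4` merely mimic registers.

```python
def original_loop(x, s):
    x = x[:]
    for i in range(len(x)):
        f0 = x[i]          # L.D
        f4 = f0 + s        # ADD.D
        x[i] = f4          # S.D
    return x

def software_pipelined(x, s):
    x = x[:]
    n = len(x)
    assert n >= 2, "sketch assumes at least two iterations"
    # Prologue: start iterations 0 and 1.
    f0 = x[0]              # load for iteration 0
    f4 = f0 + s            # add for iteration 0
    f0 = x[1]              # load for iteration 1
    # Kernel: one instruction from each of three different iterations.
    for i in range(n - 2):
        x[i] = f4          # S.D   for iteration i
        f4 = f0 + s        # ADD.D for iteration i + 1
        f0 = x[i + 2]      # L.D   for iteration i + 2
    # Epilogue: finish the last two iterations.
    x[n - 2] = f4
    f4 = f0 + s
    x[n - 1] = f4
    return x

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(original_loop(data, 10.0))
print(software_pipelined(data, 10.0))
```

Inside the kernel, no instruction uses a value produced in the same pass, which is what removes the stalls between load, add and store.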
Note:
1.) We are unrolling the loop, hence no loop overhead instructions are needed!
2.) A single loop body of the restructured loop would contain instructions from different iterations of the original loop body.
Iteration i:     L.D     F0,0(R1)
                 ADD.D   F4,F0,F2
                 S.D     F4,0(R1)
Iteration i + 1: L.D     F0,0(R1)
                 ADD.D   F4,F0,F2
                 S.D     F4,0(R1)
Iteration i + 2: L.D     F0,0(R1)
                 ADD.D   F4,F0,F2
                 S.D     F4,0(R1)
Notes:
1.) We'll select the order of instructions in our pipelined loop from the three iterations above.
2.) Each instruction (L.D, ADD.D, S.D) must be selected at least once to make sure that we don't leave out any instructions of the original loop in the pipelined loop.
3.) DADDUI R1,R1,#-8 and BNE R1,R2,Loop complete the pipelined loop body.
L.D     F0,0(R1)      ; M[i-2]
DADDUI  R1,R1,#-8
BNE     R1,R2,Loop
In more complex examples, we may need to increase the iterations between when data is read and when the results are used.
Scoreboarding: implemented on the CDC 6600 computer.
Tomasulo's Algorithm: implemented for the FP unit of the IBM 360/91 in 1966.
WAR and WAW hazards that did not exist in an in-order pipeline:
Can
113
Scoreboarding
cont
114
Issue: decode instructions, check for structural hazards.
Read operands: wait until no data hazards, then read operands.
Scoreboarding
Instructions pass through the issue stage in order. Instructions can bypass each other in the read operands stage:
Then
cont
116
Scoreboarding Concepts
We had observed that WAR and WAW hazards can occur in out-of-order execution:
Instructions involved in a dependence are stalled, but instructions having no dependence are allowed to continue.
Different units are kept as busy as possible.
Scoreboarding Concepts
Essence of scoreboarding:
Execute instructions as early as possible.
Later instructions are issued and executed if they do not depend on any active or stalled instruction.
Scoreboarding
Achieved with multiple functional units, along with pipelined functional units.
[Figure: Scoreboard connected by control/status lines to the functional units, including the Integer Unit.]
1. Issue: when a functional unit (f.u.) for an instruction is free and no other active instruction has the same destination register.
2. Read operands: when all source operands are available.
Note: forwarding is not used. A source operand is available if no earlier issued active instruction is going to write it. This resolves RAW hazards dynamically.
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F8, F8, F14
ADD.D cannot proceed to read operands until DIV.D completes; SUB.D can execute but cannot write back until ADD.D has read F8.
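The scoreboard conditions for this three-instruction example can be sketched as three predicates. This is a simplified model, not the CDC 6600 logic: the `Active` record, the unit names (`div`, `add1`, `add2`) and the assumption that ADD.D and SUB.D use separate adders are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Active:
    name: str
    unit: str
    dest: str
    srcs: tuple
    operands_read: bool = False

def can_issue(unit, dest, active):
    """Issue: unit is free and no active instruction writes the same register."""
    return all(a.unit != unit and a.dest != dest for a in active)

def can_read_operands(srcs, earlier):
    """Read operands: no earlier active instruction still has to write a source (RAW)."""
    return all(a.dest not in srcs for a in earlier)

def can_write_result(dest, earlier):
    """Write result: every earlier instruction that reads dest has read it (WAR)."""
    return all(a.operands_read or dest not in a.srcs for a in earlier)

div = Active("DIV.D", "div", "F0", ("F2", "F4"), operands_read=True)
add = Active("ADD.D", "add1", "F10", ("F0", "F8"))

print(can_issue("add1", "F10", [div]))                # ADD.D issues
print(can_read_operands(add.srcs, [div]))             # ...but waits for F0
print(can_read_operands(("F8", "F14"), [div, add]))   # SUB.D reads its operands
print(can_write_result("F8", [div, add]))             # ...but cannot write F8 yet
```

Once `add.operands_read` becomes true, `can_write_result("F8", ...)` succeeds, which corresponds to SUB.D writing back after ADD.D has read F8.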
An Assessment of Scoreboarding
Pros:
Factor of 1.7 improvement for FORTRAN and 2.5 for hand-coded assembly on the CDC 6600! (Before semiconductor main memory or caches.)
Required about as much logic as a functional unit, which is quite low.
Cons:
Large number of buses needed. However, if we wish to issue multiple instructions per clock, more wires are needed in any case.
Anti-dependences and output dependences (WAR and WAW hazards) are also handled using stalls.
To keep the floating point pipeline as busy as possible. This led Tomasulo to try to figure out how to achieve renaming in hardware!
Reservation stations:
Single entry buffer at the head of each functional unit has been replaced by a multiple entry buffer.
Register Tags:
Connects the output of the functional units to the reservation stations as well as registers. Tag corresponds to the reservation station entry number for the instruction producing the result.
128
Reservation Stations
An instruction waits in the reservation station until its operands become available.
The reservation station fetches and buffers an operand as soon as it is available.
Tomasulo's Algorithm
Control and buffers are distributed with the Function Units (FU), in the form of reservation stations associated with every function unit.
Reservation stations store operands for issued but pending instructions.
Registers in instructions are replaced by values, or by pointers to reservation stations (RS).
Tomasulo's Algorithm
cont
Results are passed to the FUs not through registers but over the Common Data Bus (CDB), which broadcasts results to all FUs; this is similar to forwarding.
Loads and stores are treated as FUs with RSs as well.
Tomasulo's Scheme
[Figure: Tomasulo organisation showing the instruction queue (from the instruction unit), registers, address unit, load buffer, store buffer, reservation stations feeding the adder and multiplier, the memory unit, and the CDB.]
When both operands ready then execute; if not ready, watch Common Data Bus for result
Tomasulo Example
Instruction stream
Instruction status:
Instruction      j      k
LD     F6        34+    R2
LD     F2        45+    R3
MULTD  F0        F2     F4
SUBD   F8        F6     F2
DIVD   F10       F0     F6
ADDD   F6        F8     F2
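The renaming that happens at issue time can be sketched for this instruction stream. This is an issue-only sketch: execution, write-back and the CDB are not modelled, so operands that the full example later shows as values (such as M(A1)) stay as tags here. The function `issue_all`, the tuple layout, and the station counts (3 load buffers, 3 add and 2 multiply stations) follow the tables only loosely and should be read as assumptions.

```python
def issue_all(program, stations):
    """Issue instructions in order, renaming registers to station tags."""
    reg_status = {}                       # register -> tag that will produce it
    free = {kind: list(names) for kind, names in stations.items()}
    issued = []
    for kind, dest, srcs in program:
        tag = free[kind].pop(0)           # assumes a station of this kind is free
        # A source is either a pending tag or the current register value R(x).
        operands = [reg_status.get(s, f"R({s})") for s in srcs]
        reg_status[dest] = tag            # later readers of dest wait on this tag
        issued.append((tag, operands))
    return issued, reg_status

stations = {"load": ["Load1", "Load2", "Load3"],
            "add": ["Add1", "Add2", "Add3"],
            "mult": ["Mult1", "Mult2"]}
program = [("load", "F6", ("R2",)),        # LD    F6, 34+R2
           ("load", "F2", ("R3",)),        # LD    F2, 45+R3
           ("mult", "F0", ("F2", "F4")),   # MULTD F0, F2, F4
           ("add", "F8", ("F6", "F2")),    # SUBD  F8, F6, F2
           ("mult", "F10", ("F0", "F6")),  # DIVD  F10, F0, F6
           ("add", "F6", ("F8", "F2"))]    # ADDD  F6, F8, F2

issued, reg_status = issue_all(program, stations)
for tag, operands in issued:
    print(tag, operands)
print(reg_status)
```

Note how the final ADDD retargets F6 to a new tag while DIVD still holds the older tag for F6, which is how WAR and WAW hazards on F6 disappear.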
Busy Address
No No No
3 Load/Buffers
Op S1 Vj S2 Vk RS Qj RS Qk
Reservation Stations:
Time Name Busy Add1 No Add2 No FU count Add3 No down Mult1 No Mult2 No
F0
F2
F4
F6
F8
F10
F12
...
F30
Busy Address
Yes No No 34+R2
Reservation Stations:
Time Name Busy Add1 No Add2 No Add3 No Mult1 No Mult2 No
Op
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
Load1
F8
F10
F12
...
F30
135
Busy Address
Yes Yes No 34+R2 45+R3
Reservation Stations:
Time Name Busy Add1 No Add2 No Add3 No Mult1 No Mult2 No
Op
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
Load2
F4
F6
Load1
F8
F10
F12
...
F30
Busy Address
Yes Yes No 34+R2 45+R3
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes MULTD Mult2 No
S1 Vj
S2 Vk
RS Qj
RS Qk
R(F4) Load2
F0
F2
F4
F6
Load1
F8
F10
F12
...
F30
Mult1 Load2
Note: register names are removed (renamed) in the Reservation Stations; MULT issued.
Load1 completing; what is waiting for Load1?
Busy Address
No Yes No 45+R3
Reservation Stations:
Time Name Busy Op Add1 Yes SUBD M(A1) Load2 Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 Load2
M(A1) Add1
Busy Address
No No No
Reservation Stations:
Time Name Busy Op 2 Add1 Yes SUBD M(A1) M(A2) Add2 No Add3 No 10 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 M(A2)
Busy Address
No No No
Reservation Stations:
Time Name Busy Op 1 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 9 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
Add2
F8
F10
F12
...
F30
Mult1 M(A2)
Add1 Mult2
Busy Address
No No No
Reservation Stations:
Time Name Busy Op 0 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 8 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
Add2
F8
F10
F12
...
F30
Mult1 M(A2)
Add1 Mult2
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No 2 Add2 Yes ADDD (M-M) M(A2) Add3 No 7 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 M(A2)
142
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No 1 Add2 Yes ADDD (M-M) M(A2) Add3 No 6 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 M(A2)
143
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No 0 Add2 Yes ADDD (M-M) M(A2) Add3 No 5 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 M(A2)
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No 4 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 M(A2)
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No 3 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 M(A2)
146
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No 2 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 M(A2)
147
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No 1 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 M(A2)
148
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No 0 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
Mult1 M(A2)
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes DIVD M*F4 M(A1)
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
M*F4 M(A2)
151
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No 1 Mult2 Yes DIVD M*F4 M(A1)
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
M*F4 M(A2)
152
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No 0 Mult2 Yes DIVD M*F4 M(A1)
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
M*F4 M(A2)
Busy Address
No No No
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No Mult2 Yes DIVD M*F4 M(A1)
S1 Vj
S2 Vk
RS Qj
RS Qk
F0
F2
F4
F6
F8
F10
F12
...
F30
M*F4 M(A2)
Once again: in-order issue, out-of-order execution and out-of-order completion.
The CDB connects to multiple functional units: high capacitance, high wiring density.
The number of functional units that can complete per cycle is limited to one!
Imprecise exceptions!
Interrupts/Exceptions
156
Imprecise Exceptions
The processor state when an exception is raised does not look exactly the same as it would if the instructions were executed in order.
Imprecise Exceptions
Some instructions before it may not be complete.
Some instructions after it are already complete.
For example:
A floating point instruction exception could be detected after an integer instruction that is much later in the program order is complete.
Loop:   LD      F0, 0(R1)
        MULTD   F4, F0, F2
        SD      F4, 0(R1)
        SUBI    R1, R1, #8
        BNEZ    R1, Loop
1st load takes 8 clocks (L1 cache miss); 2nd load takes 1 clock (hit).
integer instructions ahead of FP Instructions.
162
Show 2 iterations
Instruction status:
ITER Instruction
Iter1 ation Count 2
2 2 1 1 LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4
Loop Example
ExecWrite j
0 F0 0 0 F0 0
k IssueCompResult
R1 F2 R1 R1 F2 R1
Busy Addr
Load1 No Load2 No Load3 No Store1 No Store2 No Store3 No
Fu
Reservation Stations:
Add1 Add2 Add3 Mult1 Mult2 R1 80 No No No No No
S1 Vj Vk
S2 Qj
RS Qk Code:
Instruction Loop
Clock
0
F0 F2 F4 F6 F8
Fu
F10 F12
...
F30 163
j
0
k
R1
Busy Addr
Yes No No No No No 80
Fu
Reservation Stations:
Time Name Busy Add1 No Add2 No Add3 No Mult1 No Mult2 No R1 80 Op Vj
S1 Vk
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
1
F0
Fu Load1
F2
F4
F6
F8
F10 F12
...
F30
164
j
0 F0
k
R1 F2
Busy Addr
Yes No No No No No 80
Fu
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 80 Vj
S1 Vk
R(F2) Load1
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
2
F0
Fu Load1
F2
F4
Mult1
F6
F8
F10 F12
...
F30
165
j
0 F0 0
k
R1 F2 R1
Busy Addr
Yes No No Yes No No 80
Fu
80
Mult1
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 80 Vj
S1 Vk
R(F2) Load1
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
3
F0
Fu Load1
F2
F4
Mult1
F6
F8
F10 F12
...
F30
166
j
0 F0 0
k
R1 F2 R1
Busy Addr
Yes No No Yes No No 80
Fu
80
Mult1
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 80 Vj
S1 Vk
R(F2) Load1
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
4
F0
Fu Load1
F2
F4
Mult1
F6
F8
F10 F12
...
F30
167
j
0 F0 0
k
R1 F2 R1
Busy Addr
Yes No No Yes No No 80
Fu
80
Mult1
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 72 Vj
S1 Vk
R(F2) Load1
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
5
F0
Fu Load1
F2
F4
Mult1
F6
F8
F10 F12
...
F30
168
j
0 F0 0 0
k
R1 F2 R1 R1
Busy Addr
Yes Yes No Yes No No 80 72 80
Fu
Mult1
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 72 Vj
S1 Vk
R(F2) Load1
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
6
F0
Fu Load2
F2
F4
Mult1
F6
F8
F10 F12
...
F30
169
j
0 F0 0 0 F0
k
R1 F2 R1 R1 F2
Busy Addr
Yes Yes No Yes No No 80 72 80
Fu
Mult1
Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 Yes Multd R1 72 Vj
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
7
F0
Fu Load2
F2
F4
Mult2
F6
F8
F10 F12
...
F30
Register file completely detached from computation.
First and second iterations completely overlapped.
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1 Vj
Busy Addr
Yes Yes No Yes Yes No 80 72 80 72
Fu
Mult1 Mult2
Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 Yes Multd R1 72
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
8
F0
Fu Load2
F2
F4
Mult2
F6
F8
F10 F12
...
F30
171
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1 Vj
Busy Addr
Yes Yes No Yes Yes No 80 72 80 72
Fu
Mult1 Mult2
Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 Yes Multd R1 72
S2 Qj
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
9
F0
Fu Load2
F2
F4
Mult2
F6
F8
F10 F12
...
F30
172
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1
Busy Addr
No Yes No Yes Yes No 72 80 72
Fu
10
Mult1 Mult2
Reservation Stations:
Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd R(F2) Load2 R1 64
S2 Qj
RS Qk
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
10
F0
Fu Load2
F2
F4
Mult2
F6
F8
F10 F12
...
F30
173
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1
Busy Addr
No No Yes Yes Yes No
Fu
10
11
64 80 72
Mult1 Mult2
Reservation Stations:
3 4
Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd M[72] R(F2) R1 64
S2 Qj
RS Qk
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
11
F0
Fu Load3
F2
F4
Mult2
F6
F8
F10 F12
...
F30
174
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1
Busy Addr
No No Yes Yes Yes No
Fu
10
11
64 80 72
Mult1 Mult2
Reservation Stations:
2 3
Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd M[72] R(F2) R1 64
S2 Qj
RS Qk
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
12
F0
Fu Load3
F2
F4
Mult2
F6
F8
F10 F12
...
F30
175
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1
Busy Addr
No No Yes Yes Yes No
Fu
10
11
64 80 72
Mult1 Mult2
Reservation Stations:
1 2
Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd M[72] R(F2) R1 64
S2 Qj
RS Qk
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
13
F0
Fu Load3
F2
F4
Mult2
F6
F8
F10 F12
...
F30
176
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1
Busy Addr
No No Yes Yes Yes No
Fu
11
64 80 72
Mult1 Mult2
Reservation Stations:
0 1
Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd M[72] R(F2) R1 64
S2 Qj
RS Qk
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
14
F0
Fu Load3
F2
F4
Mult2
F6
F8
F10 F12
...
F30
177
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1
Busy Addr
No No Yes Yes Yes No
Fu
64 80 72
[80]*R2 Mult2
Reservation Stations:
Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 No Mult2 Yes Multd M[72] R(F2) R1 64
RS Qk
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
15
F0
Fu Load3
F2
F4
Mult2
F6
F8
F10 F12
...
F30
178
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1 Vj
Busy Addr
No No Yes Yes Yes No
Fu
64 80 72
[80]*R2 [72]*R2
Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 64
R(F2) Load3
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
16
F0
Fu Load3
F2
F4
Mult1
F6
F8
F10 F12
...
F30
179
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1 Vj
Busy Addr
No No Yes Yes Yes Yes
Fu
64 80 72 64
Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 64
R(F2) Load3
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
17
F0
Fu Load3
F2
F4
Mult1
F6
F8
F10 F12
...
F30
180
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1 Vj
Busy Addr
No No Yes Yes Yes Yes
Fu
64 80 72 64
Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 64
R(F2) Load3
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
18
F0
Fu Load3
F2
F4
Mult1
F6
F8
F10 F12
...
F30
181
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1 Vj
Busy Addr
No No Yes No Yes Yes
Fu
64 72 64 [72]*R2 Mult1
Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 56
R(F2) Load3
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
19
F0
Fu Load3
F2
F4
Mult1
F6
F8
F10 F12
...
F30
182
j
0 F0 0 0 F0 0
k
R1 F2 R1 R1 F2 R1 Vj
Busy Addr
Yes No Yes No No Yes 56 64
Fu
64
Mult1
Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 56
R(F2) Load3
F0 F4 F4 R1 R1
0 F0 0 R1 Loop
R1 F2 R1 #8
Clock
20
F0
Fu Load1
F2
F4
Mult1
F6
F8
F10 F12
...
F30
Once again: in-order issue, out-of-order execution and out-of-order completion.
the WAR stall that occurred in the scoreboard. Also, multiple iterations use different physical destinations facilitating dynamic loop unrolling.
184
Instructions can be passed simultaneously by broadcast on CDB.
Units would have to read their results from registers.
2. Elimination of stalls for WAW and WAR hazards.
3. Possible to have superscalar execution:
Superscalar processors:
Multiple issue, dynamically scheduled, speculative execution, branch prediction.
More hardware functionalities and complexities.
VLIW:
Let
Superscalar Execution
True Data Dependency: the result of one operation is an input to the next.
Resource constraints: two operations require the same resource.
Branch Dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori.
Superscalar processor of degree m.
Hardware cost and complexity of superscalar schedulers is a major consideration in processor design.
VLIW processors rely on compile time analysis to identify and bundle together instructions that can be executed concurrently.
Thus the name very long instruction word. This concept is employed in the Intel IA64 processors.
192
VLIW Processors
VLIW processors deploy multiple independent functional units. Early VLIW processors operated lock step:
There was no hazard detection in hardware at all.
A stall in any functional unit causes the entire pipeline to stall.
VLIW Processors
It is not issued.
[Figure: VLIW organisation with four functional units (FU).]
Issue hardware is simpler.
Compiler has a bigger context from which to select co-scheduled instructions.
Compilers, however, do not have runtime information such as cache misses.
Scheduling is, therefore, inherently conservative.
Branch and memory prediction is more difficult.
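Compile-time bundling can be sketched with a greedy in-order packer. This is a deliberately conservative sketch, not how a production VLIW compiler schedules: `pack_bundles` is a hypothetical helper that starts a new bundle whenever the next instruction has any register dependence (RAW, WAR or WAW) on an instruction already in the current bundle, or when the bundle is full; functional-unit types and latencies are ignored.

```python
def pack_bundles(program, width):
    """Greedily pack (op, dest, srcs) instructions into bundles of at most `width`."""
    bundles = [[]]
    for instr in program:
        _, dest, srcs = instr
        current = bundles[-1]
        conflict = any(
            d == dest        # WAW with an instruction in this bundle
            or d in srcs     # RAW: we read what it writes
            or dest in s     # WAR: we write what it reads
            for _, d, s in current
        )
        if conflict or len(current) == width:
            bundles.append([])            # start a new long instruction word
        bundles[-1].append(instr)
    return bundles

program = [("ADD", "R1", ("R2", "R3")),
           ("XOR", "R8", ("R6", "R7")),
           ("AND", "R9", ("R10", "R11")),
           ("SUB", "R4", ("R1", "R5"))]   # depends on ADD through R1

for bundle in pack_bundles(program, width=4):
    print([op for op, _, _ in bundle])
```

Even with four slots available, SUB lands in a second bundle because the compiler must assume the ADD result is not ready within the same word.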
VLIW Summary
Compiler detects hazards and determines scheduling. There is no (or only partial) hardware hazard detection:
VLIW vs Superscalar
199
Superscalar Processors
In the embedded processor market, dual-issue superscalar pipelines are becoming common.
Maximum throughput is bounded by one instruction per cycle.
Inefficient unification of instructions into one pipeline:
ALU and MEM stages are very diverse, e.g. FP.
If a leading instruction is stalled, every subsequent instruction is stalled.
A Rigid Pipeline
Stalled Instruction
203
diversified pipelines.
Machine Parallelism
(a) No Parallelism (Nonpipelined)
(b) Temporal Parallelism (Pipelined)
(c) Spatial Parallelism (Multiple units)
(d) Combined Temporal and Spatial Parallelism
A Parallel Pipeline
Width = 3
A Superscalar Pipeline
A degree six superscalar pipeline
211
[Figure: superscalar pipeline showing the instruction flow and the data flow, ending in Complete (completion buffer) and Retire (store buffer).]
One of the instructions can be a load, store, or integer ALU operation. The other can be a floating point operation.
[Figure: pipeline with IF and ID feeding a multiply unit (M1 to M7), an add unit (A1 to A4) and a DIV unit, followed by M and WB.]
Pipelined implementations, e.g. 7 outstanding MUL, 4 outstanding ADD, unpipelined DIV.
In-order execution, out-of-order completion.
Tomasulo w/o ROB: out-of-order execution, out-of-order completion, in-order commit.