CS/CoE 1541
Single and Multi-cycle
Implementations
Introduction to Computer Architecture
Typical Instruction Execution
Single Cycle Recap
1) Fetch instruction from memory
2) Decode instruction
3) If necessary, perform an ALU operation
4) If memory access, perform load/store
5) Write results back to register file and increment the PC
Fetching Instruction
Memory to hold instructions
Program Counter (PC) to generate the address of the instruction
Adder to increment the PC to the next instruction's address
[Figure: the Program Counter register supplies the current PC as the instruction address to the Instruction Memory (RAM), which outputs the instruction; an adder computes the next PC from the current PC]
Instruction Fetch Unit
[Figure: instruction fetch unit. The Program Counter register drives the instruction address of the Instruction Memory (RAM); an adder adds 4 to the current PC to form the next PC, which is written back into the PC]
ALU Operation
Consider a basic ALU operation
add R1,R2,R3
Requires a Register File and an ALU
[Figure: register numbers select Read Reg 1, Read Reg 2, and Write Reg in the Register File; Read Data 1 and Read Data 2 feed the ALU, whose operation is selected by ALU Operation, and the ALU result returns to Write Data]
ADD R2, R3, R4
Instruction register fields:
op      rs     rt     rd     shamt  funct
000000  00011  00100  00010  00000  100000
[Figure: the rs and rt fields drive Read Reg 1 and Read Reg 2 of the Register File and the rd field drives Write Reg; Read Data 1 and Read Data 2 feed the ALU, and the result is written back through Write Data when Write Enable is asserted]
Accessing Memory (loads and stores)
LW R4,0x10(R2)
[Figure: the Register File and ALU datapath with a Data Memory (RAM) added]
Load: From Memory to Register File
LW R4,0x10(R2)
[Figure: the read data output of the Data Memory (RAM) is routed to Write Data of the Register File, with Write Enable asserted]
Computing the Address (part 1)
LW R4,0x10(R2)
[Figure: the Data Memory (RAM) takes a Read/Write address and produces read data, which is routed to Write Data of the Register File]
Computing the Address (part 2)
LW R4,0x10(R2)
[Figure: the ALU performs an add of Read Data 1 (the base register) and the 16-bit immediate sign-extended to 32 bits; the sum is the Read/Write address of the Data Memory (RAM)]
Sign Extender
op      rs     rt     immediate
000000  00011  00100  00010000 00100000
[Figure: the 16-bit immediate (bits 15..0 of the instruction) passes through the sign extender and becomes a 32-bit input to the ALU]
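A minimal Python sketch of what the sign extender does, for illustration only. The function name `sign_extend16` is hypothetical and the result is returned as a 32-bit pattern; the slides show the hardware block, not this code.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a 32-bit two's-complement bit pattern."""
    value &= 0xFFFF            # keep only bits 15..0
    if value & 0x8000:         # bit 15 is the sign bit
        value |= 0xFFFF0000    # copy the sign bit into bits 31..16
    return value

print(hex(sign_extend16(0x0010)))  # a positive offset keeps its value
print(hex(sign_extend16(0xFFF0)))  # -16 as a 16-bit value
```

A positive immediate such as 0x10 is unchanged, while a negative one has its upper 16 bits filled with ones so that the ALU add still produces the right address.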
Store: From Register File to Memory
SW R4,0x10(R2)
[Figure: store datapath. The ALU adds Read Data 1 and the 16-bit immediate sign-extended to 32 bits to form the Read/Write address; Read Data 2 goes to Write Data of the Data Memory (RAM)]
Branch
BEQ R1, R2, label
[Figure: Read Data 1 and Read Data 2 from the Register File are compared in the ALU; its Zero output goes to the branch control logic]
Need Adder to Compute Branch Target
BEQ R1, R2, label
[Figure: a separate ADDER computes the branch target from PC + 4 (from the instruction datapath) and the 16-bit immediate sign-extended to 32 bits and shifted left by 2; the ALU Zero output goes to the branch control logic]
Data Path for Memory and R-type Instructions
[Figure: combined datapath. One MUX selects Read Data 2 or the sign-extended 16-bit immediate as the second ALU input; a second MUX selects the ALU result or the Data Memory (RAM) read data as Write Data for the Register File]
Complete Single-cycle Datapath
[Figure: complete single-cycle datapath: PC, adder for PC + 4, Instruction Memory (RAM), branch-target ADDER fed by the sign-extended immediate shifted left by 2, a MUX selecting the next PC, Register File, ALU with Zero output, Data Memory (RAM), and the two data MUXes]
What's Wrong with a Single-Cycle Implementation?
That was a single-cycle machine?
Yep! It was assumed that data flows through all parts of the datapath in ONE clock cycle
How long is a cycle?
ALU            10 ns
Register File   5 ns
Memory         10 ns
Assume everything else takes zero time
Instruction Timings
Instr Type  InstrMem  Reg Read  ALU  DataMem  Reg Write  Total
R-format    10        5         10   -        5          30 ns
Load        10        5         10   10       5          40 ns
Store       10        5         10   10       -          35 ns
Branch      10        5         10   -        -          25 ns
Jump        10        -         -    -        -          10 ns
(all unit times in ns)
[Figure: single-cycle datapath]
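The totals in the timing table can be reproduced with a short Python sketch. The unit delays are the ones stated on the slides (memory 10 ns, register file 5 ns, ALU 10 ns); the dictionary and function names are illustrative assumptions.

```python
# Unit delays from the slides, in ns; everything else is assumed to take zero time.
UNIT_NS = {"InstrMem": 10, "RegRead": 5, "ALU": 10, "DataMem": 10, "RegWrite": 5}

# Which units each instruction type passes through.
USES = {
    "R-format": ["InstrMem", "RegRead", "ALU", "RegWrite"],
    "Load":     ["InstrMem", "RegRead", "ALU", "DataMem", "RegWrite"],
    "Store":    ["InstrMem", "RegRead", "ALU", "DataMem"],
    "Branch":   ["InstrMem", "RegRead", "ALU"],
    "Jump":     ["InstrMem"],
}

def instruction_time(kind):
    """Total delay in ns for one instruction type."""
    return sum(UNIT_NS[unit] for unit in USES[kind])

def single_cycle_clock():
    """A single-cycle clock must fit the slowest instruction."""
    return max(instruction_time(kind) for kind in USES)

for kind in USES:
    print(kind, instruction_time(kind), "ns")
print("clock period:", single_cycle_clock(), "ns")
```

The load is the slowest instruction, so it alone sets the clock period of the single-cycle machine.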
What's Wrong with a Single-Cycle Implementation?
Difficult to implement a variable-cycle clock
Usually run the clock at the speed of the SLOWEST instruction
The slowest instruction determines the critical path
The critical path is the path through the system which limits performance
What if we add a floating point unit?
FP (floating point) math can take a very long time
100s of ns for multiply and divide
Lots of techniques to reduce time - will cover later on
How about breaking the machine into parts?
Multiple Cycle Implementation Datapath
[Figure: multi-cycle datapath with a single Memory, an Instruction Register, a Memory Data Register (MDR), register file output registers A and B, an ALU with Zero output and an ALUOut register, sign extend and shift-left-2 units, a constant 4 input, and a jump address formed from PC[31:28] and the shifted jump target]
Full Diagram of Multi-cycle Machine
Figure 5.33
Execution Steps (1)
Instruction Fetch
IR = Memory[PC];
PC = PC + 4;
Execution Steps (2)
Instruction Decode and Register Fetch
A = Reg[IR[25..21]];
B = Reg[IR[20..16]];
ALUOut = PC + (signExtend(IR[15..0]) << 2);
Execution Step (3)
Execution, memory address computation or branch completion
Memory Reference
ALUOut = A + signExtend(IR[15..0]);
Arithmetic/Logical Operation
ALUOut = A + B
Branch
if (A == B) PC = ALUOut;
Jump
PC = PC[31..28] || (IR[25..0] << 2);
Execution Step (4)
Memory access or R-type instruction completion
Memory Reference
MDR = Memory[ALUOut];
or
Memory[ALUOut] = B;
Arithmetic/Logical Instructions (R-type)
Reg[IR[15..11]] = ALUOut;
Execution Step (5)
Memory Read completion
Reg[IR[20..16]] = MDR;
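The five execution steps can be collected per instruction class in a small Python sketch. It only lists the register-transfer steps from the slides as strings (it does not simulate them); the names `COMMON`, `STEPS`, and `cycles` are hypothetical.

```python
# Steps 1 and 2 are the same for every instruction.
COMMON = [
    "IR = Memory[PC]; PC = PC + 4",
    "A = Reg[IR[25..21]]; B = Reg[IR[20..16]]; "
    "ALUOut = PC + (signExtend(IR[15..0]) << 2)",
]

# Steps 3 to 5 depend on the instruction class.
STEPS = {
    "load":   COMMON + ["ALUOut = A + signExtend(IR[15..0])",
                        "MDR = Memory[ALUOut]",
                        "Reg[IR[20..16]] = MDR"],
    "store":  COMMON + ["ALUOut = A + signExtend(IR[15..0])",
                        "Memory[ALUOut] = B"],
    "r-type": COMMON + ["ALUOut = A + B",
                        "Reg[IR[15..11]] = ALUOut"],
    "branch": COMMON + ["if (A == B) PC = ALUOut"],
    "jump":   COMMON + ["PC = PC[31..28] || (IR[25..0] << 2)"],
}

def cycles(kind):
    """Number of clock cycles an instruction class needs."""
    return len(STEPS[kind])

for kind in STEPS:
    print(kind, cycles(kind))
```

Counting the steps gives 5 cycles for a load, 4 for a store or R-type instruction, and 3 for a branch or jump.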
Multicycle Control
[Figure: multi-cycle datapath annotated with its control signals: MemRead, MemWrite, IRWrite, IorD, RegWrite, RegDest, MemToReg, ALU SelA, ALU SelB, and ALU Op; an ALU Control block decodes Instruction [5:0]]
Performance of Multicycle Implementation
Each type of instruction can take a different number of cycles
Example
Assume the following instruction distribution:
Instr Type  Cycles  Frequency
loads       5       22%
stores      4       11%
R-type      4       49%
branches    3       16%
jumps       3       2%
What's the average Cycles Per Instruction (CPI)?
CPI = (CPU clock cycles / Instruction Count)
CPI = (5 cycles * 0.22) + (4 cycles * 0.11) + (4 cycles * 0.49)
+ (3 cycles * 0.16) + (3 cycles * 0.02)
CPI = 4.04 cycles per instruction
What was the CPI for the single-cycle machine?
Single cycle implies 1 clock cycle per instruction --> CPI = 1.0
So isn't the single-cycle machine faster?
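One way to approach the closing question is to compare time per instruction rather than CPI alone. The Python sketch below recomputes the CPI from the mix above and then applies two assumed clock periods that the slides do not state for this comparison: 40 ns for the single-cycle machine (the slowest instruction from the timing table) and 10 ns for the multi-cycle machine (the slowest single unit).

```python
# (cycles, fraction of instructions) for each class, from the slide.
MIX = {
    "loads":    (5, 0.22),
    "stores":   (4, 0.11),
    "R-type":   (4, 0.49),
    "branches": (3, 0.16),
    "jumps":    (3, 0.02),
}

def average_cpi(mix):
    """Weighted average of cycles per instruction."""
    return sum(cycles * fraction for cycles, fraction in mix.values())

SINGLE_CYCLE_CLOCK_NS = 40   # assumption: slowest instruction (load)
MULTI_CYCLE_CLOCK_NS = 10    # assumption: slowest single step

cpi = average_cpi(MIX)
print("multi-cycle CPI:", round(cpi, 2))
print("single-cycle ns/instr:", 1.0 * SINGLE_CYCLE_CLOCK_NS)
print("multi-cycle ns/instr:", round(cpi * MULTI_CYCLE_CLOCK_NS, 1))
```

Under these assumptions the two machines spend almost the same time per instruction, so a CPI of 1.0 by itself does not make the single-cycle design faster.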
CS/CoE 1541
Pipelining
Looks a Lot Like a Multicycle Processor
What are the steps?
Fetch an instruction
Decode the instruction
ALU operation
Memory access
Write-back
[Figure: multi-cycle datapath with Memory, Instruction Register, Register File, ALU with Zero output, sign extend and shift-left-2 units, and MUXes]
Performance of Pipelined Systems
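[Figure: instructions versus time for unpipelined and pipelined execution; in the pipelined case successive instructions overlap, and latency marks the time one instruction takes from start to completion]
Ideally, $\mathrm{Time}_{pipeline} = \mathrm{Time}_{sequential} / \mathrm{Pipeline\ Depth}$, so $\mathrm{Speedup}_{pipeline} = \mathrm{Pipeline\ Depth}$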
MIPS Pipeline Stages
Stage 1: Instruction Fetch
Stage 2: Instruction Decode
Stage 3: Execute
Stage 4: Memory Access
Stage 5: Write Back (to register file)
How Do We Partition the Datapath into Stages?
[Figure: the single-cycle datapath divided into five stages: Stage 1 Instruction Fetch (PC, adder, Instruction Memory), Stage 2 Decode (Register File, sign extend), Stage 3 ALU, Stage 4 MemAcc (Data Memory), Stage 5 Writeback (MUX back to Register File Write Data)]
But How Do We Separate the Different Stages?
[Figure: the same five-stage datapath (Stage 1 Instruction Fetch, Stage 2 Decode, Stage 3 ALU, Stage 4 MemAcc, Stage 5 Writeback) with REGISTERS inserted between adjacent stages]
Complete 5 Stage Pipeline
[Figure: complete five-stage pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages]
Flow of Instructions Through Pipeline
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7
LW R1, 100(R0)      IM   REG  ALU  DM   REG
LW R2, 200(R0)           IM   REG  ALU  DM   REG
LW R3, 300(R0)                IM   REG  ALU  DM   REG
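The staggered pattern in the timing diagram can be generated mechanically. This Python sketch assumes an ideal pipeline with no stalls; `STAGES`, `pipeline_chart`, and `total_cycles` are illustrative names, not part of the course material.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]

def pipeline_chart(instructions):
    """Return a list of (instruction, {cycle: stage}) for a stall-free pipeline."""
    chart = []
    for index, instr in enumerate(instructions):
        # Instruction number `index` (0-based) enters IM in cycle index + 1.
        occupancy = {index + 1 + s: name for s, name in enumerate(STAGES)}
        chart.append((instr, occupancy))
    return chart

def total_cycles(n_instructions):
    """Cycles needed to drain n instructions through the pipeline."""
    return n_instructions + len(STAGES) - 1

program = ["LW R1, 100(R0)", "LW R2, 200(R0)", "LW R3, 300(R0)"]
for instr, occupancy in pipeline_chart(program):
    row = [occupancy.get(c, "").ljust(4) for c in range(1, total_cycles(len(program)) + 1)]
    print(instr.ljust(16), " ".join(row))
```

Three instructions finish in 7 cycles here, compared with 15 if each one had to finish all five stages before the next one started.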
Stage 1 - IF (Instruction Fetch)
[Figure: five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with an LW instruction in the Instruction Fetch stage]
Stage 2 - ID (Instruction Decode)
[Figure: five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with an LW instruction in the Instruction Decode stage]
Stage 3 - EX (Execution)
[Figure: five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with an LW instruction in the Execution stage]
Stage 4 - MEM (Memory)
[Figure: five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with an LW instruction in the Memory stage]
Stage 5 - WB (Write Back)
[Figure: five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with an LW instruction in the WriteBack stage]
Clock Speed
If a single-cycle machine is broken into 2 pipeline stages,
how much faster can the clock run?
Latency is time from start to completion of instruction
[Figure: an unpipelined unit with 100 nsecs latency from instructions to result, next to the same unit split into 2 pipeline stages]
How Far Can We Go?
Latency is time from start to completion of instruction
[Figure: the 100 nsecs unit from instructions to result, split into progressively more pipeline stages]
5 Stage Pipeline
[Figure: five-stage pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
Pipeline Control
[Figure: five-stage pipelined datapath with a CONTROL unit in the decode stage. Its EX, M, and WB control fields are carried in ID/EX, the M and WB fields in EX/MEM, and the WB field in MEM/WB; an ALU Control block drives the ALU]
Flow of Instructions Through Pipeline
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
ADD R2,R3,R1        IM   REG  ALU  DM   REG
SUB R5,R6,R7             IM   REG  ALU  DM   REG
ADD R10,R11,R12               IM   REG  ALU  DM   REG
Contention at the Register File
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
ADD R10, R11, R12   IM   REG  ALU  DM   REG
ADD R17, R0, R0          IM   REG  ALU  DM   REG
ADD R16, R0, R0               IM   REG  ALU  DM   REG
SUB R20, R21, R22                  IM   REG  ALU  DM   REG
ADD R30, R17, R18                       IM   REG  ALU  DM
Oops - Sometimes Results Are Not Ready
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
ADD R2,R3,R1        IM   REG  ALU  DM   REG
SUB R5,R6,R7             IM   REG  ALU  DM   REG
ADD R10,R11,R12               IM   REG  ALU  DM   REG
ADD R12,R10,R11                    IM   REG  ALU  DM   REG
ADD R10,R11,R12 writes back its result into R10 in cycle 7
ADD R12,R10,R11 reads the value out of R10 in cycle 5
Data Hazards
Programs assume instructions are executed sequentially with one
instruction completing before the next one begins
Usually the compiler assumes the single machine model
Pipelining violates this assumption
Dependencies can occur between instructions executing concurrently
within the pipeline - if the dependencies are based on data
requirements, we call them Data Hazards
Types of data hazards
Read-after-write (RAW)
A true dependency
Write-after-read (WAR)
Artificial dependency due to register assignment
Write-after-write (WAW)
Artificial dependency due to register assignment
Coping with Data Hazards
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
ADD R10, R11, R12   IM   REG  ALU  DM   REG
ADD R12, R10, R11        IM   REG  ALU  DM   REG
ADD R11, R10, R12             IM   REG  ALU  DM   REG
Solution 1 : Stall
Program execution versus time (stages that fall after the last cycle drawn are not shown):
Clock cycle:        1       2       3       4       5       6       7       8
ADD R10, R11, R12   IM      REG     ALU     DM      REG
ADD R12, R10, R11           IM      bubble  bubble  REG     ALU     DM
ADD R11, R10, R12                                   IM      REG     ALU
Recall the Registers Between Pipeline Stages
[Figure: five-stage pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
Stall Conditions
Need to detect data hazard
Occurs when one instruction tries to read the result of a previous instruction that hasn't completed yet.
Specifically,
when the instruction in the Execute stage tries to read a register that an instruction in the MemAcc or WB stage will write back to the Register File
H&P Notation
ID/EX.RegisterRs refers to the number of the first source register found in the pipeline register ID/EX.
ID/EX.RegisterRt refers to the number of the second source register found in the pipeline register ID/EX.
Recall What an Instruction Looks Like
add R8, R17, R18
is stored in binary format as
00000010 00110010 01000000 00100000
MIPS lays out instructions into fields
bits   31-26   25-21  20-16  15-11  10-6   5-0
field  op      rs     rt     rd     shamt  funct
value  000000  10001  10010  01000  00000  100000
op     operation of the instruction
rs     first register source operand
rt     second register source operand
rd     register destination operand
shamt  shift amount
funct  function (selects the type of operation)
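As a check on the field layout, the following Python sketch packs the six fields of `add R8, R17, R18` into a 32-bit word and unpacks them again. The helper names `encode_rtype` and `decode_fields` are hypothetical.

```python
def encode_rtype(rs, rt, rd, shamt, funct, op=0):
    """Pack R-type fields into a 32-bit word: op|rs|rt|rd|shamt|funct."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def decode_fields(word):
    """Split a 32-bit instruction word back into its R-type fields."""
    return {
        "op":    (word >> 26) & 0x3F,   # bits 31-26
        "rs":    (word >> 21) & 0x1F,   # bits 25-21
        "rt":    (word >> 16) & 0x1F,   # bits 20-16
        "rd":    (word >> 11) & 0x1F,   # bits 15-11
        "shamt": (word >> 6) & 0x1F,    # bits 10-6
        "funct": word & 0x3F,           # bits 5-0
    }

# add R8, R17, R18: rd = 8, rs = 17, rt = 18, funct = 100000 (binary)
word = encode_rtype(rs=17, rt=18, rd=8, shamt=0, funct=0b100000)
print(f"{word:032b}")
print(decode_fields(word))
```

The printed bit string matches the four bytes shown on the slide, read left to right.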
Remember the Registers In Between Each Stage
[Figure: five-stage pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
Data Hazard Stall Conditions
1a. EX/MEM.RegisterRd == ID/EX.RegisterRs
1b. EX/MEM.RegisterRd == ID/EX.RegisterRt
[Figure: pipelined datapath in which the Rd field held in EX/MEM is compared (Rs =? Rd, Rt =? Rd) with the Rs and Rt fields held in ID/EX]
Data Hazard Stall Conditions (cont)
1a. EX/MEM.RegisterRd == ID/EX.RegisterRs
1b. EX/MEM.RegisterRd == ID/EX.RegisterRt
2a. MEM/WB.RegisterRd == ID/EX.RegisterRs
2b. MEM/WB.RegisterRd == ID/EX.RegisterRt
[Figure: pipelined datapath in which the Rd fields held in EX/MEM and in MEM/WB are each compared (Rs =? Rd, Rt =? Rd) with the Rs and Rt fields held in ID/EX]
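The four comparisons can be written as one small function. This Python sketch is only an illustration of the conditions as listed: register numbers are plain integers, `None` stands for an empty pipeline register, and refinements that real hardware needs (for example checking that the earlier instruction actually writes a register) are left out. The function name is hypothetical.

```python
def stall_conditions(id_ex_rs, id_ex_rt, ex_mem_rd, mem_wb_rd):
    """Return the labels of the hazard conditions that hold this cycle."""
    hits = []
    if ex_mem_rd is not None and ex_mem_rd == id_ex_rs:
        hits.append("1a")   # EX/MEM.RegisterRd == ID/EX.RegisterRs
    if ex_mem_rd is not None and ex_mem_rd == id_ex_rt:
        hits.append("1b")   # EX/MEM.RegisterRd == ID/EX.RegisterRt
    if mem_wb_rd is not None and mem_wb_rd == id_ex_rs:
        hits.append("2a")   # MEM/WB.RegisterRd == ID/EX.RegisterRs
    if mem_wb_rd is not None and mem_wb_rd == id_ex_rt:
        hits.append("2b")   # MEM/WB.RegisterRd == ID/EX.RegisterRt
    return hits

# and R12, R2, R5 in ID/EX while sub R2, R1, R3 is in EX/MEM
print(stall_conditions(id_ex_rs=2, id_ex_rt=5, ex_mem_rd=2, mem_wb_rd=None))
# or R13, R6, R2 in ID/EX, and (Rd = R12) in EX/MEM, sub (Rd = R2) in MEM/WB
print(stall_conditions(id_ex_rs=6, id_ex_rt=2, ex_mem_rd=12, mem_wb_rd=2))
```

An empty result means none of the four conditions holds, so the instruction in ID/EX can proceed.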
Data Hazard Logic
[Figure: pipelined datapath with a Data Hazard Logic block that compares Rs and Rt from ID/EX with Rd from EX/MEM and Rd from MEM/WB (Rs =? Rd, Rt =? Rd)]
Example
Instruction         Rd    Rs    Rt
sub R2, R1, R3      R2    R1    R3
and R12, R2, R5     R12   R2    R5
or  R13, R6, R2     R13   R6    R2
add R14, R2, R2     R14   R2    R2
sw  R15, 100(R2)    -     R2    R15
(sw writes no register; R15 is the value being stored)
SUB-AND Hazard
EX/MEM.RegisterRd == ID/EX.RegisterRs == R2
SUB-OR Hazard
MEM/WB.RegisterRd == ID/EX.RegisterRt == R2
Do we care about the interaction between sub (instruction 1) and add (instruction 4)?
Example (cont)
IF/ID holds SUB R2, R1, R3
ID/EX, EX/MEM, and MEM/WB are still empty (Rs, Rt, and Rd not yet set)
[Figure: pipelined datapath with the Data Hazard Logic block]
Example (cont)
IF/ID holds AND R12, R2, R5
ID/EX holds SUB R2, R1, R3: Rs = R1, Rt = R3, Rd = R2
EX/MEM and MEM/WB are still empty (Rd not yet set)
[Figure: pipelined datapath with the Data Hazard Logic block]
Example (cont)
IF/ID holds OR R13, R6, R2
ID/EX holds AND R12, R2, R5: Rs = R2, Rt = R5, Rd = R12
EX/MEM holds SUB R2, R1, R3: Rd = R2
MEM/WB is still empty (Rd not yet set)
Data Hazard Logic:
EX/MEM.RegisterRd = R2 == ID/EX.RegisterRs = R2
EX/MEM.RegisterRd = R2 != ID/EX.RegisterRt = R5
[Figure: pipelined datapath with the Data Hazard Logic block]
Example (cont)
IF/ID holds ADD R14, R2, R2
ID/EX holds OR R13, R6, R2: Rs = R6, Rt = R2, Rd = R13
EX/MEM holds AND R12, R2, R5: Rd = R12
MEM/WB holds SUB R2, R1, R3: Rd = R2
Data Hazard Logic:
EX/MEM.RegisterRd = R12 != ID/EX.RegisterRs = R6
EX/MEM.RegisterRd = R12 != ID/EX.RegisterRt = R2
MEM/WB.RegisterRd = R2 != ID/EX.RegisterRs = R6
MEM/WB.RegisterRd = R2 == ID/EX.RegisterRt = R2
[Figure: pipelined datapath with the Data Hazard Logic block]
No Dependence Between Instruction 1 and 4
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
SUB R2, R1, R3      IM   REG  ALU  DM   REG
AND R12, R2, R5          IM   REG  ALU  DM   REG
OR R13, R6, R2                IM   REG  ALU  DM   REG
ADD R14, R2, R2                    IM   REG  ALU  DM   REG
How Do We Stall the Pipeline?
Compiler can insert nops
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
ADD R10, R11, R12   IM   REG  ALU  DM   REG
nop                      IM   REG  ALU  DM   REG
nop                           IM   REG  ALU  DM   REG
ADD R12, R10, R11                  IM   REG  ALU  DM   REG
Hardware Can Simulate NOPS
Program execution versus time:
Clock cycle:        1       2       3       4       5       6       7       8
ADD R10, R11, R12   IM      REG     ALU     DM      REG
stall                       IM      bubble  bubble  bubble  bubble
stall                               IM      bubble  bubble  bubble  bubble
ADD R12, R10, R11                           IM      REG     ALU     DM      REG
Reducing Data Hazards: Forwarding
Data may be already computed - just not in the Register File
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
ADD R10, R11, R12   IM   REG  ALU  DM   REG
ADD R12, R10, R11        IM   REG  ALU  DM   REG
[Figure: the result of the first ADD is available at the ALU output in cycle 3 and is forwarded to the ALU input of the second ADD in cycle 4]
Additions to the Datapath for Forwarding
[Figure: five-stage pipelined datapath with a forwarding path from the EX/MEM pipeline register back to a MUX at the ALU input; the diagram is annotated with ADD R12, R11, R10 followed by ADD R10, R11, R12 in adjacent stages]
Forwarding Continued
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
ADD R10, R11, R12   IM   REG  ALU  DM   REG
ADD R12, R10, R11        IM   REG  ALU  DM   REG
ADD R4, R5, R10               IM   REG  ALU  DM   REG
More Additions to the Datapath
[Figure: five-stage pipelined datapath with forwarding paths from both EX/MEM and MEM/WB back to MUXes at the ALU inputs; the diagram is annotated with ADD R4, R5, R10, ADD R12, R11, R10, and ADD R10, R11, R12 in successive stages]
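The slides show the forwarding paths but not the selection logic. The Python sketch below is one common textbook way to choose an ALU operand source, offered as an assumption rather than as the deck's own design: the newest result (EX/MEM) takes priority over the older one (MEM/WB), and `None` marks an empty pipeline register. The function name is hypothetical.

```python
def forward_select(operand_reg, ex_mem_rd, mem_wb_rd):
    """Choose where one ALU operand should come from."""
    if ex_mem_rd is not None and ex_mem_rd == operand_reg:
        return "EX/MEM"      # result of the instruction one ahead
    if mem_wb_rd is not None and mem_wb_rd == operand_reg:
        return "MEM/WB"      # result of the instruction two ahead
    return "REGFILE"         # no hazard: use the value read in decode

# ADD R12, R10, R11 in EX while ADD R10, R11, R12 is in MEM
print(forward_select(10, ex_mem_rd=10, mem_wb_rd=None))
# ADD R4, R5, R10 in EX: ADD R12 is in MEM, ADD R10 is in WB
print(forward_select(10, ex_mem_rd=12, mem_wb_rd=10))
print(forward_select(5, ex_mem_rd=12, mem_wb_rd=10))
```

Each ALU input would use its own copy of this decision to drive the MUX in front of it.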
Forwarding Doesn't Always Work
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
LW R10, 0x00(R4)    IM   REG  ALU  DM   REG
ADD R12, R10, R11        IM   REG  ALU  DM   REG
The loaded value leaves the data memory at the end of cycle 4, but the ADD needs it at the start of its ALU stage in cycle 4
Loads and Stores Require a Load Delay Slot
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
LW R10, 0x00(R4)    IM   REG  ALU  DM   REG
nop                      IM   REG  ALU  DM   REG
ADD R12, R10, R11             IM   REG  ALU  DM   REG
MIPS Load-Delay Slot
MIPS exposed the load-delay slot to the compiler
This makes it part of the architecture, not just an implementation detail
Therefore, it's up to the compiler (or assembly code writer) to make sure that the instruction after a load does not depend on the result of the load
An alternative would have been to force the hardware to detect the data hazard and stall the pipeline
Most of today's architectures detect the hazard and stall
3 Types of Data Hazards
Read-after-write (RAW)
a true dependency
Example
ADD R1, R2, R3
SUB R6, R7,R1
Write-after-read (WAR)
artificial dependency due to register assignment
Example
LW R1,0(R2)
ADD R2, R6, R3
Write-after-write (WAW)
artificial dependency due to register assignment
Example
LW R1, 0(R2)
ADD R1, R3, R4
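The three definitions translate directly into register comparisons. In this Python sketch each instruction is reduced to a destination register and a set of source registers; that representation and the function name `hazards` are assumptions made for illustration.

```python
def hazards(first, second):
    """Classify dependences of `second` on `first`.

    Each instruction is (dest, sources): dest is a register name or None,
    sources is a set of register names.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = []
    if dest1 is not None and dest1 in srcs2:
        found.append("RAW")   # second reads what first writes
    if dest2 is not None and dest2 in srcs1:
        found.append("WAR")   # second writes what first reads
    if dest1 is not None and dest1 == dest2:
        found.append("WAW")   # both write the same register
    return found

# ADD R1, R2, R3 then SUB R6, R7, R1
print(hazards(("R1", {"R2", "R3"}), ("R6", {"R7", "R1"})))
# LW R1, 0(R2) then ADD R2, R6, R3
print(hazards(("R1", {"R2"}), ("R2", {"R6", "R3"})))
# LW R1, 0(R2) then ADD R1, R3, R4
print(hazards(("R1", {"R2"}), ("R1", {"R3", "R4"})))
```

The three calls use the three example pairs from the slide and report RAW, WAR, and WAW respectively.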
Taxonomy of Hazards
Data Hazards are just one type of hazard that can occur
in a machine. There are actually 3 basic types of hazards
Hazard Taxonomy
Data hazards
Instruction depends on result of prior computation which is not ready
yet
Structural hazards
HW cannot support a combination of instructions
Control hazards
pipelining of branches and other instructions which change the PC
Structural Hazards
Structural hazards
HW cannot support a combination of instructions
Occurs when two or more instructions want to use the same
hardware resource in the same cycle
Causes bubble (stall) in pipelined machines
Overcome by replicating hardware resources
Examples
Multiple accesses to the register file
Branch adder and ALU
Multiple accesses to memory
Structural Hazard Example 1
Without the separate adder, both the branch address computation and the arithmetic computation would require access to the ALU in the same cycle
beq r1, r2, offset    ; if r1 == r2, then PC <-- PC + offset
[Figure: five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with a branch-target ADDER alongside the ALU]
Structural Hazard Example 2
Two instructions need access to memory in Clock Cycle 4.
If there is only one memory port, then only one instruction can read/write memory at a time
Program execution versus time:
Clock cycle:        1    2    3    4    5    6    7    8
LW R2, 0x10(R4)     IM   REG  ALU  DM   REG
SUB R5,R6,R7             IM   REG  ALU  DM   REG
ADD R10,R11,R12               IM   REG  ALU  DM   REG
ADD R12, R10, R11                  IM   REG  ALU  DM   REG
Structural Example 2 (cont)
Two instructions need access to memory in Clock Cycle 4.
If there is only one memory port, then only one instruction can read/write memory at a time
Program execution versus time:
Clock cycle:        1       2       3       4       5       6       7       8
LW R2, 0x10(R4)     IM      REG     ALU     DM      REG
SUB R5,R6,R7                IM      REG     ALU     DM      REG
ADD R10,R11,R12                     IM      REG     ALU     DM      REG
Stall                                       bubble  bubble  bubble  bubble  bubble
ADD R12, R10, R11                                   IM      REG     ALU     DM
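With a single memory port, the conflict can be predicted from instruction positions alone. This Python sketch assumes the stall-free timing used in these diagrams (instruction i, counted from 0, is fetched in cycle i + 1 and uses data memory in cycle i + 4); the function name and the set-of-indices input are hypothetical.

```python
def fetch_data_conflicts(memory_ops, n_instructions):
    """Find cycles where a data access and an instruction fetch collide.

    memory_ops: set of 0-based indices of loads and stores.
    Returns a list of (cycle, data_instr_index, fetch_instr_index).
    """
    conflicts = []
    for i in sorted(memory_ops):
        j = i + 3                     # instruction being fetched while i is in DM
        if j < n_instructions:
            conflicts.append((i + 4, i, j))
    return conflicts

program = ["LW R2, 0x10(R4)", "SUB R5,R6,R7", "ADD R10,R11,R12", "ADD R12, R10, R11"]
for cycle, data_i, fetch_i in fetch_data_conflicts({0}, len(program)):
    print(f"cycle {cycle}: '{program[data_i]}' (DM) vs '{program[fetch_i]}' (IM)")
```

For the program on the slide this reports the cycle 4 collision between the LW data access and the fetch of the fourth instruction, which is the fetch that gets stalled.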
Control Hazards - Branches
Example code
Address  Instruction
36       NOP
40       ADD R30,R30,R30
44       BEQ R1, R3, 24      <- this branches to address 72
48       AND R12, R2, R5
52       OR R13, R6, R2
56       ADD R14, R2, R2
60       ...
64       ...
68       ...
72       LW R4, 50(R7)
76       ...
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
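The target address 72 follows from the branch-target adder in the datapath: PC + 4 plus the sign-extended immediate shifted left by 2. The Python sketch below assumes the 24 written in the assembly is a byte offset, so the encoded 16-bit immediate would be 6 words; that reading, and the function names, are assumptions.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, imm16):
    """Target = (PC + 4) + (signExtend(imm16) << 2), kept to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

print(branch_target(44, 6))   # BEQ at address 44, offset 24 bytes = 6 words
print(branch_target(40, 7))   # a BEQ at address 40 with offset 28 bytes = 7 words
```

Both calls land on address 72, the LW instruction in the example code.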
Branch Hazards
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
Clock cycle:        1    2    3    4    5    6    7    8    9
44 BEQ R1, R3, 24   IM   REG  ALU  DM   REG
48 AND R12, R2, R5       IM   REG  ALU  DM   REG
52 OR R13, R6, R2             IM   REG  ALU  DM   REG
56 ADD R14, R2, R2                 IM   REG  ALU  DM   REG
60 or 72                                IM   REG  ALU  DM   REG
(the fifth instruction is fetched from 60 or 72, depending on the branch)
Always Stalling hurts the No-branch case
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
Clock cycle:        1       2       3       4       5       6       7       8       9
44 BEQ R1, R3, 24   IM      REG     ALU     DM      REG
stall                       IM      bubble  bubble  bubble  bubble
stall                               IM      bubble  bubble  bubble  bubble
stall                                       IM      bubble  bubble  bubble  bubble
48 AND R12, R2, R5                                  IM      REG     ALU     DM      REG
Solution: Assume Branch Not Taken
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
Clock cycle:        1    2    3    4    5    6    7    8    9
44 BEQ R1, R3, 24   IM   REG  ALU  DM   REG
48 AND R12, R2, R5       IM   REG  ALU  DM   REG
52 OR R13, R6, R2             IM   REG  ALU  DM   REG
56 ADD R14, R2, R2                 IM   REG  ALU  DM   REG
60 or 72                                IM   REG  ALU  DM   REG
(the fifth instruction is fetched from 60 or 72, depending on the outcome of the branch)
What Happens When the Branch IS Taken
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Clock cycle:        1    2    3    4    5    6    7    8    9
44 BEQ R1, R3, 24   IM   REG  ALU  DM   REG
48 AND R12, R2, R5       IM   REG  ALU  DM   REG
52 OR R13, R6, R2             IM   REG  ALU  DM   REG
56 ADD R14, R2, R2                 IM   REG  ALU  DM   REG
72 LW R4, 50(R7)                        IM   REG  ALU  DM   REG
Move the Branch Computation Forward
[Figure: five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with the branch computation moved to an earlier stage]
Branch with New Datapath
Reducing the penalty by 1 cycle
Clock cycle:        1    2    3    4    5    6    7    8    9
44 BEQ R1, R3, 24   IM   REG  ALU  DM   REG
48 AND R12, R2, R5       IM   REG  ALU  DM   REG
52 OR R13, R6, R2             IM   REG  ALU  DM   REG
72 LW R4, 50(R7)                   IM   REG  ALU  DM   REG
Move the Branch Computation Further Forward
Compare controls MUX select
[Figure: five-stage pipelined datapath with the branch-target ADDER and a Compare unit placed in the decode stage, right after IF/ID; the Compare output selects the next PC at the MUX]
Another New and Improved Datapath
Voila - the branch delay slot
Clock cycle:        1    2    3    4    5    6    7    8    9
44 BEQ R1, R3, 24   IM   REG  ALU  DM   REG
48 AND R12, R2, R5       IM   REG  ALU  DM   REG
72 LW R4, 50(R7)              IM   REG  ALU  DM   REG
Rewriting the Code for a Branch Delay Slot
Without Branch Delay Slot
Address  Instruction
36       NOP
40       ADD R30,R30,R30
44       BEQ R1, R3, 24
48       AND R12, R2, R5
52       OR R13, R6, R2
56       ADD R14, R2, R2
60       ...
64       ...
68       ...
72       LW R4, 50(R7)
76       ...
With Branch Delay Slot
Address  Instruction
36       NOP
40       BEQ R1, R3, 28
44       ADD R30, R30, R30
48       AND R12, R2, R5
52       OR R13, R6, R2
56       ADD R14, R2, R2
60       ...
64       ...
68       ...
72       LW R4, 50(R7)
76       ...
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
Performance of Pipelined Systems
Stalls due to data and branch hazards make performance
less than one instruction per cycle
Compiler is critical in determining overall performance
Compiler generates code that avoids stalls
Example
lw R15, 0x00(R2)
add R14, R15, R15
lw R16, 0x04(R2)
Might become:
lw R15, 0x00(R2)
lw R16, 0x04(R2)
add R14, R15, R15
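The benefit of the reordering can be checked by counting load-use pairs, that is, loads whose result is read by the very next instruction. This Python sketch uses a hypothetical tuple form `(op, dest, sources)` for each instruction and a hypothetical function name.

```python
def load_use_stalls(program):
    """Count loads whose destination is read by the immediately following instruction."""
    stalls = 0
    for (op, dest, _), (_, _, next_sources) in zip(program, program[1:]):
        if op == "lw" and dest in next_sources:
            stalls += 1
    return stalls

original = [
    ("lw",  "R15", ["R2"]),
    ("add", "R14", ["R15", "R15"]),
    ("lw",  "R16", ["R2"]),
]
reordered = [
    ("lw",  "R15", ["R2"]),
    ("lw",  "R16", ["R2"]),
    ("add", "R14", ["R15", "R15"]),
]

print(load_use_stalls(original))
print(load_use_stalls(reordered))
```

The original order has one load-use pair and the reordered code has none, because the second load fills the slot after the first one.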
Performance of Pipelined Systems
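[Figure: instructions versus time for unpipelined and pipelined execution; in the pipelined case successive instructions overlap, and latency marks the time one instruction takes from start to completion]
Ideally, $\mathrm{Throughput}_{pipeline} = \mathrm{Pipeline\ Depth} \times \mathrm{Throughput}_{sequential}$, since $\mathrm{Time}_{pipeline} = \mathrm{Time}_{sequential} / \mathrm{Pipeline\ Depth}$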
Pipeline Speedup and Throughput
Assume instruction execution takes n stages
$s_1, s_2, \ldots, s_n$ take time $t_1, t_2, \ldots, t_n$
Without pipelining
Throughput $= 1 / \sum_{i=1}^{n} t_i$
Latency $= 1 / \mathrm{Throughput} = \sum_{i=1}^{n} t_i$
With pipelining
Throughput $= 1 / \max_i t_i \le n / \sum_{i=1}^{n} t_i$
Latency $= n / \mathrm{Throughput} = n \cdot \max_i t_i$
Speedup $= \sum_{i=1}^{n} t_i / \max_i t_i \le n$
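These formulas can be tried on concrete stage times. The Python sketch below uses the unit delays from the earlier timing slide (10, 5, 10, 10, 5 ns) as the five stage times, which is an assumption about how the stages map to units; the function names are illustrative.

```python
def unpipelined(times):
    """Return (throughput, latency) when stages run back to back."""
    total = sum(times)
    return 1 / total, total

def pipelined(times):
    """Return (throughput, latency) when the clock fits the slowest stage."""
    cycle = max(times)
    return 1 / cycle, len(times) * cycle

def speedup(times):
    """Throughput gain of pipelining; at most the number of stages."""
    return sum(times) / max(times)

stage_ns = [10, 5, 10, 10, 5]   # IM, REG, ALU, DM, REG (assumed mapping)
print("unpipelined (instr/ns, ns):", unpipelined(stage_ns))
print("pipelined   (instr/ns, ns):", pipelined(stage_ns))
print("speedup:", speedup(stage_ns), "with n =", len(stage_ns))
```

With unbalanced stages the speedup is 4 rather than the ideal 5, and the latency of a single instruction grows from 40 ns to 50 ns because every stage is stretched to the slowest one.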
What Makes Pipelines Hard to Implement?
Detecting and resolving hazards
Exceptions and Interrupts
Instruction Set Architecture
CISC instructions are difficult to pipeline
Example:
stringMov from 0x1234, to 0x4000, 0x1000 bytes