CA07 2022S3 New
❑ Single cycle
   ➢ Stage latencies: IF (I-MEM) 180 ps, ID (Reg Read) 100 ps, EX (ALU) 160 ps,
      MEM (D-MEM) 200 ps (longest stage), WB (Reg W) 100 ps
   ➢ The clock period must fit the slowest instruction (LW), so a shorter
      instruction such as SW wastes part of every cycle.
❑ Multicycle
   ➢ The clock period is set by the longest stage; each instruction takes only the
      cycles it needs: LW takes 5 (IF ID Exec Mem Wr), SW takes 4 (IF ID Exec Mem),
      and BEQ then begins its IF (see the sketch below).
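A minimal sketch (Python) of the arithmetic behind this recap, assuming the
single-cycle clock period is the sum of the stage latencies, the multicycle clock
period is the longest stage, and the per-instruction cycle counts are those shown
above (5 for lw, 4 for sw).

# Sketch: single-cycle vs. multicycle timing from the stage latencies above.
stage_ps = {"IF": 180, "ID": 100, "EX": 160, "MEM": 200, "WB": 100}

single_cycle_period = sum(stage_ps.values())   # 740 ps: every instruction pays this
multicycle_period = max(stage_ps.values())     # 200 ps: clocked by the longest stage

cycles_needed = {"lw": 5, "sw": 4}             # cycle counts taken from the diagram
for instr, n in cycles_needed.items():
    print(f"{instr}: single-cycle {single_cycle_period} ps, "
          f"multicycle {n * multicycle_period} ps")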
❑ Can we do better?
   ➢ Pipelining: employs more concurrency (i.e., more “work” done in 1 cycle)
   ➢ Laundry analogy:
      ▪ 4 loads → speedup = 8/3.5 ≈ 2.3
      ▪ n → ∞ loads: speedup = 2n / (0.5n + 1.5) → 4 (see the sketch below)
      ▪ In the limit, speedup = number of stages.
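A small sketch (Python) of the laundry speedup formula above, assuming 4 stages of
0.5 hours each, so n sequential loads take 2n hours and n pipelined loads take
0.5n + 1.5 hours.

# Sketch: laundry-analogy speedup, assuming 4 stages of 0.5 h each.
def speedup(n_loads: int) -> float:
    sequential = 2.0 * n_loads            # each load takes 2 h on its own
    pipelined = 0.5 * n_loads + 1.5       # 1.5 h fill time, then 0.5 h per load
    return sequential / pipelined

print(speedup(4))          # 8 / 3.5 ≈ 2.3, as on the slide
print(speedup(1_000_000))  # approaches 4 = number of stages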
Single-cycle vs multi-cycle vs pipeline
❑ Five stages, one step per stage
➢ Each step takes 1 clock cycle → instructions enter and leave the pipeline at the
   rate of one per clock cycle
[Figure: single-cycle vs multicycle vs pipelined timing]
   ➢ Single-cycle implementation: one long clock cycle per instruction; sw wastes
      part of the cycle sized for lw.
   ➢ Multicycle implementation: a short clock; lw takes 5 cycles (IF ID EX MEM WB),
      sw takes 4 (IF ID EX MEM), and the next instruction’s IF fills cycle 10.
   ➢ Pipeline implementation: lw, sw, and an R-type overlap, each occupying
      IF ID EX MEM WB in successive cycles; the pipeline clock is the same as the
      multicycle clock.
Pipeline performance
❑ Ideal pipeline assumptions
➢ Identical operations, e.g. four laundry steps are repeated for all loads
➢ Independent operations, e.g. one laundry load does not depend on another
➢ Uniformly partitionable suboperations (that do not share resources), e.g.
laundry steps have uniform latency.
✓ Latency = execution time (delay or response time) = the total time from start to
finish of ONE instruction
✓ Throughput (or execution bandwidth) = the total amount of work done in a given
amount of time
Example
❑ Assume the execution times for the stages in a RISC-V datapath are
✓ 100ps for register read or write
✓ 200ps for other stages
Single-cycle (Tc = 800 ps): every instruction takes one 800 ps cycle; a shorter
instruction such as sw wastes part of the cycle sized for lw.
Pipelined (Tc = 200 ps): five instructions (lw, sw, then R-types) overlap, each
occupying IF ID EX MEM WB in consecutive cycles (9 cycles in total); the cycles
before the first instruction completes are the pipeline’s fill time.
❑ Time between the 1st and 5th instructions: single-cycle = 3200 ps (4 × 800 ps) vs
   pipelined = 800 ps (4 × 200 ps) → speedup = 4.
   ➢ Total execution time for 5 instructions: 4000 ps vs 1800 ps → speedup ≈ 2.22
      → Why isn’t the speedup 5 (the number of stages)? What’s wrong?
   ➢ Think of real programs, which execute billions of instructions (see the sketch
      below).
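A small sketch (Python) of this comparison, assuming a single-cycle clock of 800 ps
and a pipelined clock of 200 ps with a 5-stage pipeline and no stalls; the speedup
climbs from about 2.22 for 5 instructions toward 800/200 = 4 as the instruction
count grows.

# Sketch: single-cycle vs. pipelined execution time (Tc_single = 800 ps,
# Tc_pipe = 200 ps, 5 stages, no stalls).
STAGES, TC_SINGLE, TC_PIPE = 5, 800, 200

def single_cycle_time(n: int) -> int:
    return n * TC_SINGLE

def pipelined_time(n: int) -> int:
    # (STAGES - 1) fill cycles, then one instruction completes per cycle.
    return (STAGES - 1 + n) * TC_PIPE

for n in (5, 1_000_000):
    s, p = single_cycle_time(n), pipelined_time(n)
    print(f"n={n}: {s} ps vs {p} ps, speedup = {s / p:.2f}")
# n=5 gives 4000 ps vs 1800 ps (speedup 2.22); for large n the speedup approaches
# 800/200 = 4, not 5, because the stages are not perfectly balanced.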
Symbolic representation of 5 stages
IF  = Instruction Fetch   (IMEM read)
ID  = Instruction Decode  (Reg read)
EX  = Execute             (ALU)
MEM = Memory Access       (DMEM)
WB  = Write Back          (Reg write)
Symbolic representation of pipelined RISC-V datapath
[Figure: pipeline diagram for the sequence add t0, t1, t2; or t3, t4, t5;
 sw t0, 4(t3); lw t0, 8(t3)]
   ➢ A row shows one instruction’s resource use over time; a column shows the
      resources used in a particular time slot.
   ➢ t_instruction = 1000 ps, t_cycle = 200 ps.
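A small illustration (Python) of the latency/throughput distinction using the
numbers above: each instruction still takes t_instruction = 1000 ps from start to
finish, but once the pipeline is full one instruction completes every
t_cycle = 200 ps.

# Sketch: latency vs. throughput for the pipelined datapath above.
T_INSTRUCTION_PS = 1000   # latency: start-to-finish time of ONE instruction
T_CYCLE_PS = 200          # one instruction completes per cycle in steady state

latency_ns = T_INSTRUCTION_PS / 1000
throughput_per_ns = 1000 / T_CYCLE_PS     # instructions completed per ns

print(f"latency    = {latency_ns} ns per instruction")          # not reduced by pipelining
print(f"throughput = {throughput_per_ns} instructions per ns")  # vs. 1000/800 = 1.25 for the 800 ps single-cycle design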
Pipelined datapath design
[Figure: single-cycle RISC-V datapath split into five pipeline stages, with
 lw t0, 8(t3) in Instruction Fetch, sw t0, 4(t3) in Instruction Decode,
 slt t6, t0, t3 in ALU Execution, or t3, t4, t5 in Memory Access, and
 add t0, t1, t2 in Write Back]
   ➢ The datapath keeps the single-cycle components (PC, +4 adder, IMEM, register
      file with AddrA/AddrB/AddrD and DataA/DataB/DataD, Imm. Gen, branch comparator,
      ALU, DMEM, and the write-back mux), with pipeline registers inserted between
      the stages.
❖ Now, let’s check the flow of instructions through the pipeline cycle-by-cycle!
IF for Load
[Figure: pipelined datapath with the Instruction Fetch stage active for lw t0, 8(t3)]
The instruction word is fetched from memory and stored in the IF/ID buffer because
it will be needed in the next stage.
ID for Load
[Figure: pipelined datapath with the Instruction Decode stage active for lw t0, 8(t3)]
The 12-bit immediate is fetched from the IF/ID buffer, sign-extended, and stored in
the ID/EX buffer for use in a later stage (see the sketch below). The rs1 and rs2
values are fetched from the register file and also stored in the ID/EX buffer.
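A minimal sketch (Python) of the sign extension done in this stage, assuming the
standard RV32I I-type encoding, where the 12-bit immediate occupies bits 31:20 of
the instruction word.

# Sketch: extract and sign-extend the 12-bit I-type immediate (bits 31:20).
def sign_extend_i_imm(inst: int) -> int:
    imm = (inst >> 20) & 0xFFF    # 12-bit immediate field
    if imm & 0x800:               # sign bit set -> negative value
        imm -= 1 << 12
    return imm

print(sign_extend_i_imm(0x008E2283))  # lw t0, 8(t3)  -> 8
print(sign_extend_i_imm(0xFF8E2283))  # lw t0, -8(t3) -> -8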
EX for Load
[Figure: pipelined datapath with the ALU Execution stage active for lw t0, 8(t3)]
The rs1 value is taken from the ID/EX buffer and passed to the ALU. The ALU result
is stored in the EX/MEM buffer for use as the memory address in the next stage.
MEM for Load
[Figure: pipelined datapath with the Memory Access stage active for lw t0, 8(t3);
 the address held in the EX/MEM buffer is used to read DMEM]
WB for Load
[Figure: pipelined datapath with the Write Back stage active for lw t0, 8(t3)]
The value from data memory is selected and passed back to the register file.
Problem: the write register number (AddrD) reaching the register file at this point
comes from a later instruction, i.e., the wrong register number.
Corrected Datapath for Load
[Figure: corrected pipelined datapath, with the write register number for
 lw t0, 8(t3) carried along to the Write Back stage]
The problem is fixed by passing the write register number through the inter-stage
buffers and feeding it back just in time → this adds 5 more bits to each of the
last three buffers (see the sketch below).
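A minimal sketch (Python) of this fix, using simple record types for the buffers
after ID; the field names (rd, alu_result, mem_data, ...) are illustrative, not the
lecture's notation.

# Sketch: the 5-bit destination-register number rides along in every buffer after
# ID, so WB writes the register of the instruction that is actually retiring.
from dataclasses import dataclass

@dataclass
class IDEX:            # ID/EX buffer: rd is captured here during decode
    rs1_val: int
    imm: int
    rd: int

@dataclass
class EXMEM:           # EX/MEM buffer: rd is just passed along
    alu_result: int
    rd: int

@dataclass
class MEMWB:           # MEM/WB buffer: rd is finally consumed in WB
    mem_data: int
    alu_result: int
    rd: int

def write_back(regfile: list, memwb: MEMWB, mem_to_reg: bool) -> None:
    # WB uses the rd that travelled with the instruction, not the rd of the
    # (different) instruction currently sitting in the ID stage.
    regfile[memwb.rd] = memwb.mem_data if mem_to_reg else memwb.alu_result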
Pipelined control signals
❑ Control signals are derived from the instruction and determined during ID, as in
   the single-cycle implementation.
   ➢ As the instruction moves, its control signals must move with it → extend the
      pipeline registers to include the control signals.
   ➢ Each stage uses some of the control signals (see the sketch below).
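A small sketch (Python) of how the ID-generated control signals could be grouped by
the stage that consumes them and carried forward; the signal names (ALUSrc, ALUOp,
MemRead, RegWrite, ...) follow the usual textbook naming and are assumptions rather
than the exact lecture signals.

# Sketch: control signals grouped by consuming stage; each pipeline register keeps
# only the groups still needed downstream.
EX_CTRL  = ("ALUSrc", "ALUOp")                 # consumed in ALU Execution
MEM_CTRL = ("MemRead", "MemWrite", "Branch")   # consumed in Memory Access
WB_CTRL  = ("RegWrite", "MemtoReg")            # consumed in Write Back

# All signals are produced in ID (values below are the usual settings for lw).
ctrl = {"ALUSrc": 1, "ALUOp": "add", "MemRead": 1, "MemWrite": 0,
        "Branch": 0, "RegWrite": 1, "MemtoReg": 1}

id_ex  = {k: ctrl[k] for k in EX_CTRL + MEM_CTRL + WB_CTRL}  # everything moves to EX
ex_mem = {k: id_ex[k] for k in MEM_CTRL + WB_CTRL}           # EX drops its own group
mem_wb = {k: ex_mem[k] for k in WB_CTRL}                     # only WB signals remain
print(id_ex, ex_mem, mem_wb, sep="\n")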
[Figure: pipeline diagram of the sequence add t0, t1, t2; or t3, t4, t5;
 slt t6, t0, t3; sw t0, 4(t3); lw t0, 8(t3), each passing through IM, Reg, ALU,
 DM, and Reg in successive cycles]
Data hazard example
❑ If the same register is written and read in the same cycle
   ➢ WB must write the value before ID reads it, so the reader gets the new value.
   ➢ This is not a structural hazard, since separate read and write ports allow a
      simultaneous read and write.
[Figure: pipeline diagram of the same sequence; add t0, t1, t2 writes t0 in its WB
 stage while the later instructions slt t6, t0, t3 and sw t0, 4(t3) read t0 in
 their ID stages]
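A small sketch (Python) of when t0 is written and read in this sequence, assuming
the usual 5-stage timing (instruction i, issued one per cycle, reads registers in
ID during cycle i + 1 and writes in WB during cycle i + 4) and the write-before-read
register file described above.

# Sketch: does each reader of t0 see the value written by add t0, t1, t2?
ADD_POSITION = 1
add_wb_cycle = ADD_POSITION + 4                # add writes t0 in cycle 5

readers_of_t0 = {"slt t6, t0, t3": 3, "sw t0, 4(t3)": 4}   # instruction positions
for instr, pos in readers_of_t0.items():
    id_cycle = pos + 1
    ok = id_cycle >= add_wb_cycle              # same cycle is fine: write before read
    print(f"{instr}: reads t0 in cycle {id_cycle} -> "
          f"{'new value' if ok else 'stale value (data hazard)'}")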
Control hazard example
❑ If the beq branch is taken, the wrong instructions will already have been
   fetched, because the branch decision is made only in the MEM stage.
[Figure: pipeline diagram of beq followed by Inst 1–5, in instruction order]
   ➢ Inst 1–3 are fetched regardless of the branch outcome, because the outcome is
      ready only after beq’s MEM stage.
   ➢ Only the later fetches use a PC that has been updated to reflect the branch
      outcome (see the sketch below).
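A tiny sketch (Python) of the fetch timing behind this figure, assuming one
instruction fetched per cycle and the branch outcome becoming available at the end
of beq's MEM stage (cycle 4).

# Sketch: how many instructions are fetched before the beq outcome is known.
branch_resolved_cycle = 4                                 # end of beq's MEM stage
fetch_cycle = {f"Inst {i}": 1 + i for i in range(1, 6)}   # Inst i fetched in cycle 1+i

for name, cycle in fetch_cycle.items():
    blind = cycle <= branch_resolved_cycle
    print(f"{name}: fetched in cycle {cycle} -> "
          f"{'regardless of branch outcome' if blind else 'PC reflects branch outcome'}")
# Inst 1-3 are fetched blindly; from Inst 4 on, the PC reflects the branch outcome.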
Summary
❑ Pipelined processor
➢ Speedup comes from increased throughput; the latency of each instruction does
   not decrease.
➢ Implemented by adding state registers to the single-cycle datapath.
➢ The pipeline registers are also extended to include the control signals.
❑ The basic idea of pipelining is easy, but the devil is in the details
➢ Hazard: a situation in which a planned instruction cannot execute in the
“proper” clock cycle
▪ Structural hazard
▪ Data hazard
▪ Control hazard
➢ Pipeline hazards are serious problems that cannot be ignored