Slide 1
[Timing diagram residue: single-cycle LW/SW with wasted slack]
❑ Multicycle
[Timing diagram over cycles 1–10: LW takes 5 cycles (IF ID Exec Mem Wr), SW takes 4 (IF ID Exec Mem), then the next instruction's IF begins]
➢ Shorter clock cycle time: constrained by longest step, not longest instruction
➢ Higher overall performance: simpler instructions take fewer cycles, less waste
Single cycle vs. Multicycle
❑ Assume the following operation times for components:
➢ Instruction and data memories: 200 ps
➢ ALU and adders: 180 ps
➢ Decode and Register file access (read or write): 150 ps
➢ Ignore the delays in PC, mux, extender, and wires
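The single-cycle vs. multicycle clock periods follow directly from these delays. A minimal sketch, assuming the classic 5-step breakdown per instruction (the step lists below are an assumption, not given verbatim on the slide):

```python
# Clock-period comparison from the stated component delays (in ps).
# The per-instruction step breakdown is the classic 5-step MIPS
# datapath; it is an assumption, not taken verbatim from the slide.
MEM, ALU, REG = 200, 180, 150   # memory, ALU/adder, decode/regfile access

steps = {
    "lw":     [MEM, REG, ALU, MEM, REG],  # IF, ID/RegRead, EX, MEM, WB
    "sw":     [MEM, REG, ALU, MEM],       # no write-back
    "R-type": [MEM, REG, ALU, REG],       # no data-memory access
    "beq":    [MEM, REG, ALU],            # no memory access, no write-back
}

# Single cycle: the clock must fit the LONGEST instruction (lw).
single_cycle_Tc = max(sum(s) for s in steps.values())   # 880 ps

# Multicycle: the clock must only fit the LONGEST single step.
multicycle_Tc = max(max(s) for s in steps.values())     # 200 ps

print(single_cycle_Tc, multicycle_Tc)
```

This makes the first bullet concrete: the multicycle clock (200 ps) is set by the longest step, the single-cycle clock (880 ps) by the longest instruction.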
❑ Can we do better?
➢ Yes: More concurrency → Higher instruction throughput (i.e., more “work”
completed in one cycle)
❑ Idea: when an instruction is using some resources in its
processing phase, process other instructions on idle resources
➢ E.g., when an instruction is being decoded, fetch the next instruction
➢ E.g., when an instruction is being executed, decode another instruction
➢ E.g., when an instruction is accessing data memory (lw/sw), execute the
next instruction
➢ E.g., when an instruction is writing its result into the register file, access data
memory for the next instruction
A laundry analogy
❑ Sequential laundry: wash-dry-fold-put away cycle
❑ Pipelined laundry: start the next load at each step completion
➢ Parallelism improves performance. How much?
❖ Four loads
➢ Speedup = 8 / 3.5 ≈ 2.3
❖ Non-stop (𝑛 → ∞ loads)
➢ Speedup = 2𝑛 / (0.5𝑛 + 1.5) ≈ 4 = number of stages
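The laundry speedup can be checked numerically. A short sketch, assuming 4 stages of 0.5 h each (2 h per load), matching the figures above:

```python
# Laundry-pipeline speedup, assuming 4 stages of 0.5 h each (2 h/load).
# Sequential: 2n hours. Pipelined: the first load finishes after 2 h and
# each later load finishes 0.5 h after the previous one -> 0.5n + 1.5 h.
def speedup(n_loads):
    sequential = 2.0 * n_loads
    pipelined = 0.5 * n_loads + 1.5
    return sequential / pipelined

print(round(speedup(4), 1))      # 2.3, i.e. 8 / 3.5
print(round(speedup(10**6), 2))  # approaches 4, the number of stages
```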
Single-cycle vs multi-cycle vs pipeline
❑ Five stages, one step per stage
➢ Each step requires 1 clock cycle → steps enter/leave pipeline at the rate of
one step per clock cycle
Single Cycle Implementation:
[Timing diagram: lw and sw each occupy one long clock cycle; sw finishes its work early, so part of its cycle is waste]
[Timing diagram over cycles 1–10: lw, sw, and an R-type instruction step through IF ID EX MEM WB, with the multicycle and pipelined views overlapping successive instructions by one cycle]
Pipeline performance
❑ Ideal pipeline assumptions
➢ Identical operations, e.g. four laundry steps are repeated for all loads
➢ Independent operations, e.g. no dependency between laundry steps
➢ Uniformly partitionable suboperations (that do not share resources), e.g.
laundry steps have uniform latency.
✓ Latency = execution time (delay or response time) = the total time from start to finish
of ONE instruction
✓ Throughput (or execution bandwidth) = the total amount of work done in a given
amount of time
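The distinction matters because pipelining improves throughput but can actually lengthen latency. A small sketch, with Tc values assumed for illustration (the 200 ps and 800 ps figures anticipate the example that follows):

```python
# Latency vs. throughput for an ideal 5-stage pipeline; the Tc values
# (200 ps pipelined, 800 ps single cycle) are illustrative assumptions.
stages, Tc_pipe, Tc_single = 5, 200, 800

latency_single = Tc_single        # 800 ps, one instruction start to finish
latency_pipe = stages * Tc_pipe   # 1000 ps: pipelining can RAISE latency

throughput_single = 1 / Tc_single  # 1 instruction per 800 ps
throughput_pipe = 1 / Tc_pipe      # 1 instruction per 200 ps, once full

print(latency_single, latency_pipe)
```

So a pipelined instruction takes longer end to end, yet the machine completes four times as many instructions per unit time once the pipeline is full.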
Example: performance of a MIPS pipelined processor
❑ Assume time for stages is
✓ 100ps for register read or write
✓ 200ps for other stages
[Timing diagram: single-cycle execution of lw and sw at Tc = 800 ps, with waste in sw's cycle]
Pipelined (Tc=200ps)
[Timing diagram over cycles 1–9: lw, sw, an R-type, and two further instructions each step through IF ID EX MEM WB, one cycle apart; the first four cycles are the pipeline's fill time]
❑ Time between the 1st and 5th instructions: single-cycle = 3200 ps (4 × 800 ps) vs. pipelined = 800 ps (4 × 200 ps) → speedup = 4
➢ Execution time for 5 instructions: 4000 ps vs. 1800 ps ≈ 2.22× speedup
→ Why isn't the speedup 5 (the number of stages)? What's wrong?
➢ Think of real programs, which execute billions of instructions.
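The numbers above can be reproduced, and the question answered, with a short calculation using the slide's figures:

```python
# Execution-time comparison for n instructions, using the slide's numbers:
# single-cycle Tc = 800 ps; pipelined Tc = 200 ps with 5 stages.
def single_cycle_time(n, Tc=800):
    return n * Tc

def pipelined_time(n, Tc=200, stages=5):
    return (stages - 1 + n) * Tc   # fill time, then one instruction/cycle

print(single_cycle_time(5), pipelined_time(5))             # 4000 vs 1800 ps
print(round(single_cycle_time(5) / pipelined_time(5), 2))  # 2.22

# For billions of instructions the fill time vanishes and the speedup
# approaches the clock-period ratio 800/200 = 4, not the stage count 5,
# because the 200 ps pipelined clock is set by the slowest stage (200 ps),
# not by an even split of 800/5 = 160 ps.
n = 10**9
print(round(single_cycle_time(n) / pipelined_time(n), 2))  # 4.0
```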
MIPS ISA support for pipelining
❑ What makes it easy
➢ All instructions are 32-bits
• Easier to fetch and decode in one cycle: fetch in the 1st stage and
decode in the 2nd stage
• cf. x86: 1- to 17-byte instructions
➢ Few and regular instruction formats
• Can decode and read registers in one step
➢ Memory operations occur only in loads and stores
• Can calculate address in 3rd stage, access memory in 4th stage
➢ Operands must be aligned in memory
• Memory access takes only one cycle
➢ Each instruction writes at most one result (i.e., changes the machine state)
and does it in the last few pipeline stages (MEM or WB)
▪ 5 stages → on any given cycle, up to 5 instructions will be at various points of execution.
Instruction word is fetched from memory, and stored in the IF/ID buffer because it will be needed in the next stage.
ID for Load, Store, …
➢ PC+4 is passed forward to the ID/EX buffer
➢ RR #1 & #2 contents are fetched & stored in the ID/EX buffer
➢ RR #1 contents are taken from the ID/EX buffer & passed to the ALU
➢ The 16-bit literal is provided to the ALU as the second operand
So we fix the problem by passing the Write register number from the load instruction through the various inter-stage buffers and then feeding it back just in time → adding five more bits to each of the last three pipeline registers.
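The idea of carrying the destination register number along with the instruction can be sketched in a few lines. This is a minimal illustrative model, not a real simulator; all names (`PipeReg`, `clock`, the fields) are hypothetical:

```python
# Sketch of the fix: carry the load's 5-bit Write register number through
# the last three pipeline registers so write-back lands on the right
# register. All names here are illustrative, not from a real simulator.
from dataclasses import dataclass

@dataclass
class PipeReg:
    valid: bool = False
    write_reg: int = 0   # the five extra bits carried forward
    result: int = 0

regs = [0] * 32
id_ex, ex_mem, mem_wb = PipeReg(), PipeReg(), PipeReg()

def clock(new_write_reg=None, new_result=0):
    """Advance one cycle: write back, then shift the pipeline registers."""
    global id_ex, ex_mem, mem_wb
    if mem_wb.valid:
        regs[mem_wb.write_reg] = mem_wb.result  # WB uses the fed-back number
    mem_wb, ex_mem = ex_mem, id_ex
    id_ex = (PipeReg(True, new_write_reg, new_result)
             if new_write_reg is not None else PipeReg())

# A load targeting $8 produces 42; "8" rides ID/EX -> EX/MEM -> MEM/WB
# and reaches the register file's Write register port just in time.
clock(8, 42)
clock(); clock(); clock()
print(regs[8])  # 42
```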
Multi-Cycle Pipeline Diagram (1)
❑ Shows the complete execution of instructions in a single figure
➢ Instructions are listed in instruction execution order from top to bottom
➢ Clock cycles move from left to right
➢ Figure shows the use of resources at each stage and each cycle
Multi-Cycle Pipeline Diagram (2)
❑ Can help with answering questions like:
➢ How many cycles does it take to execute this code?
➢ What is the ALU doing during cycle 4?
➢ Is there a hazard, why does it occur, and how can it be fixed? (later)
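A diagram like this is easy to generate mechanically, which also answers the cycle-count question: n instructions on a 5-stage pipeline finish in n + 4 cycles. A small sketch (the rendering format is my own, not the textbook's):

```python
# Render a multi-cycle pipeline diagram as text: instructions top to
# bottom, clock cycles left to right, one stage per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(instrs):
    total_cycles = len(instrs) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instrs):
        cells = ["    "] * i + [f"{s:<4}" for s in STAGES]
        cells += ["    "] * (total_cycles - len(cells))
        rows.append(f"{name:<8}" + "".join(cells))
    return total_cycles, "\n".join(rows)

cycles, text = diagram(["lw", "sw", "R-type"])
print(f"{cycles} cycles")   # 3 instructions on 5 stages -> 7 cycles
print(text)
```

Reading a column of the output answers "what is the ALU doing during cycle 4?": whichever instruction's EX cell falls in that column.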
Pipelined control: control points
❑ Same control points as in the single-cycle datapath
Pipelined control: settings (1)
❑ Control signals derived from instruction & determined during ID
➢ As the instruction moves → pipeline the control signals → extend the pipeline
registers to include the control signals
➢ Each stage uses some of the control signals
[Control-signal table fragment, column headers lost: sw → X 0 0 1 0 0 1 0 X; beq → X 0 1 0 1 0 0 0 X]
Pipelined control: complete
Can Pipelining Get Us Into Trouble?
❑ Yes: the instruction pipeline is not an ideal pipeline
➢ different instructions → not all need the same stages: some pipe stages idle
for some instructions → external fragmentation
➢ different pipeline stages → not the same latency: some pipe stages are too
fast but all take the same clock cycle time → internal fragmentation
➢ instructions are not independent of each other → pipeline stalls: pipeline is
not always moving
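Stalls translate directly into lost throughput: every stall cycle adds one cycle without retiring an instruction. A minimal sketch of the effective CPI under stalls (the numbers are illustrative assumptions, not from the slides):

```python
# Effective CPI when the pipeline is not ideal: each stall cycle adds
# directly to the cycle count, degrading the ideal CPI of ~1.
def effective_cpi(n_instr, stall_cycles, stages=5):
    cycles = (stages - 1) + n_instr + stall_cycles  # fill + issue + stalls
    return cycles / n_instr

print(round(effective_cpi(10**6, 0), 3))        # ~1.0 with no stalls
print(round(effective_cpi(10**6, 200_000), 3))  # 0.2 stalls/instr -> ~1.2
```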
[Pipeline diagram, instruction order top to bottom: add $1,$2,$3, Inst 1, Inst 2, then add $2,$1,$3, each flowing through IM, Reg, ALU, DM, Reg stages one cycle apart]