CA07 2022S3 New

The lecture discusses the design and implementation of pipelined processors, highlighting the drawbacks of single-cycle and multicycle processors. Pipelining improves performance by allowing multiple instructions to be processed simultaneously across different stages, thus increasing throughput. The document also includes examples comparing execution times and speedups between single-cycle, multicycle, and pipelined architectures.


ELT3047 Computer Architecture

Lecture 7: Pipelined processor

Hoang Gia Hung


Faculty of Electronics and Telecommunications
University of Engineering and Technology, VNU Hanoi
Last lecture review
❑ Quiz
❑ Control Unit design
➢ Control signals can be derived manually for a particular instruction.
➢ To design the Control Unit, we must generate the control signals for every
instruction in the ISA.
➢ Only 9 bits of the instruction encoding are needed to determine the instruction
type.

❑ Control Unit implementation


➢ ROM implementation
➢ Combinatorial logic implementation
▪ Multilevel implementation: simplifies the design process, reduces the size of
the main controller, and potentially speeds up the circuit

❑ Today's lecture: Pipelined processor


Drawbacks of Single Cycle Processor
❑ All instructions take as much time as the slowest instruction
➢ Not all instructions need all 5 stages
Multicycle Implementation
❑ Can we improve single-cycle processor performance?
➢ Reduce cycle time to accommodate one stage per clock cycle.
➢ Clock cycle time is now constrained by longest stage.

IF        ID         EX        MEM       WB
I-MEM     Reg Read   ALU       D-MEM     Reg W
180 ps    100 ps     160 ps    200 ps    100 ps
                               ↑ longest stage

➢ Introduce a little more complexity into the datapath so that simpler
instructions take fewer cycles.
Single cycle vs. multicycle
❑ Single cycle
Cycle 1 Cycle 2

LW SW
waste
❑ Multicycle

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
IF ID Exec Mem Wr IF ID Exec Mem IF
LW SW BEQ

✓ Shorter clock cycle time


✓ Less waste → higher overall performance
✗ Design and implementation are more complicated
Example
❑ Assume the following operation times for components:
➢ Instruction and data memories: 200 ps
➢ ALU and adders: 180 ps
➢ Decode and Register file access (read or write): 150 ps
➢ Ignore the delays in PC, mux, extender, and wires

❑ Assume the following instruction mix:


➢ 40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps

❑ Which of the following would be faster and by how much?


➢ Single-cycle implementation for all instructions
➢ Multicycle implementation optimized for every class of instructions
Example solution

❑ For fixed single-cycle implementation:


➢ Clock cycle = 880 ps determined by longest delay (load instruction)

❑ For multi-cycle implementation:


➢ Clock cycle = max (200, 150, 180) = 200 ps (maximum delay at any step)
➢ Average CPI = 0.4×4 + 0.2×5 + 0.1×4 + 0.2×3 + 0.1×2 = 3.8

❑ Speedup = 880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16
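The arithmetic can be checked with a short script; the per-class cycle counts (4 for ALU, 5 for loads, 4 for stores, 3 for branches, 2 for jumps) are read off the CPI sum above:

```python
# Component delays in ps, from the example
imem = dmem = 200          # instruction / data memory
alu = 180                  # ALU and adders
reg = 150                  # decode / register file access

# Single-cycle: clock set by the slowest instruction (lw uses all 5 stages)
single_cycle = imem + reg + alu + dmem + reg
print(single_cycle)        # 880 ps

# Multicycle: clock set by the slowest individual step
multi_clock = max(imem, reg, alu)
print(multi_clock)         # 200 ps

# Instruction mix (ALU, load, store, branch, jump) and cycles per class
mix    = [0.4, 0.2, 0.1, 0.2, 0.1]
cycles = [4,   5,   4,   3,   2]
avg_cpi = sum(f * c for f, c in zip(mix, cycles))
print(round(avg_cpi, 2))   # 3.8

speedup = single_cycle / (avg_cpi * multi_clock)
print(round(speedup, 2))   # 1.16
```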


The idea of pipelining
❑ Limitations of the multicycle design
➢ Some HW resources are idle during different phases of the instruction cycle,
e.g. “Fetch” logic is idle when an instruction is being “decoded” or “executed”
➢ Most of the datapath is idle when a memory access is happening.

❑ Can we do better?
➢ Pipelining: employs more
concurrency (i.e., more
“work” done in 1 cycle)
➢ Laundry analogy:
▪ 4 loads → speedup = 8/3.5 ≈ 2.3
▪ n loads → speedup = 2n / (0.5n + 1.5), which approaches 4 as n → ∞
▪ In the limit, speedup = number of stages.
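The analogy's numbers can be reproduced with a small helper. The assumptions (4 laundry steps of 0.5 hours each, so n loads take 2n hours sequentially and 0.5n + 1.5 hours pipelined) are those implied by the slide:

```python
def laundry_speedup(n, stages=4, step_hours=0.5):
    """Speedup of pipelined over sequential laundry for n loads."""
    sequential = n * stages * step_hours                    # 2n hours
    pipelined = stages * step_hours + (n - 1) * step_hours  # 0.5n + 1.5 hours
    return sequential / pipelined

print(round(laundry_speedup(4), 2))      # 2.29, i.e. the slide's 8/3.5 ≈ 2.3
print(round(laundry_speedup(10**6), 3))  # 4.0: approaches the number of stages
```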
Single-cycle vs multi-cycle vs pipeline
❑ Five stages, one step per stage
➢ Each step requires 1 clock cycle → steps enter/leave pipeline at the rate of
one step per clock cycle
Single-cycle Implementation:
Cycle 1 Cycle 2
Clk

lw sw Waste

Multicycle Implementation:

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Clk
lw sw
IF ID EX MEM WB IF ID EX MEM IF

Pipeline Implementation: (pipeline clock is the same as the multi-cycle clock)
lw      IF  ID  EX  MEM WB
sw          IF  ID  EX  MEM WB
R-type          IF  ID  EX  MEM WB
Pipeline performance
❑ Ideal pipeline assumptions
➢ Identical operations, e.g. four laundry steps are repeated for all loads
➢ Independent operations, e.g. no dependency between laundry steps
➢ Uniformly partitionable suboperations (that do not share resources), e.g.
laundry steps have uniform latency.

❑ Ideal pipeline speedup


➢ Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
➢ Speedup is due to increased throughput (*); latency (*) does not decrease.

❑ Speedup for non-ideal pipelines is less


➢ External/internal fragmentation, pipeline stalls.

✓ Latency = execution time (delay or response time) = the total time from start to
finish of ONE instruction
✓ Throughput (or execution bandwidth) = the total amount of work done in a given
amount of time
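The relation above can be written out directly. A small sketch, using the balanced-stage numbers that appear later in the lecture (1000 ps per instruction unpipelined, 5 stages):

```python
def time_between_instructions_pipelined(t_nonpipelined_ps, n_stages):
    """Ideal time between instruction completions in a pipeline."""
    return t_nonpipelined_ps / n_stages

print(time_between_instructions_pipelined(1000, 5))  # 200.0 ps
# Throughput improves 5x; the latency of any ONE instruction is still ~1000 ps.
```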
Example
❑ Assume the execution time for stages in a RISC-V datapath are
✓ 100ps for register read or write
✓ 200ps for other stages

Instr     Instr fetch  Register read  ALU op   Memory access  Register write  Total time
lw        200 ps       100 ps         200 ps   200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps   200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                  100 ps          600 ps
beq       200 ps       100 ps         200 ps                                  500 ps

❑ Compare pipelined datapath with single-cycle datapath


➢ Clock rates
➢ Execution time & speedup
Example solution
Single cycle (Tc = 800 ps):  Cycle 1: lw   Cycle 2: sw (+ waste)

Pipelined (Tc = 200 ps):
        Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
lw      IF      ID      EX      MEM     WB
sw              IF      ID      EX      MEM     WB
R-type                  IF      ID      EX      MEM     WB
                        IF      ID      EX      MEM     WB
                                IF      ID      EX      MEM     WB
        (the first 4 cycles are the pipeline's fill time)
❑ Time btw 1st and 5th instructions: single cycle = 3200ps (4 x 800ps) vs pipelined
= 800ps (4 x 200ps) → speedup = 4.
➢ Execution time for 5 instructions: 4000ps vs 1800ps ≈ 2.22 times speedup
→ Why isn't the speedup 5 (= number of stages)? What's wrong?
➢ Think of real programs which execute billions of instructions.
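Both figures, and the answer, can be checked numerically. A sketch using the stage delays from the table above; note that the single-cycle clock (800 ps) is only 4x the pipelined clock (200 ps) because the stages are not uniform, so even for very long programs the speedup approaches 4, not 5:

```python
stage_delays = [200, 100, 200, 200, 100]   # IF, ID, EX, MEM, WB in ps

single_tc = sum(stage_delays)   # 800 ps: single-cycle clock (lw path)
pipe_tc = max(stage_delays)     # 200 ps: pipelined clock (longest stage)

def exec_time(n_instr):
    single = n_instr * single_tc
    # fill time (stages - 1 cycles) + one completion per cycle afterwards
    pipelined = (len(stage_delays) + n_instr - 1) * pipe_tc
    return single, pipelined

s, p = exec_time(5)
print(s, p, round(s / p, 2))    # 4000 1800 2.22

s, p = exec_time(10**9)         # "billions of instructions"
print(round(s / p, 2))          # 4.0: the fill time is amortised away
```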
Symbolic representation of 5 stages

IF ID EX MEM WB

The ID and WB stages respectively read and write the same hardware element, Reg (the RegFile).

IF: Instruction Fetch (IMEM read) · ID: Instruction Decode (Reg read) · EX: Execute (ALU) · MEM: Memory Access (DMEM) · WB: Write back to register (Reg write)
Symbolic representation of pipelined
RISC-V datapath
[Pipeline diagram: the sequence add t0,t1,t2; or t3,t4,t5; slt t6,t0,t3; sw t0,4(t3); lw t0,8(t3); addi t2,t2,1 flowing through the pipeline. A row shows the resource use of one instruction over time; a column shows the resource use in a particular time slot. t_instruction = 1000 ps, t_cycle = 200 ps.]
Pipelined datapath design
[Datapath diagram: lw t0,8(t3), sw t0,4(t3), slt t6,t0,t3, or t3,t4,t5, and add t0,t1,t2 occupy the Instruction Fetch, Instruction Decode, ALU Execution, Memory Access, and Write Back stages respectively; the elements shown are PC, IMEM, RegFile, Branch Comp., Imm. Gen, ALU, and DMEM.]
❑ Think of the datapath as a linear sequence of stages, where each
stage operates on a different instruction.
➢ On any given cycle, up to 5 instructions will be at various points of execution.
➢ How can we operate the stages independently, i.e. move the current
instruction to the next stage before taking in the next instruction?
Pipeline registers
❑ Add state registers between each pipeline stage.
➢ To isolate information between cycles, hold data for each instruction in flight.

lw t0, 8(t3) sw t0, 4(t3) slt t6, t0, t3 or t3, t4, t5 add t0, t1, t2

Instruction Fetch Instruction Decode ALU Execution Memory Access Write Back

pc+4
+4 Reg[] pc alu
1 1
alu pc wb 2
DataD DMEM
0 0 1
pc+4 IMEM AddrD Addr wb
DataA Branch ALU DataR 0
0
AddrA Comp.
1 DataW
DataB
AddrB

Imm.
Gen

❖ Now, let’s check the flow of instructions through the pipeline cycle-by-cycle!
IF for Load

lw t0, 8(t3): Instruction Fetch

[Datapath diagram highlighting the IF stage]

➢ PC+4 is computed, stored back into the PC, and also stored in the IF/ID buffer.
➢ The instruction word is fetched from memory and stored in the IF/ID buffer
because it will be needed in the next stage.
ID for Load

lw t0, 8(t3): Instruction Decode

[Datapath diagram highlighting the ID stage]

➢ PC+4 is passed forward to the ID/EX buffer.
➢ Bits of the load instruction are taken from the IF/ID buffer while the next
instruction is being fetched.
➢ The 12-bit immediate is taken from the IF/ID buffer, sign-extended, then
stored in the ID/EX buffer for use in a later stage.
➢ The rs1 and rs2 values are fetched and stored in the ID/EX buffer.
EX for Load
lw t0, 8(t3): ALU Execution

[Datapath diagram highlighting the EX stage]

➢ The rs1 value is taken from the ID/EX buffer and passed to the ALU.
➢ The 32-bit sign-extended immediate is provided to the ALU as the second operand.
➢ The ALU result is stored in the EX/MEM buffer, for use as the memory address
in the next stage.
➢ The rs2 value is passed forward to the EX/MEM buffer (though it won't be needed).
MEM for Load
lw t0, 8(t3): Memory Access

[Datapath diagram highlighting the MEM stage]

➢ The ALU result is taken from the EX/MEM buffer and passed to the data memory
as the address; it is also passed on to the MEM/WB buffer.
➢ The rs2 value is passed from the EX/MEM buffer to the Write data port of the
data memory.
➢ The value on the Read data port of the data memory is stored in the MEM/WB buffer.
WB for Load
lw t0, 8(t3): Write Back

[Datapath diagram highlighting the WB stage]

➢ The value from data memory is selected and passed back to the register file.
➢ Problem: by this cycle, the write register number on AddrD comes from the
instruction currently being decoded, i.e. the wrong register number!
Corrected Datapath for Load
lw t0, 8(t3): Write Back

[Datapath diagram: the write register number now travels with the instruction
through the pipeline registers and returns to AddrD in the WB stage]
The problem is fixed by passing the write register number through the various inter-
stage buffers and feeding it back just in time → adding 5 more bits to each of the last three buffers.
Pipelined control signals
❑ Control signals are derived from the instruction and determined during ID,
as in the single-cycle implementation.
➢ As the instruction moves down the pipeline, its control signals move with it
→ extend the pipeline registers to include the control signals.
➢ Each stage uses some of the control signals

[Diagram: the control bits carried forward shrink as the instruction advances: 9 control bits, then 5, then 2 in successive pipeline registers]


RISC-V ISA support for pipelining
❑ What makes it easy
➢ All instructions are 32-bits
• Easier to fetch and decode in one cycle: fetch in the 1st stage and
decode in the 2nd stage
• c.f. x86: 1- to 17-byte instructions
➢ Few and regular instruction formats
• Can decode and read registers in one step
➢ Load/store addressing
• Can calculate address in 3rd stage, access memory in 4th stage

❑ What makes it hard?


➢ Pipeline hazards
Three Types of Pipeline Hazards
❑ A hazard is a situation in which a planned instruction cannot
execute in the “proper” clock cycle.
1. Structural hazard
• Two different instructions attempt to use the same resource at the
same time, while the hardware does not support multiple accesses.
2. Data hazard
• Attempt to use data before it is ready because instructions have data
dependency.
3. Control hazard
• Attempt to make a decision about program control flow before the
condition has been evaluated and the new PC target address
calculated.

❑ Pipeline hazards are serious problems that cannot be ignored


Structural hazard example
❑ RegFile must serve
▪ Up to 2 operand reads in the ID stage & up to 1 operand write in the WB stage
➢ Structural hazard occurs if RegFile HW does not support simultaneous
read/write!
[Pipeline diagram (time in clock cycles across, instruction sequence down): add t0,t1,t2; or t3,t4,t5; slt t6,t0,t3; sw t0,4(t3); lw t0,8(t3), each passing through IM, Reg, ALU, DM, Reg in successive cycles. In several cycles one instruction's Reg read overlaps another's Reg write.]
Data hazard example
❑ If the same register is written and read in one cycle
➢ WB must write value before ID reads new value
➢ Not a structural hazard, since separate ports allow simultaneous R/W.

[Pipeline diagram: the same instruction sequence; add t0,t1,t2 writes t0 in its Reg-write stage in the same cycle that sw t0,4(t3) reads t0 in its Reg-read stage, while the earlier reader slt t6,t0,t3 reads t0 before the write.]
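One way to make the timing concrete is to count cycles: in this 5-stage pipeline, instruction i (0-based issue order, no stalls) reads registers in cycle i + 2 (ID) and writes them in cycle i + 5 (WB). A minimal sketch, assuming the register file cannot write and then read within the same cycle (the function name is illustrative):

```python
def hazard_distance(writer_pos, reader_pos):
    """Cycles from the writer's WB to the reader's ID in an ideal 5-stage pipeline.

    Instruction i (0-based issue order, no stalls) is in ID in cycle i + 2 and
    in WB in cycle i + 5 (1-based cycles).  A result <= 0 means the read happens
    before, or in the same cycle as, the write, so without forwarding or a
    split-cycle register file the reader sees the stale value.
    """
    return (reader_pos + 2) - (writer_pos + 5)

# add t0,t1,t2 is at position 0; later instructions that touch t0:
print(hazard_distance(0, 2))  # -1: slt t6,t0,t3 reads before the write
print(hazard_distance(0, 3))  #  0: sw t0,4(t3) reads in the write cycle itself
print(hazard_distance(0, 4))  #  1: an instruction 4 slots later reads safely
```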
Control hazard example
❑ If the beq branch is taken, the wrong instructions would have been
fetched, as the branch decision is made only in the MEM stage.

[Pipeline diagram: beq passes through IM, Reg, ALU, DM, Reg; the branch outcome is ready only after the DM (MEM) stage. Inst 1 through Inst 3 are fetched regardless of the branch outcome! Only for Inst 4 and Inst 5 has the PC been updated to reflect the branch outcome.]
Summary
❑ Pipelined processor
➢ Speedup is due to increased throughput, latency does not decrease.
➢ Implemented by adding state registers to the single-cycle datapath.
➢ The pipeline registers are also extended to include the control signals.

❑ The basic idea of pipelining is easy, but the devil is in the details
➢ Hazard: a situation in which a planned instruction cannot execute in the
“proper” clock cycle
▪ Structural hazard
▪ Data hazard
▪ Control hazard
➢ Pipeline hazards are serious problems that cannot be ignored

❑ Next lecture: Hazard handling methods.
