Lec20 PDF
N. B. Dodge 9/15
Pipeline Architecture
A pipelined computer executes instructions concurrently.
Hardware units are organized into stages:
Execution in each stage takes exactly 1 clock period.
Stages are separated by pipeline registers that preserve and pass
partial results to the next stage.
[Figure: timeline (clock cycles) of the instruction sequence lw $t0, 16($a3); lw $t1, 32($a3); lw $t2, 48($a3) passing through the Instruc. Fetch, Reg. Fetch and ALU Process stages.]
$$S_P = \frac{ET_s}{ET_P} = \frac{ns}{s + (n - 1)} \approx s \quad \text{for } n \gg s$$
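The speedup relation can be checked numerically. The sketch below assumes that n is the number of instructions executed and s is the number of pipeline stages, each taking one clock period; the extracted slide text does not define these symbols, so that reading, like the function name `pipeline_speedup`, is an assumption made for illustration.

```python
def pipeline_speedup(n_instructions, n_stages):
    """Ratio of sequential to pipelined execution time, in clock cycles.

    Assumes every stage takes one clock and there are no stalls.
    """
    # Without pipelining, each instruction occupies all stages in turn.
    sequential_cycles = n_instructions * n_stages
    # With pipelining, the first instruction needs n_stages cycles and
    # each later instruction completes one cycle after the previous one.
    pipelined_cycles = n_stages + (n_instructions - 1)
    return sequential_cycles / pipelined_cycles


if __name__ == "__main__":
    for n in (1, 3, 100, 1_000_000):
        print(n, round(pipeline_speedup(n, 5), 4))
```

For a single instruction the ratio is 1 (no gain), and as n grows the ratio approaches the stage count s.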
Pipeline Stages
[Figure: clock-cycle diagram showing Instruction 1, Instruction 2 and Instruction 3 each passing through the IF, ID/RF, ALU, MEM and WB stages.]
Single-Cycle Datapath
[Figure: single-cycle datapath with PC, Instruction Memory, Control (Reg. Dest., Branch, Mem. Read, Mem. To Reg., ALU Op., Mem. Write, ALU Srce., Reg. Write), Reg. Block, Sign Extend, ALU, ALU Control and Data Memory.]
[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers; each register has a master side and a slave side. Note: Control lines and logic not shown for clarity. After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition.]
Stage 1: Instruction loaded into IF/ID register, PC ← PC + 4.
[Figure: pipelined datapath illustrating Stage 1.]
Stage 2: Instruction decoded, register data accessed, immediates sign-extended.
[Figure: pipelined datapath illustrating Stage 2.]
[Figure: pipelined datapath for the third pipeline stage; caption text not recovered.]
[Figure: pipelined datapath for the fourth pipeline stage; caption text not recovered.]
Stage 5: Result write-back to dest. register.
[Figure: pipelined datapath illustrating Stage 5.]
Lecture #20: The Pipeline MIPS Processor
Adding Control
Control information must be carried along as a part of
the instruction, since this information is required at
different stages of the pipeline.
This can be done by adding more inter-stage storage
register bits to forward control data yet to be used.
The result is very large inter-stage registers. For
example, the storage capacity required between the
instruction decode and ALU execution stages (ID/EX
register) is more than 120 bits.
The resulting processor with full control functionality
is shown on the next slide.
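To see how an inter-stage register can exceed 120 bits, the widths of the fields it carries can be added up. The field list below is an assumption based on the usual textbook five-stage MIPS layout, not read off the slide's diagram, so the total is only indicative; the names `ID_EX_FIELDS` and `register_width` are hypothetical.

```python
# Assumed contents of the ID/EX pipeline register (widths in bits).
# This layout is a textbook-style guess, not taken from the slide.
ID_EX_FIELDS = {
    "pc_plus_4": 32,          # incremented PC, needed for the branch target
    "read_data_1": 32,        # contents of register Rs
    "read_data_2": 32,        # contents of register Rt
    "sign_extended_imm": 32,  # bits 0-15 after sign extension
    "rt_field": 5,            # bits 16-20, possible destination register
    "rd_field": 5,            # bits 11-15, possible destination register
    "ex_control": 4,          # Reg. Dst., ALU Srce., ALU Op (2 bits)
    "mem_control": 3,         # Branch, Memory Read, Memory Write
    "wb_control": 2,          # Register Write, Memory/ALU Result select
}


def register_width(fields):
    """Total number of storage bits needed for the listed fields."""
    return sum(fields.values())


if __name__ == "__main__":
    print(register_width(ID_EX_FIELDS), "bits")
```

Under these assumptions the control bits are a small fraction of the register; the four 32-bit data fields account for most of the storage.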
[Figure: Full Pipeline Design with Control Lines. Control Decode drives the Branch, ALU Srce, ALU Op, Reg. Dst., Memory Read, Memory Write, Register Write and Memory/ALU Result lines, which are carried through the IF/ID, ID/EX, EX/MEM and MEM/WB registers.]
[Figures: a sequence of snapshots of the full pipeline design with control lines, one per clock cycle, each labeling the IF, ID/RF, EX, MEM and WB stages with the instruction they hold or with Idle. All five stages start Idle. The instructions visible in the snapshots include lw $t0, 20($t1), sub $t4, $t2, $t3, and $t7, $t5, $t6, an or instruction with destination $s2, and add $s5, $s3, $s4. One snapshot reads EX: and $t7, $t5, $t6; MEM: sub $t4, $t2, $t3; WB: lw $t0, 20($t1). A later snapshot reads WB: add $s5, $s3, $s4, and the last snapshot shows all stages Idle again.]
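The cycle-by-cycle snapshots can be reproduced with a small occupancy table. This is a minimal sketch that assumes a five-stage pipeline with no stalls or hazards; the operands of the or instruction are inferred from the register labels in the figures, and the names `STAGES` and `pipeline_schedule` are hypothetical.

```python
STAGES = ["IF", "ID/RF", "EX", "MEM", "WB"]


def pipeline_schedule(instructions):
    """Return one {stage: instruction or 'Idle'} dict per clock cycle.

    Assumes one instruction enters IF per cycle and nothing stalls.
    """
    total_cycles = len(STAGES) + len(instructions) - 1
    schedule = []
    for cycle in range(total_cycles):
        occupancy = {}
        for stage_index, stage in enumerate(STAGES):
            # The instruction in stage k entered the pipeline k cycles ago.
            instr_index = cycle - stage_index
            if 0 <= instr_index < len(instructions):
                occupancy[stage] = instructions[instr_index]
            else:
                occupancy[stage] = "Idle"
        schedule.append(occupancy)
    return schedule


program = [
    "lw $t0, 20($t1)",
    "sub $t4, $t2, $t3",
    "and $t7, $t5, $t6",
    "or $s2, $s1, $s0",   # operands inferred from the figure labels
    "add $s5, $s3, $s4",
]

if __name__ == "__main__":
    for number, occupancy in enumerate(pipeline_schedule(program), start=1):
        print(number, occupancy)
```

In the fifth cycle the table shows the lw in WB, the sub in MEM and the and in EX, which matches the snapshot captioned that way.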
Exercise 1
On the diagram on the next page, identify the
following:
1. Highlight all the control lines that must be active during a load
word instruction.
2. As in our exercise in Lecture 20, identify the decoder
locations.
3. The ID/EX Register interface stores the most bits of any of the
pipeline section interfaces. Approximately how many bits is
that, according to the diagram?
[Figure: full pipeline design with control lines, the diagram to mark up for Exercise 1.]
Hazards
Hazards occur because data required for executing the
current instruction may not be available.
An instruction in the register fetch cycle may need
data from a register whose value will be changed by an
instruction downstream but still in process in the
pipeline (in the ALU, memory/memory bypass or
writeback cycle).
Thus an upstream instruction could access a register
and get incorrect data because the register data has not
yet been updated by a downstream instruction.
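A hazard of this kind can be found by comparing the registers an instruction reads with the destination registers of the instructions just ahead of it. The sketch below is illustrative only: the `Instr` record and `find_data_hazards` are hypothetical names, and the window of three instructions is an assumption matching the three stages named above (ALU, memory, writeback). The sample program is the lw/add/sw sequence used in the stall example later in the lecture.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for reporting only
    dest: Optional[str]       # register written, or None (e.g. sw)
    sources: Tuple[str, ...]  # registers read


def find_data_hazards(program, window=3):
    """List (producer, consumer, register) triples that may conflict.

    A consumer is flagged when it reads a register written by one of
    the `window` instructions immediately ahead of it in the pipeline.
    """
    hazards = []
    for i, consumer in enumerate(program):
        for distance in range(1, window + 1):
            if i - distance < 0:
                break
            producer = program[i - distance]
            if producer.dest is not None and producer.dest in consumer.sources:
                hazards.append((producer.text, consumer.text, producer.dest))
    return hazards


program = [
    Instr("lw $2, 32($3)", "$2", ("$3",)),
    Instr("add $14, $6, $2", "$14", ("$6", "$2")),
    Instr("sw $15, 80($2)", None, ("$15", "$2")),
]

if __name__ == "__main__":
    for producer, consumer, register in find_data_hazards(program):
        print(f"{consumer!r} needs {register} from {producer!r}")
```

Both the add and the sw are flagged because each reads $2 while the lw that writes it is still in the pipeline.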
Hazards (2)
There are two types of hazards: data hazards and
control hazards.
Both occur because an instruction in the ID/RF stage of
the MIPS pipeline needs register data that will
shortly be updated by instructions in the EX,
MEM/Bypass, or WB stage.
Data hazards occur when an instruction needs register
contents for an arithmetic, logical, or memory instruction.
Control hazards occur when a branch instruction is
pending and the data necessary to initiate or bypass the
branch is not yet available, in the same sort of scenario.
[Figures: pipeline timelines (clock cycles) showing the Instruc. Fetch, Reg. Fetch and ALU Process stages, followed by a diagram of two instructions passing through the IF, ID/RF, ALU, MEM and WB stages; slide titles and instruction text not recovered.]
[Figure: forwarding hardware. A Forwarding Unit takes Rs, Rt, the EX/MEM Register Rd and the MEM/WB Register Rd as inputs and drives the Forward A and Forward B multiplexers at the ALU inputs. After David A. Patterson and John L. Hennessy, Computer Organization and Design, 2nd Edition.]
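The decision a forwarding unit makes can be sketched in a few lines. This follows the commonly taught textbook conditions and is not taken from the slide itself: the function names, the string labels for the multiplexer settings, and the use of register number 0 for $zero are assumptions for illustration.

```python
def select_operand_source(src_reg, ex_mem_rd, ex_mem_reg_write,
                          mem_wb_rd, mem_wb_reg_write):
    """Choose where one ALU operand should come from.

    Returns 'EX/MEM' or 'MEM/WB' when a newer value sits in that
    pipeline register, otherwise 'REG' (use the register block value).
    """
    # The most recent result takes priority, so check EX/MEM first.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REG"


def forwarding_unit(rs, rt, ex_mem_rd, ex_mem_reg_write,
                    mem_wb_rd, mem_wb_reg_write):
    """Return the (Forward A, Forward B) settings for operands Rs and Rt."""
    forward_a = select_operand_source(rs, ex_mem_rd, ex_mem_reg_write,
                                      mem_wb_rd, mem_wb_reg_write)
    forward_b = select_operand_source(rt, ex_mem_rd, ex_mem_reg_write,
                                      mem_wb_rd, mem_wb_reg_write)
    return forward_a, forward_b


if __name__ == "__main__":
    # Operand Rs = register 2 was just computed by the instruction now in MEM.
    print(forwarding_unit(rs=2, rt=5, ex_mem_rd=2, ex_mem_reg_write=True,
                          mem_wb_rd=9, mem_wb_reg_write=True))
```

Register 0 is excluded because $zero always reads as zero, so a pending write to it must never be forwarded.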
Stalls
Forwarding will not always solve the problems of data hazards.
For example, suppose an add instruction follows a load word (lw),
and the add involves the register that receives the memory data.
In this case, forwarding will not work.
The reason is that the data must be read from memory, and so it
will not be available until the end of the MEM cycle. Thus the
required data is not available for a forward, and the add
instruction, if it proceeds, will process the wrong data.
A solution to this problem is the stall.
A stall halts the instruction awaiting data, while the key
instruction (a lw in this case) proceeds to the end of the MEM
cycle, after which the desired data is available to the add.
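The stall rule described above can be expressed as a check on adjacent instructions. The sketch below is a simplified model, not the hardware logic: it inserts one "stall" marker whenever an instruction reads the register loaded by the lw directly ahead of it. The tuple layout and the name `insert_stalls` are hypothetical; the sample program is the sequence used in the timeline example of this lecture.

```python
def insert_stalls(program):
    """Return the issue order with a 'stall' after each load-use pair.

    Each program entry is (text, is_load, dest, sources).
    """
    scheduled = []
    previous = None
    for instr in program:
        text, _is_load, _dest, sources = instr
        # A load's data is only available at the end of its MEM cycle,
        # so an immediate consumer of the loaded register waits one cycle.
        if previous is not None and previous[1] and previous[2] in sources:
            scheduled.append("stall")
        scheduled.append(text)
        previous = instr
    return scheduled


program = [
    ("lw $2, 32($3)", True, "$2", ("$3",)),
    ("add $14, $6, $2", False, "$14", ("$6", "$2")),
    ("sw $15, 80($2)", False, None, ("$15", "$2")),
]

if __name__ == "__main__":
    for line in insert_stalls(program):
        print(line)
```

Only the add is delayed; the sw also reads $2, but by the time it needs the value the lw has moved far enough along for the data to be supplied without a further stall.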
[Figure: two timelines (clock cycles) for the sequence lw $2, 32($3); add $14, $6, $2; sw $15, 80($2), showing the Instruc. Fetch, Reg. Fetch and ALU Process stages of each instruction.]
With the delay, the lw result feeds the ALU input stage
of the add instruction, and the fetch stage of the sw.
Note that forwarding is still required (this time from
the MEM/WB interface, not the ALU output).
However, in addition to forwarding, an instruction that
follows a lw and uses the loaded register must also be
delayed for one clock cycle.
In either case, the wrong instructions are in the pipe and they must
be eliminated (flushed). How can this problem be prevented?
A few approaches to the problem are shown in the following slides.
[Figures: three diagrams of the ID/RF, ALU/EX, MEM/Bypass and WB stages: one labeled with the branch at the ALU/EX stage, one with a Branch Comparator, and one with a Branch History unit.]
Exercise 2
1. Explain forwarding in your own words.
2. Why doesn't forwarding always work? How can this
problem be solved?
3. Why could 2-bit dynamic branch prediction work to
ensure about a 1% error rate in branch prediction in
a subroutine that loops about 100 times before
completion? Hint: Assume that the subroutine is
called frequently, and that it always executes 100 or
more loop traversals before returning to the calling
program.
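The behavior of a 2-bit predictor on such a loop can be explored with a short simulation. This is a sketch under stated assumptions: the predictor is modeled as a saturating counter with states 0 to 3 that predicts "taken" in states 2 and 3, the loop branch is taken 100 times and then falls through once per call, and the names used are hypothetical.

```python
def simulate_two_bit_predictor(outcomes, state=0):
    """Run a 2-bit saturating counter over a list of branch outcomes.

    Returns (number of mispredictions, final counter state).
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions, state


LOOP_TRAVERSALS = 100
# One subroutine call: the loop branch is taken 100 times, then exits.
one_call = [True] * LOOP_TRAVERSALS + [False]

state = 0
per_call = []
for _ in range(5):
    missed, state = simulate_two_bit_predictor(one_call, state)
    per_call.append(missed)

if __name__ == "__main__":
    print(per_call)
    print(f"steady-state miss rate: {1 / len(one_call):.2%}")
```

After the first call warms up the counter, the single not-taken exit only moves it one step, so the next call's first branch is still predicted taken and each call costs one misprediction out of 101 branches.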
Summary
The pipeline approach to CPU design provides a significant speed
increase over a single-cycle design, up to several hundred percent.
The improvement is so dramatic that today, all general-purpose
processor units (now usually clustered in groups of 2, 4, 6, 8 or
more CPUs) are designed using a pipeline approach.
However, this performance improvement must be paid for with
increased (1) price, and (2) complexity. Pipelines introduce
processing problems, including:
Hazards
Incorrect branch prediction