CSE332 / EEE336 Computer Organization & Architecture Pipelining I
Rashadul Kabir
North South University
Summer 2020
Outline of this Lecture
Processor Implementation Styles
Pipelining
Processor Implementation Styles
Single Cycle Implementation
Performs each instruction in 1 clock cycle
Clock cycle must be long enough for the slowest instruction
Disadvantage: every instruction runs only as fast as the slowest instruction
Multi-Cycle Implementation
Breaks fetch/execute cycle into multiple steps
Performs 1 step in each clock cycle
Advantage: each instruction uses only as many cycles as it needs
Pipelined Implementation
Executes each instruction in multiple steps
Performs 1 step of each instruction in each clock cycle
Processes multiple instructions in parallel – assembly line
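A rough sketch of how the three styles compare on total execution time. The 200 ps step time, the per-class step counts, the instruction mix, and the function names are all illustrative assumptions, not figures from the lecture; hazards and latch overhead are ignored.

```python
# Hypothetical numbers: every step takes 200 ps; step counts per class are assumed.
STEP_PS = 200
STEPS = {"lw": 5, "sw": 4, "rtype": 4, "beq": 3}

def single_cycle_ps(program):
    # One long clock cycle per instruction, sized for the slowest class.
    cycle = max(STEPS.values()) * STEP_PS
    return len(program) * cycle

def multi_cycle_ps(program):
    # Each instruction uses only as many short cycles as it needs.
    return sum(STEPS[op] for op in program) * STEP_PS

def pipelined_ps(program, stages=5):
    # After the pipeline fills, one instruction completes per short cycle
    # (assuming no stalls).
    return (stages + len(program) - 1) * STEP_PS

program = ["lw", "rtype", "sw", "beq"] * 25  # 100 instructions, assumed mix

print(single_cycle_ps(program), multi_cycle_ps(program), pipelined_ps(program))
```

Under these assumptions the pipelined version wins because it overlaps instructions, not because any single instruction finishes sooner.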
Assembly line
Two important terms!
Throughput is the amount of processing that can be
accomplished during a given interval of time.
Latency is the time taken to complete a single task from start to finish.
Pipelining using Laundry Analogy
[Figure: timelines from 6 PM to 2 AM for laundry loads A, B, C, D in task order, first done one after another and then overlapped as a pipeline.]
- 4 loads of laundry in parallel
- no additional resources
- throughput increased by a factor of 4 (ideally)
- latency per load is the same
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
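The laundry timelines can be reproduced with a small schedule calculation. This is a sketch under assumptions the slide does not state: four steps per load (named here wash, dry, fold, put away) of 30 minutes each, which is consistent with the sequential timeline ending at 2 AM. The function names are hypothetical.

```python
STEPS = ["wash", "dry", "fold", "put away"]  # assumed step names
STEP_MIN = 30                                # assumed duration of every step
START = 18 * 60                              # 6 PM, in minutes after midnight

def schedule(loads, pipelined):
    """Return {load: (start_minute, finish_minute)}."""
    per_load = len(STEPS) * STEP_MIN
    out = {}
    for i, load in enumerate(loads):
        # Pipelined: a new load starts every step.
        # Sequential: a new load starts only when the previous one is done.
        offset = i * (STEP_MIN if pipelined else per_load)
        out[load] = (START + offset, START + offset + per_load)
    return out

def clock(minute):
    h, m = divmod(minute % (24 * 60), 60)
    return f"{h:02d}:{m:02d}"

for pipelined in (False, True):
    plan = schedule("ABCD", pipelined)
    print("pipelined" if pipelined else "sequential",
          {load: (clock(s), clock(f)) for load, (s, f) in plan.items()})
```

In both schedules each load takes the same 120 minutes from start to finish; only the time between completions changes.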
Pipelining Multiple Loads of Laundry: In Practice
[Figure: timelines from 6 PM to 2 AM for laundry loads A, B, C, D in task order.]
The slowest step decides throughput.
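A minimal sketch of why the slowest step decides throughput, assuming all steps advance in lockstep so that every step is stretched to the length of the slowest one. The stage durations below are made-up values chosen so both cases contain the same total work.

```python
def pipelined_time(stage_times, n_tasks):
    # All stages advance together, so the step time is the slowest stage.
    step = max(stage_times)
    return (len(stage_times) + n_tasks - 1) * step

def steady_state_throughput(stage_times):
    # Tasks completed per unit time once the pipeline is full.
    return 1 / max(stage_times)

balanced = [30, 30, 30, 30]    # hypothetical minutes per step
unbalanced = [20, 40, 30, 30]  # same total work, one slow step

print(pipelined_time(balanced, 4), pipelined_time(unbalanced, 4))
```

With these numbers the unbalanced pipeline needs 280 minutes for four loads instead of 210, even though each load contains the same 120 minutes of work.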
Pipelining
Pipelining exploits the potential of parallelism among
instructions. This parallelism is called instruction-level
parallelism (ILP).
Pipelining does not reduce the latency of a single task; it
increases the throughput of the entire workload.
The pipeline rate is limited by the slowest (longest) pipeline stage.
Potential speedup = number of pipeline stages
Unbalanced pipe-stage lengths reduce speedup.
The time to “fill” the pipeline and the time to “drain” it, when there is
slack in the pipeline, reduce speedup.
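The effect of fill and drain time on speedup can be sketched with a simple model: k balanced stages, n tasks, no stalls, and no latch overhead. These are simplifying assumptions, and `speedup` is a hypothetical helper name.

```python
def speedup(k, n):
    """Speedup of a k-stage pipeline over unpipelined execution for n tasks."""
    unpipelined = n * k      # each task takes k stage-times, one after another
    pipelined = k + n - 1    # k stage-times for the first task, then one per task
    return unpipelined / pipelined

for n in (1, 5, 100, 10_000):
    print(n, round(speedup(5, n), 3))
```

For a single task there is no gain at all, and the speedup approaches the number of stages only as the number of tasks grows large.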
Ideal Pipelining
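More Realistic Pipeline: Throughput
Nonpipelined version with delay T
BW = 1/(T+S) where S = latch delay
k-stage pipelined version with delay T/k per stage
BW = 1/(T/k+S)
[Figure: one block of combinational logic with delay T ps, compared with the same logic split into stages of T/k ps each.]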
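A small sketch of the throughput model, assuming the logic delay T is split evenly into k stages and each stage ends in a latch with delay S, so the stage time is T/k + S. The values T = 1000 ps and S = 50 ps are hypothetical.

```python
def bandwidth(T, S, k=1):
    """Results per picosecond for a k-stage pipeline (k=1 is the nonpipelined case)."""
    return 1.0 / (T / k + S)

T, S = 1000.0, 50.0  # hypothetical logic delay and latch delay, in ps

for k in (1, 2, 5, 10, 20):
    gain = bandwidth(T, S, k) / bandwidth(T, S)
    print(f"k={k:2d}  stage time={T / k + S:7.1f} ps  speedup={gain:5.2f}")
```

Because the latch delay S is paid in every stage, the speedup stays below k and the gap widens as the stages get shorter.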
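More Realistic Pipeline: Cost
Nonpipelined version with combinational cost G
Cost = G+L where L = latch cost
k-stage pipelined version with G/k gates per stage
Cost = G+Lk
[Figure: one block of G gates, compared with the same logic split into stages of G/k gates each.]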
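One way to see the trade-off is to weigh cost against throughput. The sketch below assumes a k-stage pipeline costs G + L*k (one latch per stage) and has a stage time of T/k + S, then searches for the k that minimises cost divided by throughput. All numeric values and function names are hypothetical.

```python
def cost(G, L, k=1):
    # Combinational logic plus one latch per stage.
    return G + L * k

def stage_time(T, S, k=1):
    # Logic delay per stage plus the latch delay.
    return T / k + S

def cost_per_throughput(G, L, T, S, k):
    # Throughput is 1 / stage_time, so cost / throughput = cost * stage_time.
    return cost(G, L, k) * stage_time(T, S, k)

G, L, T, S = 1000, 50, 1000.0, 50.0  # hypothetical gate counts and delays (ps)

best_k = min(range(1, 101), key=lambda k: cost_per_throughput(G, L, T, S, k))
print(best_k)
```

With these inputs the minimum falls at k = 20; adding stages beyond that point increases latch cost faster than it improves throughput.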
Instruction execution overview
Executing a MIPS instruction can take up to five steps.
Datapath broken into 5 stages
Each stage has its own functional units.
Each stage can execute in .2 ns. Is this the right partitioning?
Why not 4 or 6?
Just like a multi-cycle implementation.
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back
[Figure: MIPS datapath divided into the five stages IF, ID, EX, MEM, WB; one multiplexer is marked "ignore for now".]
[Figure: the same datapath with pipeline registers between the stages (signals labelled PCF, IRD, PCE+4, AE, ImmE, nPCM, AoutM, BM, MDRW, AoutW); logic of delay T ps is split into stages of T/k ps each.]
All instruction classes must follow the same path and timing
through the pipeline stages.
Any performance impact?
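One way to approach the question is to list which stages each instruction class actually needs. The stage usage below follows the usual textbook treatment of MIPS and is an assumption here, as are the names `USES` and `idle_stages`.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Assumed stage usage per instruction class (not stated on the slide).
USES = {
    "lw":    {"IF", "ID", "EX", "MEM", "WB"},
    "sw":    {"IF", "ID", "EX", "MEM"},
    "rtype": {"IF", "ID", "EX", "WB"},
    "beq":   {"IF", "ID", "EX"},
}

def idle_stages(op):
    """Stages the instruction passes through without doing useful work."""
    return [s for s in STAGES if s not in USES[op]]

for op in USES:
    print(op, "idle in:", idle_stages(op) or "none")
```

Every class still occupies all five stages, so the latency of the shorter classes grows, while the pipeline can still accept one new instruction per cycle.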
[Figure: a lw instruction advancing through the pipelined datapath one stage per clock cycle (instruction fetch, instruction decode, execution, memory, write back), with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers between the stages.]
[Figure: lw $10, 20($1) followed by sub $11, $2, $3 moving through the pipelined datapath over clock cycles 1 to 6.]
        t0    t1    t2    t3    t4    t5
Inst0   IF    ID    EX    MEM   WB
Inst1         IF    ID    EX    MEM   WB
Inst2               IF    ID    EX    MEM   WB
Inst3                     IF    ID    EX    MEM   WB
Inst4                           IF    ID    EX    MEM
                                      IF    ID    EX
                                            IF    ID
                                                  IF
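The staggered diagram above can be generated programmatically. This is a minimal sketch assuming an ideal five-stage pipeline with no stalls, where instruction i enters stage s in cycle i + s; `operation_view` is a hypothetical helper name.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def operation_view(n_instructions):
    """One row per instruction; entry c is the stage occupied in cycle c ('' if none)."""
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows

for i, row in enumerate(operation_view(5)):
    print(f"Inst{i}", " ".join(f"{cell:<3}" for cell in row))
```

Each row is the same five-stage sequence shifted one cycle to the right of the row above it.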
Illustrating Pipeline Operation: Resource View
      t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
IF    I0   I1   I2   I3   I4   I5   I6   I7   I8   I9   I10
ID         I0   I1   I2   I3   I4   I5   I6   I7   I8   I9
EX              I0   I1   I2   I3   I4   I5   I6   I7   I8
MEM                  I0   I1   I2   I3   I4   I5   I6   I7
WB                        I0   I1   I2   I3   I4   I5   I6
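The resource view answers a different question: which instruction occupies each stage in a given cycle. The sketch below assumes an ideal five-stage pipeline with no stalls; `resource_view` and `first_full_cycle` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def resource_view(n_instructions, n_cycles):
    """Map each stage to the instruction index it holds in every cycle (None if empty)."""
    table = {}
    for s, stage in enumerate(STAGES):
        # Stage s holds instruction c - s in cycle c, if that instruction exists.
        table[stage] = [c - s if 0 <= c - s < n_instructions else None
                        for c in range(n_cycles)]
    return table

def first_full_cycle(table):
    """First cycle in which every stage is busy, or None if that never happens."""
    n_cycles = len(table[STAGES[0]])
    for c in range(n_cycles):
        if all(table[stage][c] is not None for stage in STAGES):
            return c
    return None

view = resource_view(11, 11)
for stage in STAGES:
    print(f"{stage:<4}", " ".join("   " if i is None else f"I{i:<2}" for i in view[stage]))
print("pipeline full from cycle", first_full_cycle(view))
```

The cycles before the first full one are the fill time: some stages are idle there, which is why short instruction sequences see less than the ideal speedup.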
Suggested readings
Chapter 4, Computer Organization and Design (Fifth
Edition) - D. A. Patterson and J. L. Hennessy
Section 6.2, Computer Architecture and Implementation –
H. G. Cragon
Section 7.8, Digital Design and Computer Architecture (2nd
edition) – D. Harris, S. Harris
Thank you!