Comp Architecture Chapter 4 - Pipelining
Computer Architecture - II
Pipelining
Parallel processing
A Processor Unit
A memory unit
[Figure: sequential laundry timeline. Loads A to D in task order; each load takes 30 + 40 + 20 = 90 min, and the loads run back to back.]
This operator scheduled his loads to be delivered to the laundry every 90
minutes, which is the time required to finish one load. In other words, he
will not start a new task until the previous task is done.
The process is sequential. Sequential laundry takes 6 hours for 4 loads.
Efficiently scheduled laundry: Pipelined Laundry
[Figure: pipelined laundry timeline, 6 PM to midnight. Loads A to D in task order; segment lengths 30, 40, 40, 40, 40, 20 min.]
Another operator asks for loads to be delivered to the laundry every 40
minutes, so a new load starts while the previous ones are still in progress.
Pipelined laundry takes 3.5 hours for 4 loads.
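The 6 hour and 3.5 hour figures can be checked with a small sketch. The function names are hypothetical, and the model assumes a load may wait between stages until the next stage is free, which is a simplification of the schedule in the figure.

```python
def sequential_minutes(stage_minutes, n_loads):
    """Each load finishes every stage before the next load starts."""
    return n_loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, n_loads):
    """Finish time when a load enters a stage as soon as the stage is free."""
    stage_free_at = [0] * len(stage_minutes)  # time each stage becomes free
    for _ in range(n_loads):
        t = 0  # time at which this load is ready for its next stage
        for j, minutes in enumerate(stage_minutes):
            start = max(t, stage_free_at[j])
            t = start + minutes
            stage_free_at[j] = t
    return stage_free_at[-1]


stages = [30, 40, 20]  # stage times in minutes, taken from the laundry figure
print(sequential_minutes(stages, 4) / 60)  # hours, sequential
print(pipelined_minutes(stages, 4) / 60)   # hours, pipelined
```

With the stage times from the slides this gives 360 minutes sequentially and 210 minutes pipelined, matching the 6 hours and 3.5 hours stated above.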
Pipelining Facts
Multiple tasks operating simultaneously
Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup
[Figure: pipelined laundry timeline, 6 PM to 9 PM, loads A to D in task order, segment lengths 30, 40, 40, 40, 40, 20 min. The washer waits for the dryer for 10 minutes.]
Building a Car
Unpipelined: Start and finish a job before moving to the next
Parallelism = 1 car
Latency = 24 hrs.
Throughput = 1 car per 24 hrs.
[Figure: jobs vs. time; three cars built one after another, 24 hrs. each]
Latency: the amount of time that a single operation takes to execute
Throughput: the rate at which operations get executed (generally
expressed as operations/second or operations/cycle)
The Assembly Line
Pipelined: Break the job into smaller stages
Stages: Eng. (A), Body (B), Paint (C), 8 hrs. each
Parallelism = 3 cars
Latency = 24 hrs.
Throughput = 1 car per 8 hrs.
[Figure: jobs vs. time; each car passes through Eng., Body, Paint, and a new car starts every 8 hrs. (3X)]
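The car numbers can be reproduced with a short sketch. The function name and the use of exact fractions are illustrative choices, and the sketch assumes one job per stage at a time with a new job starting every slowest-stage interval.

```python
from fractions import Fraction


def line_metrics(stage_hours):
    """Return (latency, unpipelined throughput, pipelined throughput)."""
    latency = sum(stage_hours)                  # one job passes through every stage
    unpipelined = Fraction(1, latency)          # jobs per hour, one job at a time
    pipelined = Fraction(1, max(stage_hours))   # limited by the slowest stage
    return latency, unpipelined, pipelined


latency, slow, fast = line_metrics([8, 8, 8])   # Eng., Body, Paint
print(latency, slow, fast, fast / slow)
```

Latency stays at 24 hours in both cases; only the throughput changes, from 1/24 to 1/8 cars per hour.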
In a computer
Unpipelined: Start and finish a job before moving to the next
[Figure: jobs vs. time; each instruction goes through FET, DEC, EXE before the next one starts]
In a computer
Pipelined: Break the job into smaller stages
Stages: FET (A), DEC (B), EXE (C)
[Figure: jobs vs. time; instructions I1, I2, I3 enter the pipeline on cycles 1, 2, 3 and overlap in the FET, DEC, EXE stages]
In a computer
Pipelined: Break the job into smaller stages
Stages: FET (A), DEC (B), EXE (C)
Clock Speed = 1/1ns = 1 GHz
[Figure: jobs vs. time; instructions I1, I2, I3 enter the pipeline on cycles 1, 2, 3. One cycle is 1 ns; one instruction spans 3 ns.]
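The overlap shown in the figure can be sketched as a cycle-by-cycle occupancy table. The function name is hypothetical, and the sketch assumes an ideal pipeline with no stalls, where instruction i enters FET on cycle i.

```python
STAGES = ["FET", "DEC", "EXE"]


def occupancy(n_instr, stages=STAGES):
    """For each cycle, map stage name -> instruction label (ideal pipeline)."""
    k = len(stages)
    table = []
    for cycle in range(1, n_instr + k):          # cycles 1 .. n + k - 1
        row = {}
        for s, name in enumerate(stages):
            instr = cycle - s                     # instruction currently in stage s
            if 1 <= instr <= n_instr:
                row[name] = f"I{instr}"
        table.append(row)
    return table


for cycle, row in enumerate(occupancy(3), start=1):
    print(f"Cycle {cycle}: {row}")
```

With a 1 ns cycle, three instructions finish after 5 cycles (5 ns) here, against 3 x 3 ns = 9 ns when each instruction must finish before the next one starts.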
Pipelining
Stage 1 Stage 2
Clocks and Latches
[Figure: Stage 1 -> latch (L) -> Stage 2 -> latch (L); the latches are driven by the clock Clk]
[Figure: Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4]
Example
Assume a 2 ns flip-flop delay
Characteristics Of Pipelining
Decomposes a sequential process into segments.
stage k
All pipe stages take the same amount of time; called
k segments
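Pipeline Performance
n: number of instructions (equivalent to the number of loads in the laundry example)
k: number of stages in the pipeline (washing, drying and folding in the laundry example)
tp: clock cycle (the slowest task time)
Tk: total time
Tk = (k + (n - 1)) * tp
T1 = n * k * tp
Speedup = T1 / Tk = n * k * tp / ((k + (n - 1)) * tp) = n * k / (k + n - 1)
As n grows large, Speedup approaches k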
Speedup
Consider a k-segment pipeline operating on n data
sets. (In the laundry example, k = 3 and n = 4.)
S = k * n / (k + n - 1 )
When n is much larger than k, S approaches k (S ~ k)
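The step from the speedup formula to S ~ k can be written out as a short derivation. It uses only the symbols k and n defined above and assumes k stays fixed while n grows.

```latex
S = \frac{k\,n}{k + n - 1}
  = \frac{k}{1 + \dfrac{k - 1}{n}}
  \;\longrightarrow\; k
  \quad \text{as } n \to \infty
```

Dividing numerator and denominator by n shows that the fill and drain overhead, the (k - 1)/n term, shrinks as more data sets pass through the pipeline.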
Example: k = 4 segments, n = 100 tasks, tp = 20 ns
Pipelined System
(k + n - 1) * tp = (4 + 99) * 20 = 2060 ns
Non-Pipelined System
n * k * tp = 100 * 80 = 8000 ns
Speedup
Sk = 8000 / 2060 = 3.88
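The arithmetic in this example can be checked with a minimal sketch. The function names are hypothetical, and the formulas assume every segment takes exactly one clock cycle tp.

```python
def pipelined_time(k, n, tp):
    """Total time for n tasks on a k-segment pipeline with clock cycle tp."""
    return (k + n - 1) * tp


def non_pipelined_time(k, n, tp):
    """Total time when each task takes k * tp and tasks do not overlap."""
    return n * k * tp


def speedup(k, n):
    """Ratio of non-pipelined to pipelined time; tp cancels out."""
    return k * n / (k + n - 1)


k, n, tp = 4, 100, 20  # values from the example, tp in ns
print(pipelined_time(k, n, tp), non_pipelined_time(k, n, tp))
print(round(speedup(k, n), 2))
```

The same `speedup` function gives 2.0 for k = 3 and n = 4. The laundry example falls short of that (6 hours against 3.5 hours) because its three stages are not equally long.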
Ai * Bi + Ci for i = 1, 2, 3, ..., 7
Example of Pipelining
The sub-operations performed in each
segment of the pipeline are as follows:
R1 <- Ai, R2 <- Bi
R3 <- R1 * R2, R4 <- Ci
R5 <- R3 + R4
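Example of Pipelining
Segment 1: R1 <- Ai, R2 <- Bi (input Ai and Bi)
Segment 2: R3 <- R1 * R2, R4 <- Ci (multiply and input Ci)
Segment 3: R5 <- R3 + R4 (add Ci to product)
[Figure: Ai and Bi load R1 and R2; the multiplier output loads R3 while Ci loads R4; the adder output loads R5]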
Content of registers in pipeline example
Columns: clock pulse number; Segment 1 (R1, R2); Segment 2 (R3, R4); Segment 3 (R5)
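The register contents per clock pulse can be generated with a small simulation of the three segments. This is a sketch under stated assumptions: the function name and the input values are made up for illustration, and empty registers are shown as None.

```python
def simulate(A, B, C):
    """Simulate R1..R5 for Ai * Bi + Ci, one row per clock pulse."""
    n = len(A)
    r1 = r2 = r3 = r4 = r5 = None
    rows, results = [], []
    for pulse in range(1, n + 3):            # k + n - 1 = n + 2 pulses for k = 3
        # Update the last segment first so each segment reads the values
        # that the previous segment held before this pulse.
        r5 = r3 + r4 if r3 is not None else None                     # Segment 3
        if r1 is not None:
            r3, r4 = r1 * r2, C[pulse - 2]                            # Segment 2
        else:
            r3 = r4 = None
        r1, r2 = (A[pulse - 1], B[pulse - 1]) if pulse <= n else (None, None)  # Segment 1
        rows.append((pulse, r1, r2, r3, r4, r5))
        if r5 is not None:
            results.append(r5)
    return rows, results


# Illustrative inputs only; the slide's actual values are not in the source.
A, B, C = [1, 2, 3, 4, 5, 6, 7], [2] * 7, [10] * 7
rows, results = simulate(A, B, C)
for row in rows:
    print(row)
```

For seven data sets the simulation runs for 9 clock pulses: the first result appears in R5 on pulse 3, and one more result follows on every pulse after that.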
Ai*Bi + Ci*Di+ Ei
is executed using a pipeline
Arithmetic Pipeline
Arithmetic has been an important part of computing from its early
days, and arithmetic operations consume much of the time within
the arithmetic and logic unit.
X = A * 2^a
Y = B * 2^b
Floating-point addition can be carried out in 4 simple
sub-operations
X = 0.9832 * 10^3
Y = 0.8929 * 10^2
[Figure: Arithmetic Pipeline for Floating Point Adder. Inputs are the exponents a, b and the mantissas A, B; registers (R) separate the segments.]
Segment 1: Compare exponents by subtraction
Segment 2: Choose exponent; align mantissas
Segment 3: Add or subtract mantissas
Segment 4: Normalize result; adjust exponent
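The four segments can be traced on the decimal example X = 0.9832 * 10^3, Y = 0.8929 * 10^2 with a short sketch. The function name is hypothetical. It uses Python's decimal module so that the mantissa shifts are exact, and it assumes normalized mantissas lie in [0.1, 1).

```python
from decimal import Decimal


def fp_add(mx, ex, my, ey):
    """Add mx * 10^ex and my * 10^ey; return (mantissa, exponent)."""
    # Segment 1: compare exponents by subtraction
    diff = ex - ey
    # Segment 2: choose the larger exponent and align the other mantissa
    if diff >= 0:
        exp = ex
        my = my.scaleb(-diff)      # shift the mantissa right by diff digits
    else:
        exp = ey
        mx = mx.scaleb(diff)
    # Segment 3: add the mantissas
    m = mx + my
    # Segment 4: normalize the result and adjust the exponent
    while abs(m) >= 1:
        m = m.scaleb(-1)
        exp += 1
    while m != 0 and abs(m) < Decimal("0.1"):
        m = m.scaleb(1)
        exp -= 1
    return m, exp


print(fp_add(Decimal("0.9832"), 3, Decimal("0.8929"), 2))
```

Here Y is aligned to 0.08929 * 10^3, the mantissa sum is 1.07249, and normalization gives 0.107249 * 10^4.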
Instruction Pipeline
time.
Data dependency conflicts
occur when an instruction depends on the result of a
previous instruction that is not yet available
Branch difficulties
arise from branch and other instructions that change the
value of the PC
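A data dependency of this kind can be spotted by comparing the registers an instruction reads with the registers recent instructions write. The sketch below is illustrative only: the instruction format (opcode, destination, sources), the function name, and the `window` of 3 earlier instructions are assumptions, not something the slides define.

```python
def find_data_dependencies(program, window=3):
    """Return (producer index, consumer index, register) for each dependency.

    window is an assumed number of earlier instructions that may still be
    in the pipeline when the consumer needs its operands.
    """
    conflicts = []
    for j, (_, _, sources) in enumerate(program):
        for i in range(max(0, j - window), j):
            dest = program[i][1]
            if dest in sources:
                conflicts.append((i, j, dest))
    return conflicts


# Hypothetical three-instruction program: (opcode, destination, sources)
program = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R1", "R5")),   # reads R1, which ADD has not yet produced
    ("MUL", "R6", ("R7", "R8")),
]
print(find_data_dependencies(program))
```

In this program only the SUB depends on an earlier result, so a single conflict on R1 is reported.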
Four-segment CPU pipeline to overcome
pipeline conflicts
Segment 1: Fetch instruction from memory
Segment 2: Decode instruction and calculate effective address
Branch? If yes: update PC and empty pipe. If no: continue to Segment 3
Segment 3: Fetch operand from memory
Segment 4: Execute instruction
Interrupt? If yes: interrupt handling, then update PC and empty pipe. If no: continue with the next instruction
Timing of Instruction Pipeline
Step:           1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction: 1  FI  DA  FO  EX
             2      FI  DA  FO  EX
   (Branch)  3          FI  DA  FO  EX
             4              FI  --  --  FI  DA  FO  EX
             5                  --  --  --  FI  DA  FO  EX
             6                                  FI  DA  FO  EX
             7                                      FI  DA  FO  EX
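The step numbers in the timing table follow a simple rule that can be sketched in code. The function names are hypothetical. The model assumes a single branch whose successor is fetched again only after the branch has finished EX; it does not show the discarded fetch of instruction 4 at step 4.

```python
SEGMENTS = ["FI", "DA", "FO", "EX"]


def fetch_steps(n_instr, branch_at):
    """Step at which each instruction (1-based) starts its useful FI."""
    penalty = len(SEGMENTS) - 1   # steps lost while the branch completes
    return [i + (penalty if i > branch_at else 0)
            for i in range(1, n_instr + 1)]


def total_steps(n_instr, branch_at):
    """Step at which the last instruction leaves the EX segment."""
    return fetch_steps(n_instr, branch_at)[-1] + len(SEGMENTS) - 1


print(fetch_steps(7, branch_at=3))
print(total_steps(7, branch_at=3))
```

For 7 instructions with a branch at instruction 3 this reproduces the table: instruction 4 is fetched again at step 7 and instruction 7 completes at step 13, three steps later than the 10 steps an uninterrupted four-segment pipeline would need.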
The four segments illustrated in the table above have the following
meanings: