CSC 313 Module 3: Pipelining
Module 3
CPU OPERATION
Pipelining
What is Pipelining?
The Laundry Analogy
If we do laundry sequentially...
[Figure: laundry done sequentially; Time runs from 6 PM to 2 AM in 30-minute intervals, Task Order lists loads A, B, C, D, and each load finishes completely before the next one starts.]
[Figure: the same four loads A, B, C, D done in pipelined fashion; their stages overlap in time, so all the laundry finishes much earlier (the time axis now spans only a few hours).]
What is Pipelining?
◼ The process of creating a queue of instructions that are simultaneously being fetched, decoded, and executed, so that several instructions are in different stages of processing at the same time.
Example
[Figure: four sample instructions (Instruction 1 to Instruction 4), executed linearly; each instruction runs to completion before the next one begins.]
With pipelining, the five stages IF (instruction fetch), ID (instruction decode), EX (execute), M (memory access), and W (write back) of successive instructions overlap:

Cycle:    1   2   3   4   5   6   7   8
Instr 1:  IF  ID  EX  M   W
Instr 2:      IF  ID  EX  M   W
Instr 3:          IF  ID  EX  M   W
Instr 4:              IF  ID  EX  M   W
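A minimal sketch (Python, not from the slides) that prints the chart above; the helper name print_pipeline and the stage list are illustrative assumptions.

```python
# Minimal sketch (not from the slides): print the overlapped pipeline
# chart above for n instructions and the stages IF ID EX M W.

STAGES = ["IF", "ID", "EX", "M", "W"]

def print_pipeline(n):
    for i in range(n):
        # Instruction i+1 enters one cycle after instruction i, so its
        # row of stages is shifted one column to the right.
        row = ["   ."] * i + [f"{s:>4}" for s in STAGES]
        print(f"Instr {i + 1}:" + "".join(row))

print_pipeline(4)
```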
Instruction Decode
Execution
Memory and IO
Write Back
Advantages
◼ Several instructions are processed at once, so more instructions complete per unit of time (higher throughput).
Disadvantages
It's Not That Easy for Computers
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
❑ Structural hazards: two instructions need the same hardware resource in the same cycle
❑ Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
❑ Control hazards: the instruction to fetch next is not known until a branch is resolved
Stalling
[Diagram: dependent instructions are held back; each waits one or more STALL cycles before proceeding through IF ID EX M WB.]
The speedup obtained from a design modification is

S = (T_wo - T_w) / T_w × 100%        (1)

where T_wo is the execution time without the modification and T_w is the execution time with it.
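A minimal sketch (Python, not from the slides) of how stall cycles inflate the ideal pipeline time; the function pipeline_cycles and the chosen values of n, k, and stalls are illustrative assumptions.

```python
# Minimal sketch (not from the slides): cycles needed by an ideal k-stage
# pipeline for n instructions, and the extra cost of hazard stalls.

def pipeline_cycles(n, k, stalls=0):
    # The first instruction takes k cycles; each later one completes one
    # cycle after its predecessor, plus any stall cycles inserted for hazards.
    return k + (n - 1) + stalls

n, k = 100, 5
print(pipeline_cycles(n, k))             # 104 cycles with no hazards
print(pipeline_cycles(n, k, stalls=30))  # 134 cycles if hazards force 30 stalls
```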
Example
Given
o the clock period, τ,
o the average number of clock cycles per instruction, CPI, and
o the number of instructions executed, IC,
the execution time is T = IC × CPI × τ.
Example
Suppose we wish to estimate the speedup obtained by
replacing a CPU having an average CPI of 5 with
another CPU having an average CPI of 3.5, with the
clock period increased from 100 ns to 120 ns.
S = (5 × 100 − 3.5 × 120) / (3.5 × 120) × 100% ≈ 19%
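A minimal sketch (Python, not from the slides) of the same calculation, assuming T = IC × CPI × τ with the instruction count IC identical on both CPUs so that it cancels:

```python
# Minimal sketch (not from the slides): speedup of eq. (1) computed from
# CPI and clock period; the instruction count cancels because it is the
# same program on both CPUs.

def speedup_percent(cpi_old, tau_old, cpi_new, tau_new):
    t_old = cpi_old * tau_old   # time per instruction on the old CPU
    t_new = cpi_new * tau_new   # time per instruction on the new CPU
    return (t_old - t_new) / t_new * 100

print(round(speedup_percent(5, 100, 3.5, 120)))  # 19 (percent)
```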
Exercise
1. Calculate the speedup that can be expected if a 200 MHz Pentium chip is replaced with a 300 MHz Pentium chip, if all other parameters remain unchanged.
Measures of Performance of Pipelined Architectures
✓Parallel time
✓Speedup
✓Efficiency
✓Throughput
Measures of Performance
For a program whose serial (non-parallelizable) fraction is f = 0.1:

S = 1 / (0.1 + 0.9/10) ≈ 5.3   with p = 10 processors
S = 1 / (0.1 + 0.9/∞) = 10   with p = ∞ processors
Efficiency
o the ratio of speedup to the number of
processors used.
E = 5.3 / 10 = 0.53, or 53%
Measures of Performance
▪ If we double the number of processors to 20, then the speedup
increases to 6.9 but the efficiency reduces to 34%.
Examples (Throughput/Performance)
Amdahl's Law
◼ Mainly used to predict the theoretical maximum speedup when a program is processed using multiple processors.
◼ In general, S = 1 / (f + (1 - f)/p), where f is the serial fraction of the program and p is the number of processors; this is the formula applied in the example above.
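A minimal sketch (Python, not from the slides) of Amdahl's Law, reproducing the figures from the Measures of Performance slides (serial fraction f = 0.1):

```python
# Minimal sketch (not from the slides): Amdahl's Law speedup and
# efficiency for a program with serial fraction f run on p processors.

def speedup(f, p):
    return 1 / (f + (1 - f) / p)

def efficiency(f, p):
    return speedup(f, p) / p

f = 0.1
for p in (10, 20):
    print(p, round(speedup(f, p), 1), f"{efficiency(f, p):.0%}")
# 10 5.3 53%
# 20 6.9 34%
print(round(speedup(f, float("inf")), 1))  # 10.0, the theoretical maximum
```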
Assignment/Seminar Topics
• Superscalar
• Super-pipelining
• In-Order and Out-of-Order Execution
• VLIW
• Vector vs. Scalar Processors
• CISC vs. RISC
Superscalar Machines
▪ Machines with several separate execution units, able to issue more than 1 instruction per cycle (3 or 4 is common).
Superpipelined
◼ Many pipeline stages need less than half a clock cycle, so such stages can be split into sub-stages and driven by a faster internal clock.

[Diagram: four instructions flowing through the sub-stages F1 F2 D1 D2 E1 E2 of a superpipelined processor, overlapped more finely than in a simple pipeline.]
Superscalar vs. Superpipelining
[Diagram: a superscalar machine runs several fetch/decode/execute pipelines side by side, issuing multiple instructions per cycle, while a superpipelined machine pushes instructions through the finer sub-stages F1 F2 D1 D2 E1 E2 of a single pipeline.]
Superscalar vs. Superpipeline
Superscalar vs. Superpipelining
In-Order Issue In-Order Completion
◼ Issue instructions in the order they occur
◼ Not very efficient
◼ May fetch more than one instruction at a time
◼ Issue stalls when a functional-unit conflict or a data dependency is detected
In-Order Issue In-Order Completion
(Diagram)
In-Order Issue Out-of-Order Completion
◼ Output dependency
❑ R3 := R3 + R5;  (I1)
❑ R4 := R3 + 1;   (I2)
❑ R3 := R5 + 1;   (I3)
❑ I2 depends on the result of I1 - a (true) data dependency
❑ If I3 completes before I1, the final value left in R3 will be wrong - an output (write-after-write) dependency
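A minimal sketch (Python, not from the slides) of the three instructions above acting on a small register file, showing why completion order matters; the initial register values are arbitrary assumptions.

```python
# Minimal sketch (not from the slides): the three instructions above run
# against a tiny register file, to show why completion order matters.

def I1(R): R["R3"] = R["R3"] + R["R5"]   # I1: R3 := R3 + R5
def I2(R): R["R4"] = R["R3"] + 1         # I2: R4 := R3 + 1 (reads I1's result)
def I3(R): R["R3"] = R["R5"] + 1         # I3: R3 := R5 + 1 (writes R3 again)

def run(order):
    R = {"R3": 2, "R4": 0, "R5": 7}      # arbitrary starting values
    for instr in order:
        instr(R)
    return R

print(run([I1, I2, I3]))  # {'R3': 8, 'R4': 10, 'R5': 7}  program order
# If I3 completes before I1, I1's stale write to R3 lands last, so the
# final R3 is wrong (output dependency); R4 is also wrong here because
# I2 ran before I1 (data dependency).
print(run([I3, I2, I1]))  # {'R3': 15, 'R4': 9, 'R5': 7}
```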
In-Order Issue Out-of-Order Completion
(Diagram)
Out-of-Order Issue Out-of-Order Completion (Diagram)
VLIW Machines
Very Long Instruction Word (VLIW) Processors
• The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
• To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
• These instructions are packed and dispatched together, hence the name very long instruction word.
• This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA-64 processors.
Very Long Instruction Word (VLIW) Processors: Considerations
• Issue hardware is simpler.
• The compiler has a bigger context from which to select co-scheduled instructions.
• Compilers, however, do not have runtime information such as cache misses; scheduling is therefore inherently conservative.
• Branch and memory prediction is more difficult.
• VLIW performance is highly dependent on the compiler: techniques such as loop unrolling, speculative execution, and branch prediction are critical.
• Typical VLIW processors are limited to 4-way to 8-way parallelism.
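Loop unrolling, mentioned above, is one such compile-time transformation; a minimal sketch of the idea (Python, not from the slides, with an illustrative saxpy-style loop):

```python
# Minimal sketch (not from the slides): loop unrolling, one compile-time
# transformation a VLIW compiler uses to expose independent operations
# that can be packed into the same long instruction word.

def saxpy(a, x, y):
    # Rolled loop: one multiply-add per iteration.
    for i in range(len(x)):
        y[i] += a * x[i]

def saxpy_unrolled(a, x, y):
    # Unrolled by 4: the four multiply-adds in the body are independent,
    # so a VLIW compiler could schedule them into one bundle.
    n = len(x) - len(x) % 4
    for i in range(0, n, 4):
        y[i]     += a * x[i]
        y[i + 1] += a * x[i + 1]
        y[i + 2] += a * x[i + 2]
        y[i + 3] += a * x[i + 3]
    for i in range(n, len(x)):   # leftover elements
        y[i] += a * x[i]

x, y = list(range(10)), [1.0] * 10
saxpy_unrolled(2.0, x, y)
print(y)  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0]
```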
The VLIW Architecture
VLIW Architecture
Comparison: CISC, RISC, VLIW