CSC 313 Module 3 Pipelining

The document discusses CPU operation with a focus on pipelining, which overlaps the execution of multiple instructions to speed up processing. It uses a laundry analogy to illustrate the efficiency gained through pipelining, reducing the time taken from 8 hours to 3.5 hours for four loads. Additionally, it covers the advantages and disadvantages of pipelining, various hazards that can disrupt execution, and performance measures such as speedup and efficiency in parallel architectures.

CSC 313

Module 3

CPU OPERATION

Pipelining
What is Pipelining?

◼ Overlap execution of multiple instructions

◼ A way of speeding up execution of instructions
The Laundry Analogy

◼ Ann, Bob, Cole, and Dora (A, B, C, D) each have one load of clothes to wash, dry, and fold

◼ Washer takes 30 minutes

◼ Dryer takes 30 minutes

◼ “Folder” takes 30 minutes

◼ “Stasher” takes 30 minutes to put clothes into drawers
If we do laundry sequentially...

[Timeline: 6 PM to 2 AM; tasks A, B, C, and D run back to back, each occupying four 30-minute stages]

◼ Sequential laundry takes 8 hours for 4 loads

◼ If they learned pipelining, how long would laundry take?
To Pipeline, We Overlap Tasks

[Timeline: 6 PM to 9:30 PM; loads A, B, C, and D each start 30 minutes apart, their stages overlapping]

◼ Pipelined laundry takes 3.5 hours for 4 loads
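The laundry numbers above can be checked with a short calculation. This is an illustrative sketch, not part of the original slides; the function names are our own:

```python
def sequential_time(loads, stages, stage_minutes=30):
    # Each load must finish all stages before the next load starts.
    return loads * stages * stage_minutes

def pipelined_time(loads, stages, stage_minutes=30):
    # A new load enters the washer every stage_minutes; after the last
    # load enters, (stages - 1) more steps drain the pipeline.
    return (loads + stages - 1) * stage_minutes

print(sequential_time(4, 4) / 60)  # 8.0 hours
print(pipelined_time(4, 4) / 60)   # 3.5 hours
```

The same formula explains instruction pipelines: total cycles ≈ (instructions + stages − 1) once the pipeline keeps one instruction per stage.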
What is Pipelining?
◼ The process of creating a queue of fetched, decoded, and executed instructions

◼ Executing a second instruction before the first has been completed
Example

[Diagram: four sample instructions, executed linearly one after another]

IF ID EX M W
1
IF ID EX M W
1
IF ID EX M W
1
IF ID EX M W

Four Pipelined Instructions

Computer Science Dept., UI 10


Instruction Fetch

◼ IF – obtaining the requested instruction from memory.

❑ The instruction and the program counter (which is incremented to point to the next instruction) are stored in the IF/ID pipeline register as temporary storage so that they may be used in the next stage at the start of the next clock cycle.
Instruction Decode

◼ ID – decoding the instruction and sending out the various control lines to the other parts of the processor.

◼ The instruction is sent to the control unit, where it is decoded, and the registers are fetched from the register file.
Execution

◼ EX – where any calculations are performed.

◼ The main component in this stage is the ALU.

◼ The ALU provides arithmetic and logic capabilities.
Memory and IO

◼ The Memory and IO (MEM) stage:

◼ storing and loading values to and from memory

◼ input to or output from the processor

◼ If the current instruction is not a Memory or IO type, then the result from the ALU is passed through to the write back stage.
Write Back

◼ WB – responsible for writing the result of a calculation, memory access, or input into the register file.
Advantages

❑ More efficient use of the processor

❑ Quicker execution of a large number of instructions
Disadvantages

◼ Involves adding hardware to the chip

◼ Inability to continuously run the pipeline at full speed because of pipeline hazards, which disrupt the smooth execution of the pipeline
It's Not That Easy for Computers
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.

❑ Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
◼ An instruction uses the result of the previous instruction. A hazard occurs exactly when an instruction tries to read a register in its ID stage that an earlier instruction intends to write in its WB stage.

❑ Structural hazards: two instructions need to access the same resource
◼ The hardware cannot support this combination of instructions - two dogs fighting for the same bone.

❑ Control hazards: the location of an instruction depends on a previous instruction
◼ Caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
Data Hazards

ADD R1, R2, R3   IF ID EX M WB      (ID selects R2 and R3 for the ALU; WB stores the sum in R1)

SUB R4, R1, R5      IF ID EX M WB   (ID selects R1 and R5 for the ALU)
Stalling

◼ Halting the flow of instructions until the required result is ready to be used.

◼ However, stalling wastes processor time by doing nothing while waiting for the result.

◼ Example:

ADD R1, R2, R3   IF ID EX M WB
STALL               IF ID EX M WB
STALL                  IF ID EX M WB
STALL                     IF ID EX M WB
SUB R4, R1, R5               IF ID EX M WB

Computer Science Dept., UI
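The cost of those stalls can be counted with a simple cycle formula. This is a hedged sketch assuming one instruction completes per cycle once the pipeline is full; the function name is our own:

```python
def pipeline_cycles(num_instructions, stages=5, stalls=0):
    # Without stalls: the first instruction takes `stages` cycles, and
    # each following instruction completes one cycle later. Every stall
    # cycle inserted into the pipeline delays completion by one cycle.
    return stages + (num_instructions - 1) + stalls

print(pipeline_cycles(2))            # ADD then SUB back to back: 6 cycles
print(pipeline_cycles(2, stalls=3))  # with 3 stall cycles: 9 cycles
```

Three wasted cycles on every such dependency is why real processors forward results between stages instead of always stalling.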


Quantitative Performance Analysis of Instruction Set

To estimate machine performance:

▪ The most important measure - execution time, T.

▪ Speedup, S: the effect of the improvement
o the ratio of the execution time without the improvement (Two) to the execution time with the improvement (Tw):

    S = Two / Tw        (1)
Example

▪ If adding a 1 MB cache module to a computer system results in lowering the execution time of some benchmark program from 12 seconds to 8 seconds, then the speedup is:

• 12/8 = 1.5, or 50%.

• An equation to calculate speedup as a direct percent can be represented as:

    S = (Two − Tw) / Tw × 100        (2)
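Equations (1) and (2) applied to the cache example can be sketched as follows (function names are our own):

```python
def speedup(t_without, t_with):
    # Equation (1): ratio of execution times.
    return t_without / t_with

def speedup_percent(t_without, t_with):
    # Equation (2): speedup expressed as a direct percent.
    return (t_without - t_with) / t_with * 100

print(speedup(12, 8))          # 1.5
print(speedup_percent(12, 8))  # 50.0
```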
Given

o the clock period, τ,
o CPI (the average number of clock cycles per instruction), and
o IC (a count of the number of instructions executed by the program during its execution),

the total execution time for the program is:

    T = IC × CPI × τ = (IC × CPI) / Clock Rate        (3)

where

    CPI = Clock cycles (for the program) / IC        (4)

    CPU Clock Cycles = Instructions for a Program × Average Clock Cycles per Instruction        (5)

Hence,

    S = (ICwo × CPIwo × τwo − ICw × CPIw × τw) / (ICw × CPIw × τw) × 100        (6)
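Equation (3) can be evaluated directly. The program parameters below are hypothetical, chosen only to illustrate the formula:

```python
def execution_time(ic, cpi, clock_period):
    # Equation (3): T = IC x CPI x tau (tau = clock period in seconds).
    return ic * cpi * clock_period

# Hypothetical program: 1 million instructions, average CPI of 2,
# on a 1 GHz machine (clock period = 1 ns).
t = execution_time(1_000_000, 2, 1e-9)
print(t)  # 0.002 seconds
```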
Example
Suppose we wish to estimate the speedup obtained by replacing a CPU having an average CPI of 5 with another CPU having an average CPI of 3.5, with the clock period increased from 100 ns to 120 ns.

The equation above becomes:

    S = (5 × 100 − 3.5 × 120) / (3.5 × 120) × 100 ≈ 19%
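The CPU-replacement example works out with equation (6); since the instruction count is unchanged, IC cancels. A sketch (function name is our own):

```python
def percent_speedup(cpi_old, tau_old, cpi_new, tau_new):
    # Equation (6) with ICwo = ICw, so the instruction count cancels.
    t_old = cpi_old * tau_old
    t_new = cpi_new * tau_new
    return (t_old - t_new) / t_new * 100

# CPI 5 at 100 ns replaced by CPI 3.5 at 120 ns:
print(round(percent_speedup(5, 100, 3.5, 120)))  # 19
```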
Exercise
1. Calculate the speedup that can be expected if a 200 MHz Pentium chip is replaced with a 300 MHz Pentium chip, if all other parameters remain unchanged.

2. What is the speedup that can be expected if the instruction set of a certain machine is changed so that the branch instruction takes 1 clock cycle instead of 3 clock cycles, if branch instructions account for 20% of all instructions executed by a certain program? Assume that other instructions average 3 clock cycles per instruction, and that nothing else is altered by the change.

3. Given 100 processors for a computation with 5% of the code that cannot be parallelized, compute the speedup and efficiency.
Parallel Architectures

▪ One method of improving the performance of a processor is to decrease the time needed to execute instructions.

▪ Parallel processing: a number of processors work collectively, in parallel, on a common problem.

▪ The approach is to increase the number of processors, and to decompose and distribute a single program onto the processors.
Measures of Performance of Pipelined Architectures

✓ Parallel time

✓ Speedup

✓ Efficiency

✓ Throughput
Measures of Performance

❑ Parallel time: the absolute time needed for a program to execute on a parallel processor.

❑ Speedup: the ratio of the time for a program to execute on a sequential (that is, non-parallel) processor to the time for that same program to execute on a parallel processor:

    S = TSequential / TParallel
Measures of Performance
• If we want to achieve a speedup of 100, it is not enough to simply distribute a single program over 100 processors.

• If there are even a small number of sequential operations in a parallel program, then the speedup can be significantly limited.

• Amdahl's law: speedup is expressed in terms of the number of processors p and the fraction of operations that must be performed sequentially, f:

    S = 1 / (f + (1 − f) / p)
Example

If f = 10% of the operations must be performed sequentially, then speedup can be no greater than 10 regardless of how many processors are used:

    S = 1 / (0.1 + 0.9/10) ≅ 5.3        (p = 10 processors)

    S = 1 / (0.1 + 0.9/∞) = 10          (p = ∞ processors)
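Amdahl's law is easy to tabulate. A sketch (the function name is our own; a very large p stands in for p = ∞):

```python
def amdahl_speedup(f, p):
    # Amdahl's law: f = fraction of operations that must run
    # sequentially, p = number of processors.
    return 1 / (f + (1 - f) / p)

print(round(amdahl_speedup(0.10, 10), 1))  # 5.3
print(round(amdahl_speedup(0.10, 10**9)))  # 10, the ceiling for f = 10%
```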

This brings us to measurements of efficiency.

Efficiency
o The ratio of speedup to the number of processors used.

o For a speedup of 5.3 with 10 processors, the efficiency is:

    5.3 / 10 = 0.53, or 53%
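Combining Amdahl's law with the efficiency ratio shows how efficiency falls as processors are added (an illustrative sketch; function names are our own):

```python
def amdahl_speedup(f, p):
    # f = sequential fraction, p = number of processors.
    return 1 / (f + (1 - f) / p)

def efficiency(speedup, processors):
    # Efficiency = speedup per processor used.
    return speedup / processors

for p in (10, 20):
    s = amdahl_speedup(0.10, p)
    print(p, round(s, 1), f"{efficiency(s, p):.0%}")
```

For f = 10%, doubling from 10 to 20 processors raises speedup only from 5.3 to 6.9 while efficiency drops from 53% to 34%.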
Measures of Performance
▪ If we double the number of processors to 20, then the speedup increases to 6.9 but the efficiency drops to 34%.

• Thus, parallelizing an algorithm can improve performance only to a limit that is determined by the amount of sequential operations.

• Efficiency is drastically reduced as speedup approaches its limit, and so it does not make sense to simply use more processors in a computation in the hope that a corresponding gain in performance will be achieved.
Examples (Throughput/Performance)

❑ Throughput: the number of tasks completed per given time
• a measure of how much computation is achieved over time

◼ Replace the processor with a faster version?
❑ 3.8 GHz instead of 3.2 GHz

◼ Add an additional processor to a system?
❑ Core Duo instead of P4
Amdahl’s Law
◼ Mainly used to predict the theoretical maximum speedup for program processing using multiple processors.

◼ Focus on overall performance, not one aspect.
Assignment/Seminar Topics

Distinguish between the following:

• Superscalar
• Super-pipelining
• In-order and out-of-order execution
• VLIW
• Vector vs. scalar processors
• CISC vs. RISC
Superscalar Machines

▪ Have several separate execution units and issue more than one instruction per cycle (3 or 4 is common).

▪ A superscalar machine executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
Superpipelined

◼ Many pipeline stages need less than half a clock cycle.

◼ Multiple pipelined operations proceed at the same time.
❑ A collection of multiple pipelines that can operate simultaneously.

• So you can get multiple operations per CPU cycle.

• Divide each processor cycle into two or more subcycles.

• Executes two instructions per cycle (i.e., one instruction per subcycle).
Superscalar vs. Superpipelining
◼ Superpipelining:
❑ Vaguely defined as deep pipelining, i.e., lots of stages
◼ Superscalar issue complexity limits super-pipelining
◼ How do they compare?
❑ 2-way superscalar vs. twice the stages
❑ Not much difference.

[Diagram: a 2-way superscalar fetches, decodes, and executes two instructions at once; a superpipeline splits each fetch, decode, and execute stage into two subcycles (F1 F2 D1 D2 E1 E2)]
Superscalar vs. Superpipelining

[Diagrams: pipeline timing charts comparing superscalar and superpipelined execution of the same instruction stream]
In-Order Issue, In-Order Completion
◼ Issue instructions in the order they occur
◼ Not very efficient
◼ May fetch more than one instruction
◼ Instructions must stall if necessary

[Diagram: in-order issue with in-order completion]
In-Order Issue, Out-of-Order Completion
◼ Output dependency
❑ R3 := R3 + R5; (I1)
❑ R4 := R3 + 1; (I2)
❑ R3 := R5 + 1; (I3)
❑ I2 depends on the result of I1 - a data dependency
❑ If I3 completes before I1, R3 will hold the wrong value - an output (write-write) dependency

[Diagrams: in-order issue with out-of-order completion; out-of-order issue with out-of-order completion]
VLIW Machines

◼ Very Long Instruction Word machines typically have many more functional units than superscalars (and thus need longer instructions - 256 to 1024 bits - to provide control for them).

◼ These machines mostly use microprogrammed control units with relatively slow clock rates because of the need to use ROM to hold the microcode.
Very Long Instruction Word (VLIW) Processors
• The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
• To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
• These instructions are packed and dispatched together, hence the name very long instruction word.
• This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA64 processors.
Very Long Instruction Word (VLIW) Processors: Considerations
• Issue hardware is simpler.
• The compiler has a bigger context from which to select co-scheduled instructions.
• Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative.
• Branch and memory prediction is more difficult.
• VLIW performance is highly dependent on the compiler. A number of techniques such as loop unrolling, speculative execution, and branch prediction are critical.
• Typical VLIW processors are limited to 4-way to 8-way parallelism.
The VLIW Architecture

• A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length.

• Multiple functional units are used concurrently in a VLIW processor.

• All functional units share the use of a common large register file.
VLIW Architecture

• VLIW = Very Long Instruction Word
• Instructions are usually hundreds of bits long.
• Each instruction word essentially carries multiple “short instructions.”
• Each of the “short instructions” is effectively issued at the same time.
• (This is related to the long words frequently used in microcode.)
• Compilers for VLIW architectures should optimally try to predict branch outcomes to properly group instructions.


Pipelining in VLIW Processors
• Decoding of instructions is easier in VLIW than in superscalars, because each “region” of an instruction word is usually limited as to the type of instruction it can contain.

• Code density in VLIW is less than in superscalars, because if a “region” of a VLIW word isn't needed in a particular instruction, it must still exist (to be filled with a “no op”).

• Superscalars can be compatible with scalar processors; this is difficult with VLIW parallel and non-parallel architectures.
VLIW Opportunities

• “Random” parallelism among scalar operations is exploited in VLIW, instead of the regular parallelism of a vector or SIMD machine.

• The efficiency of the machine is entirely dictated by the success, or “goodness,” of the compiler in planning the operations to be placed in the same instruction words.

• Different implementations of the same VLIW architecture may not be binary-compatible with each other, resulting in different latencies.


VLIW Summary

• VLIW reduces the effort required to detect parallelism using hardware or software techniques.

• The main advantage of VLIW architecture is its simplicity in hardware structure and instruction set.

• Unfortunately, VLIW does require careful analysis of code in order to “compact” the most appropriate “short” instructions into a VLIW word.


Advantages of VLIW
The compiler prepares fixed packets of multiple operations that give the full "plan of execution":
– dependencies are determined by the compiler and used to schedule according to function unit latencies
– function units are assigned by the compiler and correspond to the position within the instruction packet ("slotting")
– the compiler produces fully-scheduled, hazard-free code => the hardware doesn't have to "rediscover" dependencies or schedule
Disadvantages of VLIW

Compatibility across implementations is a major problem
– VLIW code won't run properly with a different number of function units or different latencies
– unscheduled events (e.g., a cache miss) stall the entire processor

Code density is another problem
– low slot utilization (mostly nops)
– reduce nops by compression ("flexible VLIW", "variable-length VLIW")
Vector Processors

• A vector processor is a coprocessor designed to perform vector computations.

• A vector is a one-dimensional array of data items (each of the same data type).

• Vector processors are often used in multipipelined supercomputers.

• Architectural types include:
– register-to-register (with shorter instructions and register files)
– memory-to-memory (longer instructions with memory addresses)
Comparison: CISC, RISC, VLIW
