Unit-6 Pipelining
Contents
6.1 Introduction: Parallel Processing, Multiple Functional Units, Flynn's Classification
6.2 Pipelining: Concept and Demonstration with Example, Speedup Equation, Floating-Point Addition and Subtraction with Pipelining
6.3 Instruction-Level Pipelining: Instruction Cycle, Three- and Four-Segment Instruction Pipeline, Pipeline Conflicts and Solutions (Resource Hazards, Data Hazards, Branch Hazards)
6.4 Vector Processing: Concept and Applications, Vector Operations, Matrix Multiplication
Parallel processing
• Parallel processing is a term used to denote a large class of techniques that are used to provide
simultaneous data-processing tasks for the purpose of increasing the computational speed of a
computer system.
• Instead of processing each instruction sequentially as in a conventional computer, a parallel
processing system is able to perform concurrent data processing to achieve faster execution time.
• For example, while an instruction is being executed in the ALU, the next instruction can be read from memory. The system may have two or more ALUs and be able to execute two or more instructions at the same time, or it may have two or more processors operating concurrently.
• The purpose of parallel processing is to speed up the computer processing capability and increase its
throughput, that is, the amount of processing that can be accomplished during a given interval of
time.
• The amount of hardware increases with parallel processing, and with it the cost of the system increases.
However, technological developments have reduced hardware costs to the point where parallel
processing techniques are economically feasible.
SISD
• SISD (Single Instruction stream, Single Data stream) is the conventional uniprocessor organization.
• Instructions are decoded by the Control Unit and then the Control Unit sends the instructions to the processing units for execution.
• The data stream flows between the processor and memory bi-directionally.
• Examples: older generation computers, minicomputers, and workstations.
MISD
• MISD structure is only of theoretical interest since no practical system has been constructed using this
organization.
• In MISD, multiple processing units operate on one single data stream. Each processing unit operates on the data independently via a separate instruction stream.
[Figure: MISD organization: several control units (CU) and processors (P), each fed its own instruction stream from memory (M), all operating on a single shared data stream.]
SIMD
• SIMD represents an organization that includes many processing units under the supervision of a common control unit. All processors receive the same instruction from the control unit but operate on different items of data.
[Figure: SIMD organization: a control unit broadcasting instructions to multiple processing units, which access memory modules over a common data bus.]
Pipelining
• A pipeline can be visualized as a collection of processing segments through which binary information flows. Each segment performs partial processing dictated by the way the task is partitioned.
• Example: evaluate Ai * Bi + Ci for i = 1, 2, 3, … using three segments:
  Segment 1: R1 ← Ai, R2 ← Bi (input Ai and Bi)
  Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply, and transfer Ci)
  Segment 3: R5 ← R3 + R4 (add Ci to the product)
• The first clock pulse transfers A1 and B1 into R1 and R2.
• The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4.
[Table: contents of registers R1, R2, R3, R4, R5 in segments 1 to 3 for each clock pulse number.]
Space-time diagram for a four-segment pipeline executing six tasks (Ti = task i); the n = 6 tasks finish in k + n - 1 = 9 clock cycles:

Clock cycle:  1   2   3   4   5   6   7   8   9
Segment 1:    T1  T2  T3  T4  T5  T6
Segment 2:        T1  T2  T3  T4  T5  T6
Segment 3:            T1  T2  T3  T4  T5  T6
Segment 4:                T1  T2  T3  T4  T5  T6
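The flow of tasks through such a pipeline can be traced with a short simulation. This is an illustrative sketch (the function name is invented here): it tracks which task each segment holds on every clock pulse and, for brevity, evaluates Ai*Bi + Ci when the task leaves the final segment instead of modeling each register individually.

```python
def pipeline_multiply_add(A, B, C):
    """Trace task flow through the three-segment pipeline computing Ai*Bi + Ci."""
    n = len(A)
    seg1 = seg2 = None          # index of the task currently held in each segment
    out = []
    for clock in range(n + 2):  # n tasks need n + 2 clock pulses to drain
        if seg2 is not None:    # Segment 3: R5 <- R3 + R4 completes a task
            i = seg2
            out.append(A[i] * B[i] + C[i])
        seg2 = seg1             # Segment 2: R3 <- R1 * R2, R4 <- Ci
        seg1 = clock if clock < n else None  # Segment 1: R1 <- Ai, R2 <- Bi
    return out

# A1*B1 + C1 = 11, A2*B2 + C2 = 18, A3*B3 + C3 = 27
print(pipeline_multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))
```

Note that the first result emerges only after three clock pulses (the pipeline fill time); thereafter one result completes per pulse.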
Conventional (non-pipelined) execution, where each instruction completes before the next begins:

i:    FI DA FO EX
i+1:              FI DA FO EX
i+2:                          FI DA FO EX

Pipelined execution, where the phases of successive instructions overlap:

i:    FI DA FO EX
i+1:     FI DA FO EX
i+2:        FI DA FO EX
Pipeline Speedup
n: number of tasks to be performed
k: number of segments (stages)

Conventional machine (non-pipelined):
tn: time required to complete one task
t1: time required to complete the n tasks
Total time t1 = n * tn

Pipelined machine (k stages):
tp: clock cycle time (the time to complete each suboperation, set by the largest segment)
tk: time required to complete the n tasks
The first task takes k clock cycles to pass through all segments; the remaining (n - 1) tasks then complete at one per cycle, taking (n - 1) * tp.
Total time tk = (k + n - 1) * tp

Speedup:
Sk = non-pipelined execution time / pipelined execution time = n * tn / ((k + n - 1) * tp)
For n much larger than k, Sk approaches tn / tp; if one non-pipelined task takes as long as the full pipeline (tn = k * tp), then Sk approaches k, the number of stages.
Example:
- 4-stage pipeline (k = 4)
- suboperation in each stage: tp = 20 ns
- 100 tasks to be executed (n = 100)
- 1 task in the non-pipelined system takes tn = k * tp = 4 * 20 = 80 ns

Pipelined system: (k + n - 1) * tp = (4 + 100 - 1) * 20 = 103 * 20 = 2060 ns
Non-pipelined system: n * tn = n * k * tp = 100 * 80 = 8000 ns
Speedup: Sk = 8000 / 2060 = 3.88
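The speedup formulas above can be checked with a short script. This is a sketch (the function name `pipeline_times` is invented here) that assumes tn = k * tp, as in the example:

```python
def pipeline_times(n, k, tp):
    """Return (non-pipelined time, pipelined time, speedup), assuming tn = k*tp."""
    tn = k * tp                   # one task in the non-pipelined system
    t_seq = n * tn                # n tasks run back to back
    t_pipe = (k + n - 1) * tp     # k cycles fill the pipe, then one task per cycle
    return t_seq, t_pipe, t_seq / t_pipe

# The example above: k = 4 stages, tp = 20 ns, n = 100 tasks
t_seq, t_pipe, s = pipeline_times(100, 4, 20)
print(t_seq, t_pipe, round(s, 2))   # 8000 2060 3.88
```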
Q. A non-pipelined system takes 130 ns to process an instruction. A program of 1000 instructions is executed on this non-pipelined system. The same program is then processed on a 5-segment pipelined processor with a clock cycle of 30 ns per stage. Determine the speedup ratio of the pipeline.
Solution:
For the non-pipelined system:
Total number of instructions/tasks (n) = 1000; time required to perform a single task (tn) = 130 ns
Total time to execute 1000 instructions in the non-pipelined model = 1000 * 130 ns = 130000 ns
For the pipelined system:
Total number of stages (k) = 5; total number of instructions/tasks (n) = 1000
Clock cycle time of the pipelined processor (tp) = 30 ns
Total time to execute 1000 instructions in the pipelined model = (k + n - 1) * tp = (5 + 1000 - 1) * 30 ns = 1004 * 30 ns = 30120 ns
∴ Speedup of pipeline = time for non-pipelined model / time for pipelined model = (1000 * 130) / (1004 * 30) = 4.316 (speedup is a ratio and has no units).
Q. Consider a pipeline having 4 phases with durations 60, 50, 90 and 80 ns. The latch delay is 10 ns.
Calculate:
1. Pipeline cycle time
2. Non-pipeline execution time
3. Speedup ratio
4. Pipeline time for 1000 tasks
5. Sequential time for 1000 tasks
6. Throughput

Solution:
Given: four-stage pipeline; delay of stages = 60, 50, 90 and 80 ns; latch delay (delay due to each register) = 10 ns

1. Pipeline cycle time:
Cycle time = maximum delay due to any stage + delay due to its register
= Max{60, 50, 90, 80} + 10 ns = 90 ns + 10 ns = 100 ns

2. Non-pipeline execution time:
Non-pipeline execution time for one instruction = 60 ns + 50 ns + 90 ns + 80 ns = 280 ns

3. Speedup ratio:
Speedup = non-pipeline execution time / pipeline execution time = 280 ns / cycle time = 280 ns / 100 ns = 2.8

4. Pipeline time for 1000 tasks:
= time taken for the 1st task + time taken for the remaining 999 tasks
= 1 x 4 clock cycles + 999 x 1 clock cycle
= 4 x 100 ns + 999 x 100 ns = 400 ns + 99900 ns = 100300 ns

5. Sequential time for 1000 tasks:
= 1000 x time taken for one task = 1000 x 280 ns = 280000 ns

6. Throughput:
Throughput for pipelined execution = number of instructions executed per unit time
= 1000 tasks / 100300 ns ≈ 9.97 x 10^6 instructions per second
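The same six quantities can be computed programmatically. This is a sketch (the function name `pipeline_metrics` is invented here); times are in nanoseconds:

```python
def pipeline_metrics(stage_delays, latch_delay, n_tasks):
    """Cycle time, speedup, and throughput for a pipeline with unequal stages."""
    k = len(stage_delays)
    cycle = max(stage_delays) + latch_delay    # slowest stage plus register delay
    t_one = sum(stage_delays)                  # one instruction, non-pipelined
    t_pipe = (k + n_tasks - 1) * cycle         # (k + n - 1) cycles for n tasks
    t_seq = n_tasks * t_one                    # n instructions, non-pipelined
    speedup = t_one / cycle                    # steady-state speedup per instruction
    throughput = n_tasks / t_pipe              # tasks per ns
    return cycle, t_one, speedup, t_pipe, t_seq, throughput

print(pipeline_metrics([60, 50, 90, 80], 10, 1000))
```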
Multiple Functional Units
• A four-stage pipeline is basically equivalent to a system with four identical functional units operating in parallel.
[Figure: four parallel functional units P1, P2, P3, P4, each processing every fourth task.]
Arithmetic pipeline
• Arithmetic Pipeline: Pipeline arithmetic units are usually found in very high-speed computers. They
are used to implement floating-point operations, multiplication of fixed-point numbers, and similar
computations encountered in scientific problems.
• A pipeline multiplier is essentially an array multiplier, with special adders designed to minimize the carry propagation time through the partial products.
• The floating point addition and subtraction can be performed in four segments.
• We will discuss an example of a pipeline unit for floating-point addition and subtraction.
• The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers:
X = A x 2^a
Y = B x 2^b
Floating-point adder: the suboperations performed in the four segments are:
1. Compare the exponents
2. Align the mantissas
3. Add/subtract the mantissas
4. Normalize the result

Example:
X = 0.9504 x 10^3
Y = 0.8200 x 10^2
Compare the exponents: 3 - 2 = 1
Align the mantissa of the smaller operand:
X = 0.9504 x 10^3
Y = 0.0820 x 10^3
Add the mantissas:
Z = X + Y = 0.9504 x 10^3 + 0.0820 x 10^3 = 1.0324 x 10^3
Normalize the result:
Z = 0.10324 x 10^4
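The four suboperations can be sketched in code. This is a decimal toy model of the pipeline's data path, not a hardware description; the function name `fp_add` and the (mantissa, exponent) tuple representation are assumptions of the sketch:

```python
def fp_add(x_mant, x_exp, y_mant, y_exp):
    """Four-segment floating-point addition on decimal (mantissa, exponent) pairs."""
    # Segment 1: compare the exponents (swap so x holds the larger exponent)
    if x_exp < y_exp:
        (x_mant, x_exp), (y_mant, y_exp) = (y_mant, y_exp), (x_mant, x_exp)
    # Segment 2: align the mantissa of the smaller operand
    y_mant = y_mant / 10 ** (x_exp - y_exp)
    # Segment 3: add the mantissas
    z_mant, z_exp = x_mant + y_mant, x_exp
    # Segment 4: normalize the result so the mantissa lies in [0.1, 1)
    while z_mant >= 1:
        z_mant /= 10
        z_exp += 1
    while z_mant and z_mant < 0.1:
        z_mant *= 10
        z_exp -= 1
    return z_mant, z_exp

m, e = fp_add(0.9504, 3, 0.8200, 2)
print(round(m, 5), e)   # 0.10324 4
```

In the real pipeline each segment works on a different pair of operands simultaneously; here the segments simply run in sequence for one pair.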
[Figure: pipeline for floating-point addition/subtraction: exponents a and b enter the exponent path, mantissas A and B enter the mantissa path, with registers (R) separating the successive segments.]
Four-Segment Instruction Pipeline
Segment 1: Fetch instruction from memory (FI)
Segment 2: Decode instruction and calculate effective address (DA)
Segment 3: Fetch operand from memory (FO)
Segment 4: Execute instruction (EX)
[Flowchart: after Segment 2 a branch test is made; if the instruction is a branch, the pipe is emptied and fetching restarts from the branch address. After execution an interrupt test is made; if an interrupt is pending it is handled, otherwise the PC is updated and the next instruction is fetched.]
Timing of the instruction pipeline when instruction 3 is a branch:

Clock cycle:    1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1:  FI  DA  FO  EX
2:                  FI  DA  FO  EX
3 (Branch):             FI  DA  FO  EX
4:                          FI  --  --  FI  DA  FO  EX
5:                                          FI  DA  FO  EX
6:                                              FI  DA  FO  EX
7:                                                  FI  DA  FO  EX

Instruction 4 is fetched in cycle 4, but because instruction 3 is a branch this fetch is discarded; once the branch executes (cycle 6) the pipe is refilled from the branch address, so the useful fetch of instruction 4 occurs in cycle 7.
1. Structural hazards (resource conflicts) are caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
2. Data hazards (data dependency conflicts) arise when an instruction depends on the result of a previous instruction, but this result is not yet available.
3. Control hazards (branch difficulties) arise when branches and other instructions that change the PC delay the fetch of the next instruction: a jump cannot load the branch address into the PC until it has been decoded, so a bubble (idle cycle) enters the pipeline.
Structural hazards (Resource Conflicts)
• Occur when two instructions require a given hardware resource at the same time.
• Example: with one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle.

i:    FI  DA  FO  EX
i+1:      FI  DA  FO  EX
i+2:          --  FI  DA  FO  EX

• Instruction i+2's fetch is delayed one cycle because instruction i's operand fetch (FO) needs the memory in that cycle; the conflict disappears with separate instruction and data memories.
• Data hazards can be dealt with by either hardware techniques or software techniques.
• Similarly, an address dependency may occur when an operand address cannot be calculated because
the information needed by the addressing mode is not available.
Data Hazard Classification
Three types of data hazards:
• RAW (read after write): an instruction tries to read an operand before a previous instruction has written it. This is the common case, illustrated below.
• WAR (write after read): an instruction tries to write an operand before a previous instruction has read it.
• WAW (write after write): two instructions write the same operand out of order.

Example:
LOAD:  R1 ← M[address 1]
LOAD:  R2 ← M[address 2]
ADD:   R3 ← R1 + R2
STORE: M[address 3] ← R3
• There will be a data conflict in instruction 3 because the operand in R2 is not yet available in the A
segment.
• This can be seen from the timing of the pipeline shown in Figure (a).
• The E segment in clock cycle 4 is in a process of placing the memory data into R2.
• The A segment in clock cycle 4 is using the data from R2.
• It is up to the compiler to make sure that the instruction following the load instruction does not use the data being fetched from memory.
• This concept of delaying the use of the data loaded from memory is referred to as delayed load.
• Figure (b) shows the same program with a no-op instruction inserted after the load to R2 instruction.
• Thus the no-op instruction is used to advance one clock cycle in order to compensate for the data
conflict in the pipeline.
• The advantage of the delayed load approach is that the data dependency is taken care of by the
compiler rather than the hardware .
Figure (a): Three-segment pipeline timing with data conflict.
Figure (b): Three-segment pipeline timing with delayed load.
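The delayed-load idea, where the compiler rather than the hardware inserts a no-op after a load whose result the next instruction needs, can be sketched as a small program-rewriting pass. The tuple instruction format and the function name are invented for this example:

```python
def insert_delayed_load_noops(program):
    """Insert a NOP after any LOAD whose destination register is read by the
    immediately following instruction (a delayed-load compiler pass)."""
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        if op == "LOAD" and i + 1 < len(program):
            _, _, next_srcs = program[i + 1]
            if dest in next_srcs:           # next instruction reads the loaded value
                out.append(("NOP", None, ()))
    return out

# The four-instruction example above, as (opcode, destination, sources) tuples
prog = [
    ("LOAD",  "R1", ("M1",)),
    ("LOAD",  "R2", ("M2",)),
    ("ADD",   "R3", ("R1", "R2")),
    ("STORE", "M3", ("R3",)),
]
for ins in insert_delayed_load_noops(prog):
    print(ins)   # a NOP appears only after the second LOAD
```

Only the load to R2 gets a no-op, because the first load's result (R1) is not read by the instruction immediately after it, matching the pipeline timing of Figure (b).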
Delayed Branch
• The method used in most RISC processors is to rely on the compiler to redefine the branches so that they
take effect at the proper time in the pipeline. This method is referred to as delayed branch.
• The compiler is designed to analyze the instructions before and after the branch and rearrange the
program sequence by inserting useful instructions in the delay steps.
• It is up to the compiler to find useful instructions to put after the branch instruction. Failing that, the
compiler can insert no-op instructions.
• An Example of Delayed Branch
• The program for this example consists of five instructions.
• Load from memory to R1
• Increment R2
• Add R3 to R4
• Subtract R5 from R6
• Branch to address X
• In Figure(a) the compiler inserts two no-op instructions after the branch.
• The branch address X is transferred to PC in clock cycle 7 .
• The program in Figure(b) is rearranged by placing the add and subtract instructions after the branch
instruction.
• PC is updated to the value of X in clock cycle 5.
Figure (a): Using no-operation instructions

Clock cycle:     1  2  3  4  5  6  7  8  9  10
1. Load          I  A  E
2. Increment        I  A  E
3. Add                 I  A  E
4. Subtract               I  A  E
5. Branch to X               I  A  E
6. NOP                          I  A  E
7. NOP                             I  A  E
8. Instr. in X                        I  A  E

Figure (b): Rearranging the instructions

Clock cycle:     1  2  3  4  5  6  7  8
1. Load          I  A  E
2. Increment        I  A  E
3. Branch to X         I  A  E
4. Add                    I  A  E
5. Subtract                  I  A  E
6. Instr. in X                  I  A  E
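The cycle counts in the two figures follow the same (k + n - 1) pattern used earlier, here with k = 3 stages (I, A, E). A quick check, with a helper name invented for the sketch:

```python
def completion_cycle(n_instructions, stages=3):
    """Cycle in which the last of n instructions leaves a `stages`-deep pipeline."""
    return n_instructions + stages - 1

print(completion_cycle(8))  # 10: Figure (a), 5 instructions + 2 NOPs + instr. at X
print(completion_cycle(6))  # 8:  Figure (b), NOPs replaced by the moved add/subtract
```

Rearranging the instructions saves exactly the two cycles that the no-ops wasted.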
Vector processing
• In many science and engineering applications, the problems can be formulated in terms of vectors
and matrices that lend themselves to vector processing.
• Vector processing is the process of using vectors to store a large number of variables for high-intensity data processing.
• Computers with vector processing capabilities are in demand in specialized applications, e.g.:
• Long-range weather forecasting
• Petroleum explorations
• Seismic data analysis
• Medical diagnosis
• Aerodynamics and space flight simulations
• Artificial intelligence and expert systems
• Mapping the human genome
• Image processing
Vector Processor (computer)
• Ability to process vectors, and related data structures such as matrices and multi-dimensional arrays,
much faster than conventional computers.
• Vector Processors may also be pipelined.
• To achieve the required level of high performance it is necessary to utilize the fastest and most
reliable hardware and apply innovative procedures from vector and parallel processing techniques.
Vector Operations
• Many scientific problems require arithmetic operations on large arrays of numbers.
• A vector is an ordered set of a one-dimensional array of data items.
• A vector V of length n is represented as a row vector by V = [V1, V2, …, Vn].
• To examine the difference between a conventional scalar processor and a vector processor, consider
the following Fortran DO loop:
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)

• This is implemented in machine language by the following sequence of operations:

    Initialize I = 0
20  Read A(I)
    Read B(I)
    Store C(I) = A(I) + B(I)
    Increment I = I + 1
    If I <= 100 go to 20
    Continue

Figure: Instruction format for a vector processor:
| Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length |
• A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop: the entire loop above is replaced by the single vector instruction C(1:100) = A(1:100) + B(1:100).
• A possible instruction format for a vector instruction is shown in Figure.
• This assumes that the vector operands reside in memory.
• It is also possible to design the processor with a large number of registers and store all operands in registers prior
to the addition operation.
• The base address and length in the vector instruction specify a group of CPU registers.
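A minimal sketch of what the single vector instruction computes (the function name and the plain-Python list representation are assumptions of the sketch; real hardware streams the operands through a pipelined adder instead of looping):

```python
def vector_add(src1, src2, length):
    """One 'vector instruction': C(1:length) = A(1:length) + B(1:length)."""
    return [src1[i] + src2[i] for i in range(length)]

A = list(range(100))          # A(1:100)
B = [2 * x for x in A]        # B(1:100)
C = vector_add(A, B, 100)     # replaces the whole scalar loop above
print(C[0], C[99])            # 0 297
```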
Matrix Multiplication
• Matrix multiplication is one of the most computational intensive operations performed in computers with
vector processors.
• The multiplication of two n x n matrices consists of n^2 inner products or n^3 multiply-add operations.
• Consider, for example, the multiplication of two 3 x 3 matrices A and B.
• For example, c11 = a11*b11 + a12*b21 + a13*b31, i.e. the inner product cij = Σ (k = 1 to 3) aik * bkj.
• This requires three multiplications and (after initializing c11 to 0) three additions.
• An n x m matrix of numbers has n rows and m columns and may be considered as constituting a set of n row vectors or a set of m column vectors.
• In general, the inner product consists of the sum of k product terms of the form C = A1B1 + A2B2 + A3B3 + … + AkBk.
• In a typical application k may be equal to 100 or even 1000.
• The inner product calculation on a pipeline vector processor is shown in Figure.
C = A1B1 + A5B5 + A9B9  + A13B13 + …
  + A2B2 + A6B6 + A10B10 + A14B14 + …
  + A3B3 + A7B7 + A11B11 + A15B15 + …
  + A4B4 + A8B8 + A12B12 + A16B16 + …
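The four-way interleaving above can be mimicked in software by accumulating four partial sums and combining them at the end. This is an illustrative sketch of the decomposition, not the hardware algorithm, and the function name is invented here:

```python
def pipelined_inner_product(A, B, segments=4):
    """Inner product accumulated in `segments` interleaved partial sums."""
    partial = [0] * segments
    for i, (a, b) in enumerate(zip(A, B)):
        partial[i % segments] += a * b   # product i feeds adder segment i mod 4
    return sum(partial)                  # combine the partial sums at the end

A = list(range(1, 9))                    # A1..A8
B = list(range(1, 9))                    # B1..B8
print(pipelined_inner_product(A, B))     # 204 = 1^2 + 2^2 + ... + 8^2
```

In the pipeline, each partial sum lives in one segment of the adder, so a new product can enter every clock cycle instead of waiting for the previous addition to finish.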
[Figure: memory organized into four modules M0, M1, M2, M3, each with its own address register (AR) and data register (DR), connected to a common data bus: an interleaved memory supplying source operands to the pipeline.]
• A commercial computer with vector instructions and pipelined floating-point arithmetic operations is referred to as a supercomputer.
• Supercomputers are very powerful, high-performance machines used mostly for scientific computations.
• To speed up the operations, the components are packed tightly together to minimize the distance that the
electronic signals have to travel.
• Supercomputers also use special techniques for removing heat from circuits to prevent them from burning up because of their close proximity.
• The instruction set of a supercomputer contains the standard data transfer, data manipulation, and program control instructions.
• A supercomputer is a computer system best known for its high computational speed, fast and large
memory systems, and the extensive use of parallel processing.
• The measure used to evaluate computers in their ability to perform a given number of floating-point
operations per second is referred to as flops.
• The term megaflops is used to denote million flops and gigaflops to denote billion flops.
• A typical supercomputer has a basic cycle time of 4 to 20 ns.
• If the processor can complete one floating-point operation through the pipeline every cycle, it will have the ability to perform 50 to 250 megaflops.
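A quick check of these peak rates (`peak_megaflops` is a throwaway helper for the arithmetic; it assumes the idealized one-result-per-cycle pipeline described above):

```python
def peak_megaflops(cycle_time_ns, ops_per_cycle=1):
    """Peak rate if the pipeline delivers ops_per_cycle results every cycle."""
    return ops_per_cycle * 1000 / cycle_time_ns   # 1 result/ns = 1000 megaflops

print(peak_megaflops(20))    # 50.0  (20 ns cycle time)
print(peak_megaflops(4))     # 250.0 (4 ns cycle time)
print(peak_megaflops(12.5))  # 80.0  (the Cray-1's 12.5 ns clock)
```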
• The first supercomputer, the Cray-1, was developed in 1976.
• It uses vector processing with 12 distinct functional units in parallel.
• Each functional unit is segmented to process the data through pipeline.
• A floating-point operation can be performed on two sets of 64-bit operands during one clock cycle of 12.5 ns.
• This gives a rate of 80 megaflops
• It has a memory capacity of 4 million 64-bit words.
• The memory is divided into 16 banks, with each bank having 50-ns access time.
• This means that when all 16 banks are accessed simultaneously, the memory transfer rate is 320
million words per second.
• Later versions are the Cray X-MP, Cray Y-MP, and Cray-2 (12 times more powerful than the Cray-1).
• Other supercomputers include the Fujitsu VP-200 and VP-2600, and the PARAM computers.
Array Processors