Unit 7 N

The document discusses parallel processing techniques, including Flynn's classification of computers, which categorizes systems based on instruction and data streams. It explains pipelining, a method to enhance processing speed by breaking tasks into sub-operations, and addresses pipeline hazards and their solutions. Additionally, it covers vector processing, array processors, and superscalar processors, highlighting their applications and advantages in handling complex computations efficiently.


Pipeline and Vector Processing

Parallel Processing, Flynn’s Classification of Computers


Parallel Processing
- Parallel processing is a technique used to carry out several data-processing tasks simultaneously, in order to increase the computational speed of a computer system.
- The system may have two or more ALUs and be able to execute two or more instructions at the same time.
- The purpose of parallel processing is to speed up the computer's processing capability and increase its throughput (the number of tasks completed during a given interval of time). This mode of operation is also called parallel computing.
Flynn’s Classification of Computers
- Flynn's classification divides computers into four groups based on the number of concurrent instruction streams and data streams manipulated by the computer system.
- The normal operation of a computer is to fetch instructions from memory and execute them in the processor.
- The sequence of instructions read from memory constitutes an instruction stream.
- The operations performed on the data in the processor constitute a data stream.
- Parallel processing may occur in the instruction stream, in the data stream, or in both.
- Flynn's four classes of computer are:
1. single instruction stream, single data stream (SISD)
2. single instruction stream, multiple data stream (SIMD)
3. multiple instruction stream, single data stream (MISD)
4. multiple instruction stream, multiple data stream (MIMD)
- SISD has a processor, a control unit, and a memory unit. There is a single processor
which executes a single instruction stream to operate on the data stored in a single
memory in this system. The parallel processing in this case may be achieved by means of
multiple functional units or pipeline processing.
- SIMD executes a single machine instruction on different set of data by different
processors. Each processing element has associated data memory.
- All processors receive the same instruction from the control unit but operate on different items of data.
- Applications of SIMD include vector and array processing.
- MISD has many functional units which perform different operations on the same data. It
is a theoretical model of computer.
- MIMD consists of a set of processors which simultaneously execute different instruction
sequences on the different set of data.
Pipelining
- Pipelining is a technique of decomposing a sequential process into sub operations, with
each sub process being executed in a special dedicated segment that operates
concurrently with all other segments.
- Each segment performs partial processing dictated by the way the task is partitioned.
- The result obtained from the computation in each segment is transferred to the next
segment in the pipeline.
- The final result is obtained after the data have passed through all segments.
- It is characteristic of pipelines that several computations can be in progress in distinct
segments at the same time.
- The registers provide isolation between each segment so that each can operate on distinct
data simultaneously. Suppose we want to perform the combined multiply and add
operations with a stream of numbers.
E.g. Ai * Bi + Ci for i = 1 , 2, 3,…….,7
The sub operations performed in each segment of the pipeline are as follows:
R1 ← Ai, R2 ← Bi        Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci   Multiply and input Ci
R5 ← R3 + R4            Add Ci to product

Fig: Example of pipeline processing
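The clock-by-clock behaviour of these three segments can be sketched in a toy Python simulation. The register names follow the text; the sample operand streams are invented for illustration:

```python
# Toy clock-by-clock simulation of the three-segment pipeline computing
# Ai * Bi + Ci. Segments are updated back to front so that each one
# consumes the values latched in the previous clock cycle.

A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [10, 20, 30, 40, 50, 60, 70]

R1 = R2 = R3 = R4 = R5 = None
results = []
n = len(A)

for clock in range(n + 2):            # n tasks + 2 cycles to drain the pipe
    # Segment 3: R5 <- R3 + R4
    if R3 is not None:
        R5 = R3 + R4
        results.append(R5)
    # Segment 2: R3 <- R1 * R2, R4 <- Ci
    if R1 is not None:
        R3, R4 = R1 * R2, C[clock - 1]
    else:
        R3 = R4 = None
    # Segment 1: R1 <- Ai, R2 <- Bi
    if clock < n:
        R1, R2 = A[clock], B[clock]
    else:
        R1 = R2 = None

print(results)   # one Ai*Bi + Ci result per clock once the pipe is full
```

Note that after the first two cycles fill the pipeline, one finished result emerges per clock cycle, which is exactly the behaviour the speedup equation below relies on.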


Speedup Equation
- Consider a K-segment pipeline with a clock cycle time tp used to execute n tasks. The first task T1 requires time K*tp to complete. The remaining (n - 1) tasks finish at the rate of one task per clock cycle and will be completed after time (n - 1)*tp.
- The total time to complete the n tasks is K*tp + (n - 1)*tp = (K + n - 1)*tp.
- Consider a non-pipelined unit that performs the same operation and takes tn time to complete each task. The total time required for n tasks would be n*tn.
- The speedup of pipeline processing over an equivalent non-pipelined processing is defined by the ratio

S = n*tn / [(K + n - 1)*tp]

For example: Calculate the pipeline speedup if the time taken to complete a task in a conventional machine is 25 ns, a task in the pipelined machine is divided into 5 segments, each sub-operation takes 4 ns, and 100 tasks are to be completed.
Solution: Here,
Time taken to complete a task (tn) = 25 ns
Number of segments (K) = 5
Clock cycle time (tp) = 4 ns
No. of tasks (n) = 100
S = n*tn / [(K + n - 1)*tp] = (100 * 25) / (104 * 4) = 2500 / 416 ≈ 6.01

Calculate the speedup of a 5-segment pipeline with a clock cycle time of 25 ns executing 100 tasks.
Solution: Here,
n = 100
Number of segments (K) = 5
Clock cycle time (tp) = 25 ns
tn = K*tp = 5 * 25 = 125 ns
S = n*tn / [(K + n - 1)*tp] = (100 * 125) / (104 * 25) = 12500 / 2600 ≈ 4.81
A non-pipelined system takes 100 ns to process a task. The same task can be processed in a six-segment pipeline whose segment delays include 20 ns, 25 ns, and 30 ns. Determine the speedup ratio of the pipeline for 100 tasks.
Solution: Here,
tn = 100 ns
K = 6
tp = 30 ns (the clock cycle must accommodate the slowest segment)
n = 100
S = n*tn / [(K + n - 1)*tp] = (100 * 100) / (105 * 30) = 10000 / 3150 ≈ 3.17

Suppose that the time delays of four segments are t1 = 60 ns, t2 = 70 ns, t3 = 100 ns, t4 = 80 ns, and each interface register has a delay of 10 ns. Determine the speedup ratio.
Solution: Here,
tp = 100 + 10 = 110 ns (slowest segment plus register delay)
tn = t1 + t2 + t3 + t4 = 60 + 70 + 100 + 80 = 320 ns
For a large number of tasks, the speedup approaches
S = tn / tp = 320 / 110 ≈ 2.91
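The worked examples above can be checked with a short helper. This is a direct sketch of the speedup formula; variable names follow the text:

```python
# Pipeline speedup formula used in the examples above:
#   S = (n * tn) / ((k + n - 1) * tp)

def speedup(n, k, tp, tn):
    """Speedup of a k-segment pipeline (clock cycle tp) over a
    non-pipelined unit (tn per task) when executing n tasks."""
    return (n * tn) / ((k + n - 1) * tp)

# The three worked examples that give a task count:
print(round(speedup(n=100, k=5, tp=4, tn=25), 2))        # ≈ 6.01
print(round(speedup(n=100, k=5, tp=25, tn=5 * 25), 2))   # ≈ 4.81
print(round(speedup(n=100, k=6, tp=30, tn=100), 2))      # ≈ 3.17
# The last example gives no task count; for large n, S approaches tn / tp:
print(round(320 / 110, 2))                               # ≈ 2.91
```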

Arithmetic Pipeline
- Pipeline arithmetic units are usually found in very high speed computers. They are used
to implement floating point operations, multiplication of fixed point numbers, and similar
computations encountered in scientific problems.
- We will now show an example of a pipeline unit for floating point addition and
subtraction. The inputs to the floating point adder pipeline are two normalized floating
point binary numbers.
X = A × 2^a
Y = B × 2^b
- The floating point addition and subtraction can be performed in four segments, as shown
in figure. The registers labeled R are placed between the segments to store intermediate
results.
- The sub operations that are performed in the four segments are
i. Compare the exponents.
ii. Align the mantissas.
iii. Add or subtract the mantissas.
iv. Normalize the result.
- Procedure: The exponents are compared by subtracting them to determine their
difference. The larger exponent is chosen as the exponent of the result. The exponent
difference determines how many times the mantissa associated with the smaller exponent
must be shifted to the right. This produces an alignment of the two mantissas. It should be
noted that the shift must be designed as a combinational circuit to reduce the shift time.
The two mantissas are added or subtracted in segment 3. The result is normalized in segment 4. When an overflow occurs, the mantissa of the sum or difference is shifted right and the exponent is incremented by one.
- For simplicity, we use decimal numbers. Consider the two normalized floating-point numbers:
X = 0.9504 × 10^3
Y = 0.8200 × 10^2
- The two exponents are subtracted in the first segment to obtain 3 - 2 = 1. The larger exponent 3 is chosen as the exponent of the result. The next segment shifts the mantissa of Y to the right to obtain
X = 0.9504 × 10^3
Y = 0.0820 × 10^3
- This aligns the two mantissas under the same exponent. The addition of the two mantissas in segment 3 produces the sum
Z = 1.0324 × 10^3
- Segment 4 normalizes the result by shifting the mantissa once to the right and incrementing the exponent, giving
Z = 0.10324 × 10^4
Instruction Pipeline
- Pipeline processing can occur not only in the data stream but in the instruction stream as well.
- An instruction pipeline reads consecutive instructions from memory while previous
instructions are being executed in other segments.
- This causes the instruction fetch and execute phases to overlap and perform simultaneous
operations. This technique is called instruction pipelining.
Consider subdividing instruction processing into two phases:
1. fetch instruction
2. execute instruction
- In the most general case, the computer needs to process each instruction with the following
sequence of steps:
i. Fetch the instruction from memory.
ii. Decode the instruction.
iii. Calculate the effective address.
iv. Fetch the operands from memory.
v. Execute the instruction.
vi. Store the result in the proper place.
To gain further speedup, the pipeline must have more stages. Consider the following decomposition of instruction processing:
a) Fetch instruction (FI)
b) Decode instruction (DI)
c) Fetch operands (FO)
d) Execute instruction (EI)
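With no hazards, instruction i simply enters each successive stage one clock cycle after the previous instruction. A minimal sketch of the resulting space-time table:

```python
# Space-time table for the four-stage pipeline above: with no hazards,
# instruction i (0-based) occupies stage s during clock cycle i + s + 1.

STAGES = ["FI", "DI", "FO", "EI"]

def timetable(num_instructions):
    # One row per instruction, mapping each stage to its clock cycle.
    return [{stage: i + s + 1 for s, stage in enumerate(STAGES)}
            for i in range(num_instructions)]

for i, row in enumerate(timetable(4), start=1):
    print(f"I{i}: {row}")
# n instructions complete after n + 4 - 1 cycles (cf. the speedup formula):
print(timetable(4)[-1]["EI"])   # 7
```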

Pipeline Hazards and their Solution


Hazards
1. resource conflicts
- If one common memory is used for both data and instructions, and there is a need to read/write data and fetch an instruction at the same time, a resource conflict occurs.
2. data dependency conflict
- Data dependency conflicts arise when an instruction depends on the result of a previous instruction, but the result of that instruction is not yet available.
3. branch difficulties
- Branch difficulties arise from program control instructions that may change the value of
PC.
- A program is not a straight flow of sequential instructions. Branch instructions can interrupt the normal flow of the program, which delays pipelined execution.
Solution of pipeline hazards
1. Resource conflicts can be solved by using separate instruction and data memory.
2. To solve the data dependency conflict, we have the following methods:
i. hardware interlock
- A hardware interlock is a circuit that detects data dependencies. When it detects that an instruction needs data from an instruction that is still being executed, the interlock delays the dependent instruction until the data becomes available.
ii. operand forwarding
- Operand forwarding uses special hardware to detect a conflict and then avoids it by routing the data through special paths between pipeline segments.
iii. delayed load
- Delayed load is a procedure that gives the responsibility for solving data conflicts to the compiler. The compiler is designed to detect conflicts and reorder the instructions as necessary, delaying the loading of the conflicting data by inserting no-operation instructions.
3. Solution to branch difficulties
i. prefetch target instruction
- Both the branch target instruction and the instruction following the branch are prefetched and saved until the branch instruction is executed. If the branch is taken, execution continues with the prefetched branch target instruction.
ii. branch target buffer (BTB)
- A branch target buffer is an associative memory included in the fetch segment of the pipeline that stores the target instruction of previously executed branches, along with the next few instructions after the branch target. This way, branches that have occurred previously can be fed into the pipeline without interruption.
iii. loop buffer
- The loop buffer is similar to the BTB but faster. Program loops are stored in the loop buffer, so a loop can be executed directly without accessing memory.
iv. branch prediction
- Special hardware is used to predict the outcome of a conditional branch instruction. On the basis of the prediction, the instructions are prefetched.
v. delayed branch
- The compiler detects branch instructions and rearranges the instruction sequence to fill the branch delay, inserting no-operation instructions where necessary.
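The compiler-based techniques above (delayed load, delayed branch) can be illustrated with a toy compiler pass that inserts a NOP whenever an instruction immediately uses the result of the preceding LOAD. The instruction encoding here is invented for illustration:

```python
# Minimal sketch of delayed load: when an instruction reads a register
# that the immediately preceding LOAD writes, insert a NOP so the load
# completes before the value is used.

def insert_delays(program):
    out = []
    for instr in program:
        op, dest, *sources = instr
        if out:
            prev_op, prev_dest, *_ = out[-1]
            if prev_op == "LOAD" and prev_dest in sources:
                out.append(("NOP", None))     # fill the load delay slot
        out.append(instr)
    return out

prog = [("LOAD", "R1", "A"),
        ("LOAD", "R2", "B"),
        ("ADD", "R3", "R1", "R2"),    # needs R2 from the previous LOAD
        ("STORE", "C", "R3")]
for instr in insert_delays(prog):
    print(instr)
```

A real compiler would first try to move an independent instruction into the delay slot and fall back to a NOP only when none is available; this sketch always uses the NOP.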
What is Vector (Array) Processing?
- In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items.
- Vector processing operates on an entire array in just one operation, i.e. it operates on the elements of the array in parallel. Vector processing is possible only if the operations performed in parallel are independent.

Here, V represents the vector operands and S the scalar operands. In the figure, O1 and O2 are unary operations and O3 and O4 are binary operations.

Vector Instruction

A vector instruction has the following fields:
1. Operation Code
2. Base Address
3. Address Increment
4. Address Offset
5. Vector Length
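These fields can be sketched as a record plus a tiny interpreter over a flat memory. The field meanings assumed here (e.g. that the offset locates the second operand vector) are our own illustrative interpretation:

```python
# A vector instruction with the fields listed above, executed over a
# flat memory modelled as a Python list.

from dataclasses import dataclass

@dataclass
class VectorInstruction:
    opcode: str       # operation to perform
    base: int         # base address of the first operand vector
    increment: int    # stride between successive elements
    offset: int       # distance from one operand vector to the other
    length: int       # number of elements to process

def execute(instr, memory):
    result = []
    for i in range(instr.length):
        addr = instr.base + i * instr.increment
        a, b = memory[addr], memory[addr + instr.offset]
        if instr.opcode == "VADD":
            result.append(a + b)
    return result

mem = [1, 2, 3, 4, 10, 20, 30, 40]
vadd = VectorInstruction("VADD", base=0, increment=1, offset=4, length=4)
print(execute(vadd, mem))   # [11, 22, 33, 44]
```

The point of the encoding is that one fetched and decoded instruction describes the whole loop: the hardware supplies the iteration, so there is no per-element fetch/decode overhead.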

- There is a class of computational problems that are beyond the capabilities of a conventional computer.
- These problems require a vast number of computations on multiple data items and would take a conventional computer (with a scalar processor) days or even weeks to complete.
- Such complex computations, which operate on multiple data items at the same time, require a better way of instruction execution, which was achieved by vector processors.
- Scalar CPUs can manipulate only one or two data items at a time, which is not very efficient. Issuing a simple instruction such as "ADD A to B and store into C" for every element is not practical for large data sets.
- Hence, the concept of the instruction pipeline comes into the picture, in which an instruction passes through several sub-units in turn.
- These sub-units perform various independent functions, for example: the first one
decodes the instruction, the second sub-unit fetches the data and the third sub-unit
performs the math itself.
- A vector processor not only uses an instruction pipeline but also pipelines the data, working on multiple data items at the same time.
- A normal scalar processor instruction would be ADD A, B, which adds two operands. But what if we could instruct the processor to ADD a group of numbers (from memory locations 0 to n) to another group of numbers (say, locations n to k)? This can be achieved by vector processors.
- In a vector processor, a single instruction can request multiple data operations, which saves time: the instruction is decoded once and then operates on a stream of data items.
Applications of Vector Processors
Computer with vector processing capabilities are in demand in specialized applications. The
following are some areas where vector processing is used:
1. Petroleum exploration.
2. Medical diagnosis.
3. Data analysis.
4. Weather forecasting.
5. Aerodynamics and space flight simulations.
6. Image processing.
7. Artificial intelligence.

Superscalar Processors
The first superscalar processors appeared in 1987. A superscalar machine is designed to improve the performance of the scalar processor; since in most applications the majority of operations are on scalar quantities, the
Superscalar approach produces the high performance general purpose processors.
The main principle of the superscalar approach is that it executes instructions independently in different pipelines. As we already know, instruction pipelining leads to parallel processing, thereby speeding up the processing of instructions. In a superscalar processor, multiple such pipelines are introduced for different operations, which further improves parallel processing.
There are multiple functional units each of which is implemented as a pipeline. Each pipeline
consists of multiple stages to handle multiple instructions at a time which support parallel
execution of instructions.
It increases the throughput because the CPU can execute multiple instructions per clock cycle.
Thus, superscalar processors are much faster than scalar processors.
A scalar processor works on one or two data items, while the vector processor works with
multiple data items. A superscalar processor is a combination of both. Each instruction
processes one data item, but there are multiple execution units within each CPU thus multiple
instructions can be processing separate data items concurrently.
Although a superscalar CPU is typically also pipelined, pipelining and superscalar execution are two different performance enhancement techniques: it is possible to have a non-pipelined superscalar CPU or a pipelined non-superscalar CPU. The superscalar technique is associated with the following characteristics:

1. Instructions are issued from a sequential instruction stream.
2. The CPU must dynamically check for data dependencies.
3. The CPU should accept multiple instructions per clock cycle.
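The idea of issuing multiple instructions per cycle, subject to a dynamic dependency check, can be sketched in a toy in-order issue model. The instruction encoding (destination register plus source registers) is invented for illustration:

```python
# Toy superscalar issue model: up to `width` instructions issue per cycle,
# but an instruction stalls if it depends on a result not yet produced
# (the dynamic data-dependency check from characteristic 2 above).

def cycles_to_issue(program, width):
    ready = set()      # registers whose results are available
    cycle = 0
    i = 0
    while i < len(program):
        issued_this_cycle = []
        while i < len(program) and len(issued_this_cycle) < width:
            dest, sources = program[i]
            if all(s in ready for s in sources):
                issued_this_cycle.append(dest)
                i += 1
            else:
                break   # in-order issue: stop at the first stalled instr
        ready.update(issued_this_cycle)   # results visible next cycle
        cycle += 1
    return cycle

# Four independent instructions: a 2-wide superscalar needs only 2 cycles...
independent = [("R1", []), ("R2", []), ("R3", []), ("R4", [])]
print(cycles_to_issue(independent, width=2))   # 2
# ...but a dependency chain still takes one cycle per instruction.
chain = [("R1", []), ("R2", ["R1"]), ("R3", ["R2"])]
print(cycles_to_issue(chain, width=2))         # 3
```

The contrast between the two runs shows why superscalar hardware must check dependencies dynamically: the issue width only pays off when independent instructions are available.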

Vector (Array) Processor and its Types

Array processors are also known as multiprocessors or vector processors. They perform computations on large arrays of data and are thus used to improve the performance of the computer.

Types of Array Processors

There are basically two types of array processors:
1. Attached Array Processors
2. SIMD Array Processors

Attached Array Processors
An attached array processor is a processor which is attached to a general purpose computer and
its purpose is to enhance and improve the performance of that computer in numerical
computational tasks. It achieves high performance by means of parallel processing with multiple
functional units.
SIMD Array Processors
SIMD is the organization of a single computer containing multiple processors operating in
parallel. The processing units are made to operate under the control of a common control unit,
thus providing a single instruction stream and multiple data streams.
A general block diagram of an array processor is shown below. It contains a set of identical processing elements (PEs), each of which has a local memory M. Each processing element includes an ALU and registers. The master control unit controls all the operations of the processing elements. It also decodes the instructions and determines how each instruction is to be executed.
The main memory is used for storing the program, and the control unit is responsible for fetching the instructions. Vector instructions are sent to all PEs simultaneously, and the results are returned to memory.
The best-known SIMD array processor is the ILLIAC IV computer developed by the Burroughs Corporation. SIMD processors are highly specialized computers: they are suitable only for numerical problems that can be expressed in vector or matrix form, not for other types of computations.
Why use the Array Processor

- Array processors increase the overall instruction processing speed.
- Most array processors operate asynchronously from the host CPU, which improves the overall capacity of the system.
- Array processors have their own local memory, providing extra memory for systems with low memory.
