COA Unit V B
A data hazard occurs when the current instruction requires the result of a preceding instruction, but the preceding
instruction has not yet passed through enough pipeline segments to compute the result and write it back to the
register file by the time the current instruction reads its operands from the register file.
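The definition above can be illustrated with a small Python sketch that scans a list of instructions for read-after-write dependences. The instruction format (a destination register and a tuple of source registers) and the `window` parameter, which stands for how many following instructions would read the register file before the result is written back, are assumptions made for this example and do not describe any particular pipeline.

```python
# Hypothetical instruction format: (destination register, source registers).
def find_raw_hazards(instructions, window=2):
    """Return (producer, consumer, register) triples in which the consumer
    reads a register written by an instruction at most `window` positions
    earlier, i.e. possibly before the result has been written back."""
    hazards = []
    for j, (_, sources) in enumerate(instructions):
        for i in range(max(0, j - window), j):
            dest = instructions[i][0]
            if dest in sources:
                hazards.append((i, j, dest))
    return hazards


program = [
    ("r1", ("r2", "r3")),  # r1 = r2 + r3
    ("r4", ("r1", "r5")),  # r4 = r1 - r5, needs r1 right away
    ("r6", ("r7", "r8")),  # independent of the two above
]
hazards = find_raw_hazards(program)
```

Here the second instruction depends on the first, so it is the only pair reported. A real pipeline would resolve such a pair by stalling or by forwarding the result; the sketch only detects it.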
Vector Processors
There are two essentially different models of parallel computers: vector processors and multiprocessors. A vector
processor is simply a machine that has an instruction that can operate on a vector. A pipelined vector processor is a
vector processor that can issue a vector instruction that operates on all of the elements of the vector in parallel by
sending those elements through a highly pipelined functional unit with a fast clock. A processor array is a vector
processor that achieves the parallelism by having a collection of identical, synchronized processing elements (PEs),
all controlled by a single control unit, each of which executes the same instruction on different data. Every
PE has a unique identifier, its processor id, which can be used during the computation. The control unit, which
might be a full-fledged CPU, broadcasts the instruction to be executed to the processing elements, which execute it
on data from a memory that is usually local to each, and can store the result in their local memories, or can return
global results back to the CPU. A global result line is usually a separate, parallel bus that allows each PE to
transmit values back to the CPU to be combined by a parallel, global operation, such as a logical-and or a logical-
or, depending upon the hardware support in the CPU.
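As a rough illustration of the processor array model, the following Python sketch simulates the control unit broadcasting one instruction to every PE and then combining per-PE flags with a global logical-or or logical-and. The function names and the use of a list index as the processor id are assumptions of the sketch, and the PEs are simulated one after another rather than in lockstep hardware.

```python
def broadcast(local_memories, instruction):
    """Control unit: send the same instruction to every PE. Each PE applies
    it to the value in its own local memory; the list index acts as the
    PE's processor id."""
    return [instruction(pe_id, value)
            for pe_id, value in enumerate(local_memories)]


def global_or(flags):
    """Combine one flag per PE as a global result line would (logical-or)."""
    return any(flags)


def global_and(flags):
    """Combine one flag per PE with a logical-and."""
    return all(flags)


local = [3, 8, 1, 6]                                  # one value per PE
local = broadcast(local, lambda pe_id, v: v + pe_id)  # uses the processor id
flags = broadcast(local, lambda pe_id, v: v > 8)      # per-PE test
any_large = global_or(flags)
all_large = global_and(flags)
```

Each call to `broadcast` corresponds to one instruction issued by the control unit, and the two global operations reduce the per-PE flags to a single value returned to the CPU.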
Because all PEs execute the same instruction at the same time, this type of architecture is suited to problems with
data parallelism. Data parallelism is a type of parallelism that is characterized by the ability to perform the same
operation on different data simultaneously. For example, a loop of the form
for i = 0 to N-1
do a[i] = a[i] + 1;
has data parallelism because the updates to the distinct array elements a[i] are independent of each other and may
be performed in parallel, whereas the loop
for i = 1 to N-1
do a[i] = a[i-1] + 1;
has no data parallelism because the update to a[i] cannot be performed until the update to a[i-1] has completed. In the
first loop, if N does not exceed the number of processing elements, the entire loop takes the same amount of time as a
single processor takes to perform the increment on a scalar variable. If N is larger, then the work has to be distributed
to the PEs so that they each update the values of several array elements. This may be handled by the hardware, by a
runtime library, or by the programmer, depending on the particular architecture and software.
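When N exceeds the number of PEs, one common way to distribute the work is a block distribution, in which each PE receives a contiguous range of indices. The Python sketch below shows this for the data-parallel loop; the function names are hypothetical, and the outer loop over blocks runs sequentially here, standing in for blocks that would execute on separate PEs at the same time.

```python
def block_ranges(n, num_pes):
    """Split indices 0..n-1 into num_pes contiguous blocks whose sizes
    differ by at most one. Blocks may be empty when n < num_pes."""
    base, extra = divmod(n, num_pes)
    ranges = []
    start = 0
    for pe in range(num_pes):
        size = base + (1 if pe < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def parallel_increment(a, num_pes):
    """Simulate a[i] = a[i] + 1 with the indices divided among the PEs."""
    result = list(a)
    for pe_range in block_ranges(len(a), num_pes):  # one block per PE
        for i in pe_range:                          # work done by that PE
            result[i] = a[i] + 1
    return result


blocks = [list(r) for r in block_ranges(10, 4)]
incremented = parallel_increment([1, 2, 3], 4)
```

Because every element is updated independently, the blocks can be processed in any order. The same scheme would give wrong results for the second loop, where a[i] depends on a[i-1].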
The VMAX (vector maximum) instruction, which finds the maximum scalar quantity among all the components of a vector,
is an f2 operation. The pipelined implementation of the f2 operation is shown in figure 2.
The VMPL (vector multiply) instruction, which multiplies the respective scalar components of two vector operands and
produces a product vector, is an f3 operation. The pipelined implementation of the f3 operation is shown in figure 3.
The SVP (scalar vector product) instruction, which multiplies each component of a vector by one constant value, is an
f4 operation. The pipelined implementation of the f4 operation is shown in figure 4.
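The three instructions can be sketched in Python as ordinary functions on lists. Judging from the examples in the text, f2 appears to map a vector to a scalar, f3 two vectors to a vector, and f4 a scalar and a vector to a vector; that reading, and the lower-case function names, are assumptions of this sketch. The code shows only what each operation computes, not how the pipelines in the figures compute it.

```python
def vmax(v):
    """f2-style operation: vector -> scalar (maximum component)."""
    if not v:
        raise ValueError("vmax needs a non-empty vector")
    result = v[0]
    for x in v[1:]:
        if x > result:
            result = x
    return result


def vmpl(a, b):
    """f3-style operation: vector x vector -> vector (componentwise product)."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return [x * y for x, y in zip(a, b)]


def svp(s, v):
    """f4-style operation: scalar x vector -> vector."""
    return [s * x for x in v]


largest = vmax([4, 9, 2, 7])
products = vmpl([1, 2, 3], [4, 5, 6])
scaled = svp(3, [1, 2, 3])
```

Note that `vmax` carries a running result from one component to the next, whereas `vmpl` and `svp` treat every component independently.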
2) Because most array processors operate asynchronously from the host CPU, they improve the overall capacity of the
system.
3) An array processor has its own local memory, which provides extra memory for systems with little memory.
4) The AP (array processor) is most efficient at repetitive operations such as computing FFTs and multiplying
large vectors. Its efficiency degrades for non-repetitive operations, or operations requiring a great number of
decisions based on the results of computations.
5) Since APs have their own program and data memory, the AP instructions and data must be transferred to the AP, and
the results transferred back from it. These I/O operations may cost more CPU time than the amount saved by using
the array processor.
6) As a general rule, the AP is more efficient than the CPU when multiple or complex operations (such as FFTs),
which are highly repetitious, are to be performed on a relatively large amount of data (thousands of words or more).
In other cases the AP will not help much and will keep other processes from using a valuable resource.
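The trade-off in points 5 and 6 can be expressed as a simple cost comparison. The Python sketch below uses a linear cost model that is purely an assumption for illustration: a fixed setup overhead plus per-word transfer and compute times for the AP, against a per-word compute time for the CPU. All parameter names and the numbers in the usage lines are hypothetical and are not measurements of any real array processor.

```python
def offload_pays_off(n_words, cpu_time_per_word, ap_time_per_word,
                     transfer_time_per_word, fixed_overhead):
    """Return True when the assumed total AP cost (setup, transfer, compute)
    is lower than computing all n_words on the host CPU."""
    cpu_total = n_words * cpu_time_per_word
    ap_total = fixed_overhead + n_words * (transfer_time_per_word
                                           + ap_time_per_word)
    return ap_total < cpu_total


# Hypothetical costs in arbitrary time units.
small_job = offload_pays_off(10, 1.0, 0.1, 0.2, 50.0)
large_job = offload_pays_off(10_000, 1.0, 0.1, 0.2, 50.0)
```

Under these made-up costs the fixed overhead dominates for the small job, so offloading does not pay off, while for the large job the per-word savings outweigh it.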
Cache Coherence: In a single-CPU system, two copies of the same data, one in the cache and the other in main
memory, may become different. This data inconsistency is called the cache coherence problem.
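One way the two copies can diverge is a write that updates only the cached copy. The Python sketch below models this with a write-back policy, in which main memory is updated only when the cache is flushed. The write-back policy and the class and method names are assumptions made for illustration; the text does not say which policy it has in mind.

```python
class WriteBackCache:
    """Toy single-CPU cache: writes go to the cache only, and main memory
    is updated when flush() is called."""

    def __init__(self, memory):
        self.memory = memory   # main memory: address -> value
        self.lines = {}        # cached copies: address -> value
        self.dirty = set()     # addresses modified in the cache only

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)                # main memory is now out of date

    def flush(self):
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()


memory = {0x10: 5}
cache = WriteBackCache(memory)
cache.read(0x10)
cache.write(0x10, 9)
stale = memory[0x10]       # main memory still holds the old value
cached = cache.read(0x10)  # the cache holds the new value
cache.flush()
fresh = memory[0x10]       # the copies agree again after the flush
```

Between the write and the flush, anything that reads main memory directly, such as an I/O device, would see the stale value.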