
COA Unit V B

The document discusses data hazards in pipelined processors, outlining methods to resolve them such as forwarding, code re-ordering, and stall insertion. It also explains vector processors, their architectures, and instruction types, highlighting the differences between memory-to-memory and register-to-register architectures. Additionally, it covers array processors, their types, and advantages, emphasizing their efficiency in handling repetitive operations and the cache coherence problem in multi-processor systems.

Uploaded by

Rene Dev
Copyright
© © All Rights Reserved

Data Hazards

A data hazard occurs when the current instruction requires the result of a preceding instruction, but that
instruction has not yet advanced through enough pipeline segments to compute the result and write it back to the
register file by the time the current instruction reads the register file.

We typically remedy this problem in one of three ways:

• Forwarding: To resolve a dependency, one adds special circuitry to the pipeline, composed of wires and
switches, that forwards the desired value directly to the pipeline segment that needs it for computation.
Although this adds hardware and control circuitry, the method works because it takes far less time for the
required value(s) to travel through a wire than it does for a pipeline segment to compute its result.
• Code Re-Ordering: Here, the compiler reorders statements in the source code, or the assembler reorders
object code, to place one or more independent statements between the instruction that computes the required
operand and the current instruction that uses it. This requires an "intelligent" compiler or assembler, which
must have detailed information about the structure and timing of the pipeline on which the data hazard
would occur. We call this type of software a hardware-dependent compiler.
• Stall Insertion: It is possible to insert one or more stalls (no-op instructions) into the pipeline, which delays
the execution of the current instruction until the required operand is written to the register file. This
decreases pipeline efficiency and throughput, which is contrary to the goals of pipeline processor design.
Stalls are an expedient of last resort, used when forwarding or compiler re-ordering cannot remove the hazard
or is not supported by the hardware or software design.
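
The trade-off between the three remedies can be sketched with a small stall calculator. This is only an illustration under stated assumptions, none of which come from the notes above: a classic five-segment pipeline (fetch, decode, execute, memory, write-back), a register file that is written in the first half of a cycle and read in the second half, and a hypothetical function name `stalls_needed`.

```python
def stalls_needed(distance, forwarding, producer_is_load=False):
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    Assumed model (not from the notes): five segments IF, ID, EX, MEM, WB;
    the register file is written in the first half of a cycle and read in
    the second half.
    """
    if forwarding:
        # An ALU result can be forwarded to the very next instruction;
        # a loaded value only becomes available one segment later.
        ready_gap = 2 if producer_is_load else 1
    else:
        # The consumer's decode must wait for the producer's write-back.
        ready_gap = 3
    return max(0, ready_gap - distance)


# Back-to-back dependent instructions without forwarding: two stalls.
print(stalls_needed(distance=1, forwarding=False))
# Same pair with forwarding hardware: no stall.
print(stalls_needed(distance=1, forwarding=True))
# Code re-ordering that places two independent instructions in between
# (distance 3) removes the stalls even without forwarding.
print(stalls_needed(distance=3, forwarding=False))
```

In this model, re-ordering works by increasing `distance`, forwarding works by shrinking the gap that must be covered, and stall insertion pays for whatever gap remains.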

Vector Processors

There are two essentially different models of parallel computers: vector processors and multiprocessors. A vector
processor is simply a machine that has an instruction that can operate on a vector. A pipelined vector processor is a
vector processor that can issue a vector instruction that operates on all of the elements of the vector in parallel by
sending those elements through a highly pipelined functional unit with a fast clock. A processor array is a vector
processor that achieves the parallelism by having a collection of identical, synchronized processing elements (PEs),
controlled by a single control unit, each of which executes the same instruction on different data. Every
PE has a unique identifier, its processor id, which can be used during the computation. The control unit, which
might be a full-fledged CPU, broadcasts the instruction to be executed to the processing elements, which execute it
on data from a memory that is usually local to each PE, and which can store the result in their local memories or
return global results back to the CPU. A global result line is usually a separate, parallel bus that allows each PE to
transmit values back to the CPU to be combined by a parallel, global operation, such as a logical-and or a
logical-or, depending upon the hardware support in the CPU.
Because all PEs execute the same instruction at the same time, this type of architecture is suited to problems with
data parallelism. Data parallelism is a type of parallelism that is characterized by the ability to perform the same
operation on different data simultaneously. For example, a loop of the form
for i = 0 to N-1
do a[i] = a[i] + 1;
has data parallelism because the updates to the distinct array elements a[i] are independent of each other and may
be performed in parallel, whereas the loop
for i = 1 to N-1
do a[i] = a[i-1] + 1;
has no data parallelism because the update to a[i] cannot be performed until the update to a[i-1] has completed. In
the first loop, if the value of N is no larger than the number of processing elements, the entire loop takes the same
amount of time as a single processor takes to perform the increment on a scalar variable. If the value of N is larger,
then the work has to be distributed to the PEs so that they each update the values of several array elements. This
may be handled by the hardware, by a runtime library, or by the programmer, depending on the particular
architecture and software.
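
The two loops and the distribution of work can be sketched in Python. The function names and the contiguous block partitioning scheme are assumptions chosen for illustration; a real machine or runtime library may distribute elements differently.

```python
def increment_all(a):
    # Each new element depends only on the matching old element, so every
    # update could be performed by a different PE at the same time.
    return [x + 1 for x in a]


def running_chain(a):
    # Each element depends on the element just updated before it
    # (a loop-carried dependence), so the updates must run in order.
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + 1
    return out


def block_partition(n, num_pes):
    """Split indices 0..n-1 into contiguous blocks, one block per PE."""
    base, extra = divmod(n, num_pes)
    blocks, start = [], 0
    for pe in range(num_pes):
        size = base + (1 if pe < extra else 0)  # first `extra` PEs get one more
        blocks.append(range(start, start + size))
        start += size
    return blocks


# Ten array elements spread over four PEs.
for pe_id, block in enumerate(block_partition(10, 4)):
    print(pe_id, list(block))
```

With N = 10 and four PEs, each PE updates two or three elements, so the first loop needs roughly three steps instead of ten.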

Vector processor classification


Pipelined vector computers are classified into two architectural configurations, according to where the operands
of a vector instruction are retrieved from:
1. Memory to memory architecture –
In memory to memory architecture, source operands are read directly from the main memory, and intermediate and
final results are written directly back to it. For memory to memory vector instructions, the base address, the offset,
the increment, and the vector length must be specified in order to enable streams of data transfers between the
main memory and pipelines.
The main points about memory to memory architecture are:
• There is no limitation on vector size.
• Speed is comparatively slow in this architecture.
2. Register to register architecture –
In register to register architecture, operands and results are retrieved indirectly from the main memory
through the use of a large number of vector registers or scalar registers. Processors such as the Cray-1 and the
Fujitsu VP-200 use vector instructions in register to register formats. The main points about register to register
architecture are:
• The size of a vector operand is limited by the size of the vector registers.
• Speed is very high as compared to the memory to memory architecture.
• The hardware cost is high in this architecture.
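
The difference between the two configurations can be sketched as follows. This is an illustrative model only: memory is a Python list, the function names are hypothetical, and the register length `VLEN = 64` is an assumption for the example (long vectors are processed in register-sized strips).

```python
VLEN = 64  # assumed number of elements in one vector register


def add_memory_to_memory(memory, src_a, src_b, dest, length):
    # One instruction streams operands from memory and results back to
    # memory, so the vector length is not limited by any register size.
    for i in range(length):
        memory[dest + i] = memory[src_a + i] + memory[src_b + i]


def add_register_to_register(memory, src_a, src_b, dest, length):
    # The vector is processed in strips of at most VLEN elements:
    # load two vector registers, add them, store the result register.
    strips = 0
    for start in range(0, length, VLEN):
        n = min(VLEN, length - start)
        v1 = memory[src_a + start:src_a + start + n]      # vector load
        v2 = memory[src_b + start:src_b + start + n]      # vector load
        v3 = [x + y for x, y in zip(v1, v2)]              # register add
        memory[dest + start:dest + start + n] = v3        # vector store
        strips += 1
    return strips


# Two 150-element vectors at addresses 0 and 150, result at address 300.
memory = list(range(150)) + [1] * 150 + [0] * 150
print(add_register_to_register(memory, 0, 150, 300, 150))  # strips used
```

Both functions compute the same result; the register version simply needs several passes when the vector is longer than a register.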
A block diagram of a modern multiple pipeline vector computer is shown below:

Vector instruction types


A vector operand contains an ordered set of n elements, where n is called the length of the vector. All elements in
a vector are scalar quantities of the same type, which may be floating-point numbers, integers, logical values, or
characters.

Four primitive types of vector instructions are:


f1 : V --> V
f2 : V --> S
f3 : V x V --> V
f4 : V x S --> V
where V and S denote a vector operand and a scalar operand, respectively. The instructions f1 and f2 are unary
operations, and f3 and f4 are binary operations. The VCOM (vector complement), which complements each
component of the vector, is an f1 operation. The pipelined implementation of the f1 operation is shown in
figure 1.

The VMAX (vector maximum), which finds the maximum scalar quantity among all the components of the vector,
is an f2 operation. The pipelined implementation of the f2 operation is shown in figure 2.

The VMPL (vector multiply), which multiplies the respective scalar components of two vector operands and
produces a product vector, is an f3 operation. The pipelined implementation of the f3 operation is shown in
figure 3.

The SVP (scalar vector product), which multiplies each component of the vector by one constant value, is an f4
operation. The pipelined implementation of the f4 operation is shown in figure 4.
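
The four primitive types can be mimicked with ordinary Python functions. This is a sketch of the mappings only, not of the pipelined hardware; the lowercase function names and the 8-bit word width used by the complement are assumptions for the example.

```python
def vcom(v, bits=8):
    """f1 : V --> V. Bitwise complement of each component (assumed 8-bit words)."""
    mask = (1 << bits) - 1
    return [~x & mask for x in v]


def vmax(v):
    """f2 : V --> S. Largest component of the vector."""
    return max(v)


def vmpl(a, b):
    """f3 : V x V --> V. Component-wise product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vector operands must have the same length")
    return [x * y for x, y in zip(a, b)]


def svp(s, v):
    """f4 : V x S --> V. Every component multiplied by the scalar s."""
    return [s * x for x in v]


print(vcom([0, 255, 10]))
print(vmax([3, 9, 2]))
print(vmpl([1, 2, 3], [4, 5, 6]))
print(svp(2, [1, 2, 3]))
```

The return types make the classification visible: `vmax` is the only one that reduces a vector to a scalar, and only `vmpl` and `svp` take two operands.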

Vector Instruction Format in Vector Processors


Different vector processors use different instruction formats. Vector instructions are generally specified by
a set of fields. The main fields used in a vector instruction are given below.
1. Operation Code (Opcode) –
The operation code must be specified to select the functional unit, or to reconfigure a multifunctional unit, to
perform the operation dictated by this field. Usually, microcode control is used to set up the required
resources.
For example:
Opcode – 0001 mnemonic – ADD operation
Opcode – 0010 mnemonic – SUB operation
Opcode – 1111 mnemonic – HLT operation
2. Base addresses –
For a memory reference instruction, the base addresses are needed for both the source operand vectors and the
result vectors. If the operands and results are located in the vector register file (i.e., a collection of registers),
the designated vector registers must be specified in the instruction.
For example:
ADD R1, R2
Here, R1 and R2 designate the registers that hold the operands.
3. Offset (or Displacement) –
This field is required to get the effective memory address of the operand vector. The address offset relative to the
base address should be specified. Using the base address and the offset (positive or negative), the effective address
is calculated.
4. Address Increment –
The address increment between the scalar elements of a vector operand must be specified. In some computers, the
increment is always 1.
For example:
R1 <- 400
With auto-increment, the value of R1 is incremented by 1.
R1 = 401
5. Vector length –
The vector length (a positive integer) is needed to determine the termination of a vector instruction.
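
How the base address, offset, address increment, and vector length combine can be sketched as a short address generator. The function name and the sample values are hypothetical; the arithmetic follows the field descriptions above.

```python
def element_addresses(base, offset, increment, length):
    """Memory addresses of the scalar elements of one vector operand."""
    start = base + offset  # effective address of the first element
    # Successive elements are `increment` addresses apart; generation
    # stops after `length` elements.
    return [start + i * increment for i in range(length)]


# Base address 400, offset 8, every second word, four elements.
print(element_addresses(base=400, offset=8, increment=2, length=4))
# A negative offset and an increment of 1.
print(element_addresses(base=400, offset=-4, increment=1, length=3))
```

The vector length is what terminates the instruction: once `length` addresses have been generated, no further elements are fetched.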

Array processor

An array processor is a computer/processor that has an architecture especially designed for processing arrays (e.g.
matrices) of numbers. The architecture includes a number of processors (say 64 by 64) working simultaneously,
each handling one element of the array, so that a single operation can apply to all elements of the array in parallel.
To obtain the same effect in a conventional processor, the operation must be applied to each element of the array
sequentially, and consequently much more slowly. An array processor may be built as a self-contained unit
attached to a main computer via an I/O port or internal bus; alternatively, it may be a distributed array processor
where the processing elements are distributed throughout, and closely linked to, a section of the computer's
memory. Array processors are very powerful tools for handling problems with a high degree of parallelism. They
do, however, demand a modified approach to programming. The conversion of conventional (sequential) programs
to serve array processors is not a trivial task, and it is sometimes necessary to select different (parallel) algorithms
to suit the parallel approach.

Types of Array Processors


There are basically two types of array processors:
1. Attached Array Processors
2. SIMD Array Processors

Attached Array Processors


An attached array processor is a processor which is attached to a general purpose computer and its purpose is to
enhance and improve the performance of that computer in numerical computational tasks. It achieves high
performance by means of parallel processing with multiple functional units.

SIMD Array Processors


SIMD is the organization of a single computer containing multiple processors operating in parallel. The processing
units operate under the control of a common control unit, thus providing a single instruction stream and
multiple data streams. A general block diagram of an array processor is shown below. It contains a set of identical
processing elements (PEs), each of which has a local memory M. Each processing element includes an ALU
and registers. The master control unit controls all the operations of the processing elements. It also decodes the
instructions and determines how each instruction is to be executed. The main memory is used for storing the
program. The control unit is responsible for fetching the instructions. Vector instructions are sent to all PEs
simultaneously, and results are returned to the memory. The best known SIMD array processor is the ILLIAC IV
computer, designed at the University of Illinois and built by the Burroughs Corporation. SIMD processors are
highly specialized computers. They are suitable only for numerical problems that can be expressed in vector or
matrix form; they are not suitable for other types of computations.
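
The organization described above can be modelled in a few lines. This is a toy sketch, not a description of the ILLIAC IV or any real machine: the class names, the two-operation instruction set, and the use of a dictionary as local memory are all assumptions, and the Python loop only stands in for PEs that would execute simultaneously.

```python
class ProcessingElement:
    """One PE with its own local memory M (modelled as a dict)."""

    def __init__(self, pe_id, local_memory):
        self.pe_id = pe_id
        self.memory = dict(local_memory)

    def execute(self, op, dest, src1, src2):
        # Hypothetical two-instruction ALU, enough for the illustration.
        alu = {"ADD": lambda x, y: x + y, "MUL": lambda x, y: x * y}
        self.memory[dest] = alu[op](self.memory[src1], self.memory[src2])


class ControlUnit:
    """Master control unit that broadcasts one instruction to every PE."""

    def __init__(self, pes):
        self.pes = pes

    def broadcast(self, op, dest, src1, src2):
        # Single instruction stream, multiple data streams: the same
        # instruction is applied to each PE's own local data.
        for pe in self.pes:
            pe.execute(op, dest, src1, src2)


pes = [ProcessingElement(i, {"a": i, "b": 10}) for i in range(4)]
ControlUnit(pes).broadcast("ADD", "c", "a", "b")
print([pe.memory["c"] for pe in pes])
```

Each PE ends up with a different result in its local memory even though only one instruction was issued.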
Summary about Array Processor

1) Array processors increase the overall instruction processing speed.

2) As most array processors operate asynchronously from the host CPU, they improve the overall capacity of the
system.

3) An array processor has its own local memory, providing extra memory for systems with low memory.

4) The AP (array processor) is most efficient in doing repetitive operations such as computing FFTs and multiplying
large vectors. Its efficiency degrades for non-repetitive operations, or operations requiring a great number of
decisions based on the results of computations.

5) Since APs have their own program and data memory, the AP instructions and data must be transferred to, and
the results transferred from, the AP. These I/O operations may cost more CPU time than the amount saved by using
the array processor.

6) As a general rule, use of the AP is more efficient than the CPU when multiple or complex (such as FFT)
operations, which are highly repetitious, are to be done on a relatively large amount of data (thousands of words or
more). In other cases use of the AP will not help much and will keep other processes from using a valuable resource.
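
Points 5 and 6 amount to a break-even comparison, which can be sketched as below. The cost model is an assumption made for illustration (a fixed setup overhead plus per-word transfer and compute costs in arbitrary time units); the parameter values are hypothetical and not measurements of any real array processor.

```python
def ap_total_time(n_words, ap_time_per_word, transfer_time_per_word, fixed_overhead):
    # Data are transferred to the AP and results transferred back, so
    # each word is charged two transfers on top of the AP compute time.
    per_word = 2 * transfer_time_per_word + ap_time_per_word
    return fixed_overhead + n_words * per_word


def offload_pays_off(n_words, cpu_time_per_word, ap_time_per_word,
                     transfer_time_per_word, fixed_overhead):
    """True when the AP finishes sooner than the host CPU would alone."""
    cpu_total = n_words * cpu_time_per_word
    ap_total = ap_total_time(n_words, ap_time_per_word,
                             transfer_time_per_word, fixed_overhead)
    return ap_total < cpu_total


# Hypothetical costs: CPU 10 units/word, AP 1 unit/word,
# transfer 2 units/word each way, 1000 units of setup overhead.
for n in (100, 1000):
    print(n, offload_pays_off(n, 10, 1, 2, 1000))
```

With these made-up numbers the AP loses on 100 words and wins on 1000 words, matching the rule that it pays off only on relatively large amounts of data.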

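Cache Coherence: Even in a single-CPU system, two copies of the same data, one in the cache and another in main
memory, may become different, for example when a write updates the cached copy before main memory is updated.
In a multiprocessor system the same can happen between the caches of different processors. This data
inconsistency is called the cache coherence problem.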
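
The inconsistency can be reproduced with a toy write-back cache. This sketch assumes a write-back policy (writes update only the cached copy until an explicit flush) and uses a hypothetical class name; it models no real cache protocol.

```python
class WriteBackCache:
    """Toy cache in front of a shared `memory` dict (assumed write-back policy)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # cached copies, keyed by address
        self.dirty = set()  # addresses changed in cache but not in memory

    def read(self, addr):
        if addr not in self.lines:       # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value         # only the cached copy changes
        self.dirty.add(addr)

    def flush(self):
        for addr in self.dirty:          # copy changed lines back
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()


memory = {0x10: 5}
cache = WriteBackCache(memory)
cache.read(0x10)
cache.write(0x10, 9)
print(cache.read(0x10), memory[0x10])  # cached copy and memory now differ
cache.flush()
print(cache.read(0x10), memory[0x10])  # consistent again after write-back
```

Any other reader of `memory` between the write and the flush, such as a second processor, would see the stale value.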
