
Unit # 5

Pipeline and Vector Processing

Dr. Rajesh Tiwari


Professor (CSE – AIML)
CMREC, Hyderabad, Telangana
Characteristics of multiprocessors
• A multiprocessor system is an interconnection of two or more
CPUs with memory and input-output equipment.
• The term “processor” in multiprocessor can mean either a central
processing unit (CPU) or an input-output processor (IOP).
• Multiprocessors are classified as multiple instruction stream,
multiple data stream (MIMD) systems
• The similarity and distinction between multiprocessor and
multicomputer are:
– Similarity
• Both support concurrent operations
– Distinction
• A multicomputer network consists of several autonomous computers that may or may not communicate with each other.
• A multiprocessor system is controlled by one operating system that provides interaction between processors, and all the components of the system cooperate in the solution of a problem.
Characteristics of multiprocessors
• Multiprocessing improves the reliability of the system.
• The benefit derived from a multiprocessor organization is an
improved system performance.
– Multiple independent jobs can be made to operate in parallel.
– A single job can be partitioned into multiple parallel tasks.
• Multiprocessing can improve performance by decomposing a
program into parallel executable tasks.
– The user can explicitly declare that certain tasks of the program be
executed in parallel.
• This must be done prior to loading the program by specifying the parallel
executable segments.
– The other approach is to provide a compiler with multiprocessor software that can automatically detect parallelism in a user’s program.
Characteristics of multiprocessors
• Multiprocessors are classified by the way their memory is organized.
– A multiprocessor system with common shared memory is classified as a
shared-memory or tightly coupled multiprocessor.
• Tolerate a higher degree of interaction between tasks.
– Each processor element with its own private local memory is classified as
a distributed-memory or loosely coupled system.
• Are most efficient when the interaction between tasks is minimal
Interconnection Structures
• The components that form a multiprocessor system are CPUs,
IOPs connected to input-output devices, and a memory unit.
• The interconnection between the components can have
different physical configurations, depending on the number of
transfer paths that are available
– Between the processors and memory in a shared memory system
– Among the processing elements in a loosely coupled system
• There are several physical forms available for establishing an
interconnection network.
– Time-shared common bus
– Multiport memory
– Crossbar switch
– Multistage switching network
– Hypercube system
Time Shared Common Bus
• A common-bus multiprocessor system consists of a number of
processors connected through a common path to a memory unit.
• Disadvantages:
– Only one processor can communicate with the memory or
another processor at any given time.
– As a consequence, the total overall transfer rate within the
system is limited by the speed of the single path
• Fig. 5.1 shows the basic time-shared common bus organization; a more economical dual bus structure is depicted in Fig. 5.2.
• Part of the local memory may be designed as a cache memory
attached to the CPU.
Time Shared Common Bus

Figure 5.1: Time shared common bus organization


Time Shared Common Bus

Figure 5.2: System bus structure for multiprocessors


Multiport Memory
• A multiport memory system employs separate buses between
each memory module and each CPU.
• The module must have internal control logic to determine which
port will have access to memory at any given time.
• Memory access conflicts are resolved by assigning fixed
priorities to each memory port.
• Advantage:
– The high transfer rate can be achieved because of the multiple paths.
• Disadvantage:
– It requires expensive memory control logic and a large number of cables
and connections
• Fig. 5.3 shows a multiport memory system.
Multiport Memory

Figure 5.3: Multiport memory organization


Crossbar Switch
• This consists of a number of crosspoints that are placed at
intersections between processor buses and memory module
paths.
• The small square in each crosspoint is a switch that determines
the path from a processor to a memory module.
• Advantage:
– Supports simultaneous transfers from all memory modules
• Disadvantage:
– The hardware required to implement the switch can become
quite large and complex.
• Fig. 5.4 shows the functional design of a crossbar switch
connected to one memory module.
• Fig. 5.5 shows a block diagram of the crossbar switch.
Crossbar Switch

Figure 5.4: Crossbar switch


Crossbar Switch

Figure 5.5: Block diagram of crossbar switch


Multistage Switching Network
• The basic component of a multistage network is a two-input, two-output interchange switch, as shown in the figure below.
Multistage Switching Network
• Using the 2x2 switch as a building block, it is possible to build
a multistage network to control the communication between a
number of sources and destinations.
– To see how this is done, consider the binary tree shown in the figure below.
– Certain request patterns cannot be satisfied simultaneously; e.g., if P1 is connected to a destination in 000–011, then P2 can only be connected to a destination in 100–111.
Multistage Switching Network
• One such topology is the omega switching network shown in Fig.
5.6.

Fig. 5.6: 8 x 8 Omega Switching Network


• A particular request is initiated in the switching network by the
source, which sends a 3-bit pattern representing the destination
number.
• As the binary pattern moves through the network, each level
examines a different bit to determine the 2 x 2 switch setting.
• Level 1 inspects the most significant bit, level 2 inspects the middle bit, and level 3 inspects the least significant bit.
• When a request arrives on either input of a 2 x 2 switch, it is routed to the upper output if the specified bit is 0, or to the lower output if the bit is 1.
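The destination-tag routing just described can be sketched in a few lines of Python (an illustration of ours, not part of the original slides; the function name and "upper"/"lower" labels are our own):

```python
def omega_route(dest, stages=3):
    """Per-stage switch settings for an 8 x 8 omega network.

    Level 1 examines the most significant bit of the 3-bit destination,
    each later level the next bit; a 0 bit selects the upper output of
    the 2 x 2 switch, a 1 bit the lower output.
    """
    settings = []
    for level in range(stages):
        bit = (dest >> (stages - 1 - level)) & 1
        settings.append("lower" if bit else "upper")
    return settings

# Destination 101 routes lower, upper, lower through the three levels.
print(omega_route(0b101))  # ['lower', 'upper', 'lower']
```

Every source uses the same rule, which is why the switch settings depend only on the destination address, not on which source issued the request.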
Multistage Switching Network
• Some request patterns cannot be connected simultaneously; e.g., two particular sources may not be connectable simultaneously to destinations 000 and 001.

• In a tightly coupled multiprocessor system, the source is a processor and the destination is a memory module. The first pass through the network sets up the path. Succeeding passes are used to transfer the address into memory and then transfer the data in either direction, depending on whether the request is a read or a write.

• In a loosely coupled multiprocessor system, both the source and destination are processing elements. After the path is established, the source processor transfers a message to the destination processor.
Hypercube System
• The hypercube or binary n-cube multiprocessor structure is a loosely coupled system composed of N = 2^n processors interconnected in an n-dimensional binary cube.
– Each processor forms a node of the cube; in effect, each node contains not only a CPU but also local memory and an I/O interface.
– Each processor address differs from that of each of its n neighbors by exactly one bit position.
• Fig. 5.7 shows the hypercube structure for n = 1, 2, and 3.
• Routing messages through an n-cube structure may take from one to n links from a source node to a destination node.
– A routing procedure can be developed by computing the exclusive-OR of the source node address with the destination node address.
– The resulting binary value has 1 bits in the positions where the two node addresses differ; the message is then sent along any one of those axes.
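The exclusive-OR routing procedure can be illustrated with a short Python sketch (names are ours; the slides do not prescribe an implementation, and real routers pick one differing axis per hop exactly as below):

```python
def hypercube_route(src, dst, n):
    """Route a message in an n-cube by flipping, one hop at a time,
    each bit position where the source and destination addresses
    differ (the 1 bits of src XOR dst)."""
    path = [src]
    node = src
    diff = src ^ dst          # 1 bits mark the axes on which the nodes differ
    for axis in range(n):
        if diff & (1 << axis):
            node ^= (1 << axis)   # move one link along this axis
            path.append(node)
    return path

# In a 3-cube, node 000 reaches node 101 in two hops: 000 -> 001 -> 101.
print([format(v, "03b") for v in hypercube_route(0b000, 0b101, 3)])
```

The number of hops equals the number of 1 bits in src XOR dst, which is why a route never needs more than n links.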
Hypercube System
• A representative of the hypercube architecture is the Intel iPSC
computer complex.
• It consists of 128 (n = 7) microcomputers; each node consists of a CPU, a floating-point processor, local memory, and serial communication interface units.

Fig. 5.7: Hypercube structures for n=1,2,3


Interprocessor Arbitration
• System Bus
– A typical system bus consists of approximately 100 signal lines.
– These lines are divided into three functional groups: data, address, and control.
– In addition, there are power distribution lines that supply power to the components.
– For example, the IEEE standard 796 multibus system has 16 data lines, 24 address lines, 26 control lines, and 20 power lines, for a total of 86 lines.
Interprocessor Arbitration
Serial Arbitration Procedure:
• The serial priority resolving technique is obtained from a daisy-
chain connection of bus arbitration circuits similar to the priority
interrupt logic presented.

• The processors connected to the system bus are assigned priority


according to their position along the priority control line.

• The device closest to the priority line is assigned the highest


priority.

• When multiple devices concurrently request the use of the bus, the
device with the highest priority is granted access to it.
Interprocessor Arbitration
Serial Arbitration Procedure:
• Next Figure shows the daisy-chain connection of four arbiters.
• It is assumed that each processor has its own bus arbiter logic
with priority-in and priority-out lines.
• The priority out (PO) of each arbiter is connected to the
priority in (PI) of the next-lower-priority arbiter.
• The PI of the highest-priority unit is maintained at a logic 1
value.
• The highest-priority unit in the system will always receive
access to the system bus when it requests it.
• The PO output for a particular arbiter is equal to 1 if its PI
input is equal to 1 and the processor associated with the arbiter
logic is not requesting control of the bus.
Interprocessor Arbitration

Figure: Serial (daisy-chain) arbitration


Interprocessor Arbitration
• If the processor requests control of the bus and the corresponding arbiter finds its PI input equal to 1, it sets its PO output to 0.
• Lower-priority arbiters receive a 0 in PI and generate a 0 in PO.
• Thus the processor whose arbiter has PI = 1 and PO = 0 is the one that is given control of the system bus.
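The PI/PO rule can be modeled behaviorally in Python (a sketch of ours; the real arbiters are combinational hardware, but the grant logic is the same):

```python
def daisy_chain_grant(requests):
    """Model the serial (daisy-chain) arbitration line.

    requests[0] belongs to the highest-priority arbiter.  PI of the
    first arbiter is held at 1; each arbiter asserts PO = 1 only if
    its PI is 1 and its own processor is not requesting the bus.
    The arbiter with PI = 1 and PO = 0 is granted the bus.
    """
    pi = True
    for i, requesting in enumerate(requests):
        po = pi and not requesting    # pass priority along only if idle
        if pi and not po:
            return i                  # this arbiter wins the bus
        pi = po
    return None                       # no processor requested the bus
```

With arbiters 1 and 2 both requesting, arbiter 1 (closer to the head of the chain) wins, exactly as the positional priority rule dictates.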
Interprocessor Arbitration
Parallel Arbitration Logic:
• The parallel bus arbitration technique uses an external priority encoder and a decoder, as shown in the next figure.
• Each bus arbiter in the parallel scheme has a bus request output line and a bus acknowledge input line.
• Each arbiter enables its request line when its processor is requesting access to the system bus.
• The processor takes control of the bus if its acknowledge input line is enabled.
• The bus busy line provides an orderly transfer of control, as in the daisy-chaining case.
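The encoder/decoder pair can be sketched behaviorally in Python (our illustration; the figure's scheme uses a hardware priority encoder feeding a decoder, modeled here as one function):

```python
def parallel_arbitrate(requests):
    """Model parallel arbitration: a priority encoder picks the
    highest-priority active request (index 0 is highest), and the
    decoder asserts that arbiter's acknowledge line (one-hot)."""
    ack = [False] * len(requests)
    for i, requesting in enumerate(requests):
        if requesting:
            ack[i] = True             # decoder enables this acknowledge line
            return ack, i
    return ack, None                  # no requests: nothing acknowledged
```

Unlike the daisy chain, all request lines are examined at once, so arbitration time does not grow with the number of processors.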
Interprocessor Arbitration

Figure: Parallel arbitration


Interprocessor Communication
& Synchronization
• Interprocessor communication is used for interchanging useful information among various regions in one or more processes (or programs).
• This communication could involve letting another process know that some event has occurred, or transferring data from one process to another.
Interprocessor Communication
& Synchronization

Figure: Interprocess communication


Interprocessor Communication
& Synchronization
• Synchronization is an essential part of interprocess communication.
• It refers to the case where the data used to communicate between processors is control information.
• It is either provided by the interprocess control mechanism or handled by the communicating processes.
• It is required to maintain the correct sequence of processes and to ensure fair access to shared writable data.
Interprocessor Communication
& Synchronization
• Multiprocessor systems have various mechanisms for
the synchronization of resources.
• Some common methods for providing synchronization are:
– Mutual Exclusion
– Semaphore
– Barrier
– Spinlock
Interprocessor Communication
& Synchronization
• Mutual Exclusion
– Mutual exclusion requires that only a single process or thread can enter the critical section at a time.
– This also helps synchronize the processes and prevents race conditions by keeping shared state consistent.
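A minimal Python illustration of mutual exclusion (our example, not from the slides): a lock guards the critical section so that concurrent increments of a shared counter are never lost.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:            # critical section: one thread at a time
            counter += 1

# Four threads race on the counter; the lock keeps every update intact.
threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```

Without the `with lock:` line, the read-modify-write of `counter` could interleave between threads and updates would be lost.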
Interprocessor Communication
& Synchronization
• Semaphore
– Semaphore is a type of variable that generally controls the
access to the shared resources by several processes.
– Semaphore is divided into two types as follows:
• Binary Semaphore
A binary semaphore is limited to the values zero and one. It can be used to control access to a single resource; in particular, it can enforce mutually exclusive access to a critical section in user code.

• Counting Semaphore
A counting semaphore may take any non-negative integer value. It can be used to control access to a resource that has multiple instances.
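Both semaphore types map directly onto Python's threading module (a sketch of ours, assuming a pool of two resource instances; a binary semaphore would simply be `threading.Semaphore(1)`):

```python
import threading

pool = threading.Semaphore(2)   # counting semaphore: two resource instances
guard = threading.Lock()
in_use = []
peak = 0

def use_resource(worker):
    global peak
    with pool:                  # blocks while both instances are taken
        with guard:
            in_use.append(worker)
            peak = max(peak, len(in_use))   # record concurrency observed
        with guard:
            in_use.remove(worker)

threads = [threading.Thread(target=use_resource, args=(i,)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the six workers are scheduled, the semaphore guarantees that at most two of them ever hold a resource instance at the same time.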
Interprocessor Communication
& Synchronization
• Barrier
– A barrier (as its name suggests) does not allow an individual process to proceed until all participating processes have reached it.
– Many parallel languages use it, and collective routines impose barriers.
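Python's `threading.Barrier` exhibits exactly this behavior (our illustration): no worker's post-barrier work can begin until every worker has arrived.

```python
import threading

barrier = threading.Barrier(3)  # all three workers must arrive
log = []
log_lock = threading.Lock()

def worker(wid):
    with log_lock:
        log.append(("before", wid))
    barrier.wait()              # no worker proceeds until all have arrived
    with log_lock:
        log.append(("after", wid))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Regardless of scheduling, every "before" entry in the log precedes every "after" entry.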
Interprocessor Communication
& Synchronization
• Spinlock
– A spinlock is a type of lock that makes a process wait until the lock becomes available before it can proceed.
– A process trying to acquire a spinlock waits in a loop, repeatedly checking whether the lock is available.
– This is also known as busy waiting, because the process does no useful work even though it remains active.
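The busy-waiting loop can be sketched in Python; the atomic test-and-set that real spinlocks get from a hardware instruction is emulated here with a non-blocking lock acquire (illustrative only):

```python
import threading

class SpinLock:
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # Busy wait: keep retrying until the test-and-set succeeds.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()

spin = SpinLock()
total = 0

def bump(times):
    global total
    for _ in range(times):
        spin.acquire()
        total += 1            # protected update
        spin.release()

threads = [threading.Thread(target=bump, args=(5_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)  # 20000
```

Spinning wastes CPU cycles while waiting, so spinlocks pay off only when the expected hold time is shorter than the cost of putting the waiter to sleep.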
Cache Coherence
• The primary advantage of cache is its ability to reduce the average access time in uniprocessors.
• When the processor finds a word in cache during a read operation, the main memory is not involved in the transfer.
• If the operation is a write, there are two commonly used procedures to update memory.
• In the write-through policy, both cache and main memory are updated with every write operation.
• In the write-back policy, only the cache is updated, and the location is marked so that it can be copied later into main memory.
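The two write policies can be contrasted with a toy cache model in Python (the class and method names are our own invention, not a real API):

```python
class ToyCache:
    """Single-level cache over a dict-backed memory, illustrating the
    write-through and write-back update policies described above."""

    def __init__(self, memory, policy):
        self.memory = memory   # address -> value
        self.policy = policy   # "write-through" or "write-back"
        self.lines = {}        # cached copies
        self.dirty = set()     # locations marked for later copy-back

    def write(self, addr, value):
        self.lines[addr] = value
        if self.policy == "write-through":
            self.memory[addr] = value   # memory updated on every write
        else:
            self.dirty.add(addr)        # write-back: mark now, copy later

    def flush(self):
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()
```

With write-through, memory tracks every store immediately; with write-back, memory lags behind the cache until `flush()` copies the dirty lines back.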
• To ensure the ability of the system to execute memory operations correctly, the multiple copies must be kept identical.
• This requirement imposes a cache coherence problem.
• A memory scheme is coherent if the value returned on a load instruction is always the value given by the latest store instruction with the same address.
• Without a proper solution to the cache coherence problem, caching cannot be used in bus-oriented multiprocessors with two or more processors.
• The cache coherence problem arises when multiple processor cores share the same memory hierarchy but have their own L1 data and instruction caches.
• Incorrect execution could occur if two or more copies of a given cache block exist in two processors’ caches and one of these copies is modified.
• In a multiprocessor system, data inconsistency may occur
among adjacent levels or within the same level of the memory
hierarchy.
• In a shared memory multiprocessor with a separate cache
memory for each processor, it is possible to have many copies
of any one instruction operand: one copy in the main memory
and one in each cache memory.
• When one copy of an operand is changed, the other copies of
the operand must be changed also.
• Example: the cache and the main memory may hold inconsistent copies of the same object.
• Suppose there are three processors, each with its own cache, and consider the following scenario:
• Processor 1 reads X: it obtains 24 from memory and caches it.
• Processor 2 reads X: it obtains 24 from memory and caches it.
• Processor 1 then writes X = 64: only its locally cached copy is updated.
• Now, when processor 3 reads X, what value should it get?
• Memory and processor 2 think X is 24, while processor 1 thinks it is 64.
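This scenario can be replayed with three private caches over one shared memory, using write-back updates and no coherence protocol (a deliberately broken sketch of ours, written to exhibit the stale read):

```python
memory = {"X": 24}
caches = {1: {}, 2: {}, 3: {}}   # one private cache per processor

def read(proc, addr):
    if addr not in caches[proc]:        # miss: fetch from shared memory
        caches[proc][addr] = memory[addr]
    return caches[proc][addr]

def local_write(proc, addr, value):
    caches[proc][addr] = value          # write-back: only the local copy changes

read(1, "X")                 # processor 1 caches 24
read(2, "X")                 # processor 2 caches 24
local_write(1, "X", 64)      # processor 1's copy becomes 64; memory still 24
stale = read(3, "X")         # processor 3 fetches the stale value
print(stale)  # 24, even though processor 1 holds 64
```

A coherence protocol would either propagate the new value of X to memory and the other caches, or invalidate the stale copies before processor 3's read completes.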
• Because multiple processors operate in parallel and multiple caches may independently hold different copies of the same memory block, a cache coherence problem arises.
• Cache coherence is the discipline that ensures that changes in the
values of shared operands are propagated throughout the system in a
timely fashion.
• There are three distinct levels of cache coherence:
– Every write operation appears to occur instantaneously.
– All processors see exactly the same sequence of changes of values for each separate operand.
– Different processors may observe different sequences of values for the same operand; this is known as non-coherent behavior.
Thank You
