Unit-V
Any operation that can be decomposed into a sequence of suboperations of about the same
complexity can be implemented by a pipeline processor
The technique is efficient for those applications that need to repeat the same task many times
with different sets of data
The general structure of a four-segment pipeline is as shown in the figure.
A task is the total operation performed
going through all segments of a pipeline
The behavior of a pipeline can be illustrated with a space-time diagram
This shows the segment utilization as a function of time
Once the pipeline is full, it takes only one clock period to obtain an output.
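As an illustration, the following C sketch prints the space-time diagram for a k-segment pipeline executing n tasks; the values k = 4 and n = 6 match the example discussed below, and the function name is illustrative rather than part of the source material.

#include <stdio.h>

/* Print a space-time diagram: rows are pipeline segments, columns are
   clock cycles, and each entry is the task occupying that segment in
   that cycle (-- means the segment is idle).                           */
static void space_time(int k, int n)
{
    int total = k + n - 1;               /* clock cycles needed          */
    for (int seg = 1; seg <= k; seg++) {
        printf("Segment %d:", seg);
        for (int clk = 1; clk <= total; clk++) {
            int task = clk - seg + 1;    /* task in this segment now     */
            if (task >= 1 && task <= n)
                printf(" T%d", task);
            else
                printf(" --");
        }
        printf("\n");
    }
}

int main(void)
{
    space_time(4, 6);                    /* 4 segments, 6 tasks          */
    return 0;
}

Each row shows which task occupies a segment in each clock cycle; after the first k cycles, one task completes in every cycle.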
Consider a k-segment pipeline with a clock cycle time tp used to execute n tasks.
The first task T1 requires time ktp to complete
The remaining n – 1 tasks finish at the rate of one task per clock cycle and will be completed
after time (n – 1)tp
The total time to complete the n tasks is (k + n – 1)tp
The example above, with k = 4 segments and n = 6 tasks, requires (4 + 6 – 1) = 9 clock cycles to finish
Consider a non-pipeline unit that performs the same operation and takes tn time to complete
each task
The total time to complete n tasks would be ntn
The speedup S of pipeline processing over an equivalent non-pipeline processing is defined by
the ratio S = n·tn / [(k + n – 1)·tp]
If we assume that the time to process a task is the same in both circuits, tn = k·tp, the speedup
becomes S = k·n / (k + n – 1), which approaches k as the number of tasks n becomes much larger than k
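A short C sketch of this calculation, using an assumed clock period of 20 ns purely for illustration:

#include <stdio.h>

int main(void)
{
    int    k  = 4;           /* pipeline segments                         */
    int    n  = 6;           /* tasks to execute                          */
    double tp = 20.0;        /* pipeline clock period in ns (assumed)     */
    double tn = k * tp;      /* non-pipelined task time, here tn = k*tp   */

    double t_pipe   = (k + n - 1) * tp;   /* pipelined total: (k+n-1)tp   */
    double t_serial = n * tn;             /* non-pipelined total: n*tn    */
    double speedup  = t_serial / t_pipe;  /* S = n*tn / (k+n-1)tp         */

    printf("pipelined: %.0f ns, non-pipelined: %.0f ns, speedup: %.2f\n",
           t_pipe, t_serial, speedup);
    /* With tn = k*tp, S = k*n/(k+n-1); as n grows, S approaches k. */
    return 0;
}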
For the four-segment instruction pipeline (fetch instruction FI, decode and calculate the effective
address DA, fetch operand FO, execute EX) it is assumed that the processor has separate instruction
and data memories, so that segments FI and FO can access memory at the same time
Assume now that instruction 3 is a branch instruction. As soon as this instruction is decoded in
segment DA in step 4, the transfer from FI to DA of the other instructions is halted until the
branch instruction is executed in step 6. If the branch is taken, a new instruction is fetched in
step 7. If the branch is not taken, the instruction fetched previously in step 4 can be used. The
pipeline then continues until a new branch instruction is encountered.
Another delay may occur in the pipeline if the EX segment needs to store the result of the
operation in the data memory while the FO segment needs to fetch an operand. In that case,
segment FO must wait until segment EX has finished its operation.
Reasons for the pipeline to deviate from its normal operation are:
Resource conflicts caused by access to memory by two segments at the same time. Most of
these conflicts can be resolved by using separate instruction and data memories.
Data dependency conflicts arise when an instruction depends on the result of a previous
instruction, but this result is not yet available.
Branch difficulties arise from program control instructions that may change the value of PC
Methods to handle data dependency:
Hardware interlocks are circuits that detect instructions whose source operands are
destinations of prior instructions. Detection causes the hardware to insert the required delays
without altering the program sequence.
Operand forwarding uses special hardware to detect a conflict and then avoid it by routing the
data through special paths between pipeline segments. This requires additional hardware paths
through multiplexers as well as the circuit to detect the conflict.
Delayed load is a procedure that gives the responsibility for solving data conflicts to the
compiler. The compiler is designed to detect a data conflict and reorder the instructions as
necessary to delay the loading of the conflicting data by inserting no-operation instructions.
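The check that hardware interlocks (or forwarding logic) perform can be sketched in C as follows; the Instr structure and register numbers are hypothetical and only illustrate comparing an instruction's source registers against the destinations of instructions still in the pipeline:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical decoded instruction: one destination and two sources. */
typedef struct {
    int dest;     /* destination register number  */
    int src1;     /* first source register number  */
    int src2;     /* second source register number */
} Instr;

/* Interlock check: does the instruction in DA read a register that an
   earlier instruction (still in FO or EX) has not yet written back?   */
static bool needs_stall(const Instr *in_da, const Instr *in_flight, int n_flight)
{
    for (int i = 0; i < n_flight; i++) {
        if (in_da->src1 == in_flight[i].dest ||
            in_da->src2 == in_flight[i].dest)
            return true;   /* hardware inserts a delay (or forwards the result) */
    }
    return false;
}

int main(void)
{
    Instr older[] = { { 3, 1, 2 } };   /* R3 <- R1 op R2, still executing */
    Instr current = { 5, 3, 4 };       /* R5 <- R3 op R4, needs R3        */
    printf("stall needed: %s\n", needs_stall(&current, older, 1) ? "yes" : "no");
    return 0;
}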
Multiport Memory
A multiport memory system employs separate buses between each memory module and each
CPU. This is shown in Fig. for four CPUs and four memory modules (MMs).
Each processor bus is connected to each memory
module. A processor bus consists of the address, data,
and control lines required to communicate with memory.
The memory module is said to have four ports and each
port accommodates one of the buses. The module must
have internal control logic to determine which port will
have access to memory at any given time.
Memory access conflicts are resolved by assigning fixed priorities to each memory port. The
priority for memory access associated with each processor may be established by the physical
port position that its bus occupies in each module.
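A minimal C sketch of such fixed-priority port selection, assuming the CPU wired to the lowest-numbered port has the highest priority (the function name and bit-vector encoding are illustrative):

#include <stdio.h>

/* Fixed-priority port arbiter for one memory module: port 0 (the CPU
   wired to the first physical port) has the highest priority.
   'requests' is a bit vector, bit i set when CPU i requests the module. */
static int select_port(unsigned requests, int n_ports)
{
    for (int port = 0; port < n_ports; port++)
        if (requests & (1u << port))
            return port;          /* highest-priority active request */
    return -1;                    /* no request pending              */
}

int main(void)
{
    unsigned req = 0x0A;          /* CPUs 1 and 3 request this module */
    printf("port granted: %d\n", select_port(req, 4));   /* prints 1  */
    return 0;
}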
The advantage of the multiport memory organization is the high transfer rate that can be
achieved because of the multiple paths between processors and memory.
The disadvantage is that it requires expensive memory control logic and a large number of
cables and connectors.
Crossbar Switch
The crossbar switch organization consists of a number of
cross points that are placed at intersections between
processor buses and memory module paths.
The small square in each cross point is a switch that
determines the path from a processor to a memory
module.
Each switch point has control logic to set up the transfer
path between a processor and memory.
It examines the address that is placed in the bus to determine whether its particular module is
being addressed.
It also resolves multiple requests for access to the same memory module on a predetermined
priority basis.
The functional design of a crossbar switch connected to one memory module is shown in figure.
The circuit consists of multiplexers that select the data, address, and control lines from one
CPU for communication with the memory module.
Priority levels are established by the arbitration
logic to select one CPU when two or more CPUs
attempt to access the same memory.
A crossbar switch organization supports
simultaneous transfers from memory modules because there is a separate path associated with
each module.
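The following C sketch models one arbitration cycle of a crossbar, with each memory module independently selecting one requesting CPU on a fixed-priority basis; the array layout and names are assumptions for illustration only:

#include <stdio.h>

#define N_CPU 4
#define N_MEM 4

/* request[c][m] is nonzero when CPU c wants memory module m.
   grant[m] receives the CPU selected for module m (-1 if none).
   Each module arbitrates independently, so different modules can
   serve different CPUs in the same cycle.                         */
static void crossbar_cycle(int request[N_CPU][N_MEM], int grant[N_MEM])
{
    for (int m = 0; m < N_MEM; m++) {
        grant[m] = -1;
        for (int c = 0; c < N_CPU; c++) {     /* fixed priority: CPU 0 first */
            if (request[c][m]) { grant[m] = c; break; }
        }
    }
}

int main(void)
{
    int request[N_CPU][N_MEM] = {0};
    int grant[N_MEM];
    request[0][2] = 1;                 /* CPU 0 -> module 2 */
    request[3][1] = 1;                 /* CPU 3 -> module 1 */
    crossbar_cycle(request, grant);
    for (int m = 0; m < N_MEM; m++)
        printf("module %d served by CPU %d\n", m, grant[m]);
    return 0;
}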
Multistage Switching Network
The basic component of a multistage network is a two-input, two-output interchange switch.
The switch has the capability of connecting input A to
either of the outputs. Terminal B of the switch behaves in a
similar fashion. The switch also has the capability to
arbitrate between conflicting requests.
Using the 2 x 2 switch as a building block, it is possible to
build a multistage network to control the communication
between a number of sources and destinations.
Consider the binary tree shown in the figure. The two processors
P1 and P2 are connected through switches to eight
memory modules marked in binary from 000 through 111.
The path from source to a destination is determined from
the binary bits of the destination number. The first bit of
the destination number determines the switch output in the first level. The second bit specifies
the output of the switch in the second level, and the third bit specifies the output of the switch in
the third level.
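A C sketch of this bit-by-bit routing rule for the three-level binary tree; the convention that a 0 bit selects the upper switch output and a 1 bit selects the lower output is assumed for illustration:

#include <stdio.h>

/* Route to a destination in a 3-level binary tree of 2x2 switches:
   the most significant bit of the destination selects the output of
   the first-level switch, the next bit the second level, and the
   least significant bit the third level.                            */
static void route(int dest)
{
    printf("path to module %d%d%d:", (dest >> 2) & 1, (dest >> 1) & 1, dest & 1);
    for (int level = 2; level >= 0; level--) {
        int bit = (dest >> level) & 1;
        printf(" level %d -> output %d;", 3 - level, bit);
    }
    printf("\n");
}

int main(void)
{
    route(5);    /* destination 101: outputs 1, 0, 1 */
    route(2);    /* destination 010: outputs 0, 1, 0 */
    return 0;
}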
Many different topologies have been proposed for multistage switching networks to control
processor-memory communication in a tightly coupled multiprocessor system or to control the
communication between the processing elements in a loosely coupled system.
Serial Arbitration Procedure
The processor whose arbiter has a PI = 1 and PO = 0 is the one that is given control of the
system bus
A processor may be in the middle of a bus operation when a higher priority processor requests
the bus. The lower-priority processor must complete its bus operation before it relinquishes
control of the bus.
When an arbiter receives control of the bus (because its PI = 1 and PO = 0) it examines the
busy line. If the line is inactive, it means that no other processor is using the bus. The arbiter
activates the busy line and its processor takes control of the bus. However, if the arbiter finds
the busy line active, it means that another processor is currently using the bus.
The arbiter keeps examining the busy line while the lower-priority processor that lost control of
the bus completes its operation.
When the bus busy line returns to its inactive state, the higher-priority arbiter enables the busy
line, and its corresponding processor can then conduct the required bus transfers.
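The daisy-chain behavior described above can be sketched in C as follows, assuming PO = PI AND (not requesting) for each arbiter and PI of the first arbiter tied to 1; the function and signal names are illustrative:

#include <stdbool.h>
#include <stdio.h>

#define N 4

/* One pass of the daisy chain.  PI of arbiter 0 is tied to 1; each
   arbiter passes PO = PI when it is not requesting and forces PO = 0
   when it is.  The winner is the arbiter with PI = 1 and PO = 0,
   provided the bus busy line is inactive.                             */
static int daisy_chain(const bool request[N], bool bus_busy)
{
    bool pi = true;                 /* priority-in of the first arbiter */
    for (int i = 0; i < N; i++) {
        bool po = pi && !request[i];
        if (pi && !po && !bus_busy)
            return i;               /* this arbiter may seize the bus   */
        pi = po;
    }
    return -1;                      /* no request, or bus still busy    */
}

int main(void)
{
    bool request[N] = { false, true, false, true };  /* arbiters 1 and 3 request */
    printf("bus granted to arbiter %d\n", daisy_chain(request, false)); /* 1 */
    return 0;
}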
Parallel Arbitration Logic
The parallel bus arbitration technique uses an
external priority encoder and a decoder as shown in
Fig. Each bus arbiter in the parallel scheme has a bus
request output line and a bus acknowledge input line.
Each arbiter enables the request line when its
processor is requesting access to the system bus. The
processor takes control of the bus if its acknowledge
input line is enabled.
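A minimal C sketch of the encoder/decoder pair, assuming four request lines with line 0 having the highest priority (the names and the four-line width are illustrative):

#include <stdio.h>

/* 4-to-2 priority encoder plus 2-to-4 decoder, as in the parallel
   arbitration scheme: the encoder selects the highest-priority active
   bus-request line, the decoder enables that arbiter's acknowledge line. */
static int arbitrate(unsigned bus_request, unsigned *bus_ack)
{
    for (int line = 0; line < 4; line++) {       /* line 0 = highest priority */
        if (bus_request & (1u << line)) {
            *bus_ack = 1u << line;               /* decoder output            */
            return line;
        }
    }
    *bus_ack = 0;
    return -1;                                   /* no request pending        */
}

int main(void)
{
    unsigned ack;
    int winner = arbitrate(0x0C, &ack);   /* requests on lines 2 and 3 */
    printf("acknowledge to line %d (ack mask 0x%X)\n", winner, ack);
    return 0;
}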
Dynamic Arbitration Algorithms
A dynamic priority algorithm gives the system the capability for changing the priority of the
devices while the system is in operation.
The time slice algorithm allocates a fixed-length time slice of bus time that is offered
sequentially to each processor, in round-robin fashion. The service given to each system
component with this scheme is independent of its location along the bus.
In a bus system that uses polling, the bus grant signal is replaced by a set of lines called poll
lines which are connected to all units. These lines are used by the bus controller to define an
address for each device connected to the bus.
When a processor that requires access recognizes its address, it activates the bus busy line and
then accesses the bus. After a number of bus cycles, the polling process continues by choosing a
different processor. The polling sequence is normally programmable, and as a result, the
selection priority can be altered under program control.
The least recently used (LRU) algorithm gives the highest priority to the requesting device
that has not used the bus for the longest interval. The priorities are adjusted after a number of
bus cycles according to the LRU algorithm.
In the first-come, first-serve scheme, requests are served in the order received. To implement
this algorithm, the bus controller establishes a queue arranged according to the time that the bus
requests arrive. Each processor must wait for its turn to use the bus on a first-in, first-out
(FIFO) basis.
The rotating daisy-chain procedure is a dynamic extension of the daisy chain algorithm. In this
scheme there is no central bus controller, and the priority line is connected from the priority-out
of the last device back to the priority-in of the first device in a closed loop.
Each arbiter's priority for a given bus cycle is determined by its position along the bus priority
line from the arbiter whose processor is currently controlling the bus. Once an arbiter releases
the bus, it has the lowest priority.
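As one example of a dynamic priority scheme, the C sketch below models the LRU policy described above: the requesting processor with the oldest last grant wins and is then stamped as the most recently used. The timestamps and array sizes are assumptions for illustration.

#include <stdio.h>

#define N 4

/* Least-recently-used bus arbitration: among the requesting processors,
   grant the bus to the one whose last grant is oldest, then stamp it
   with the current cycle number.                                        */
static int lru_grant(const int request[N], long last_used[N], long now)
{
    int  winner = -1;
    long oldest = now + 1;
    for (int i = 0; i < N; i++) {
        if (request[i] && last_used[i] < oldest) {
            oldest = last_used[i];
            winner = i;
        }
    }
    if (winner >= 0)
        last_used[winner] = now;      /* it is now the most recently used */
    return winner;
}

int main(void)
{
    long last_used[N] = { 10, 3, 7, 5 };   /* cycle of each processor's last grant */
    int  request[N]   = { 1, 0, 1, 1 };    /* processors 0, 2, 3 request the bus   */
    printf("bus granted to processor %d\n", lru_grant(request, last_used, 20)); /* 3 */
    return 0;
}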
INTERPROCESSOR COMMUNICATION AND SYNCHRONIZATION
The various processors in a multiprocessor system must be provided with a facility for
communicating with each other. A communication path can be established through common
input-output channels.
In a shared memory multiprocessor system, the most common procedure is to set aside a
portion of memory that is accessible to all processors. The primary use of the common memory
is to act as a message center similar to a mailbox, where each processor can leave messages for
other processors and pick up messages intended for it.
The sending processor structures a request, a message, or a procedure, and places it in the
memory mailbox. Status bits residing in common memory are generally used to indicate the
condition of the mailbox, whether it has meaningful information, and for which processor it is
intended.
The receiving processor can check the mailbox periodically to determine if there are valid
messages for it. The response time of this procedure can be time consuming since a processor
will recognize a request only when polling messages.
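A C sketch of such a mailbox with a status word; the layout, the single-slot design, and the polling interface are assumptions used only to illustrate the idea (a real system would also need atomic updates of the status word):

#include <stdio.h>
#include <string.h>

/* A mailbox in common memory.  'status' tells whether the slot holds a
   valid message and which processor it is intended for.                 */
typedef struct {
    volatile int status;      /* 0 = empty, otherwise id of the target CPU */
    char         message[64];
} Mailbox;

static Mailbox shared_mailbox;    /* would live in the shared memory segment */

static void send(int target_cpu, const char *text)
{
    strncpy(shared_mailbox.message, text, sizeof shared_mailbox.message - 1);
    shared_mailbox.status = target_cpu;       /* set last: message now valid */
}

static int poll(int my_cpu, char *out, size_t len)
{
    if (shared_mailbox.status != my_cpu)
        return 0;                              /* nothing addressed to us     */
    strncpy(out, shared_mailbox.message, len - 1);
    out[len - 1] = '\0';
    shared_mailbox.status = 0;                 /* mark the mailbox empty      */
    return 1;
}

int main(void)
{
    char buf[64] = "";
    send(2, "request: run procedure X");       /* CPU 1 leaves a message for CPU 2 */
    if (poll(2, buf, sizeof buf))              /* CPU 2 polls its mailbox          */
        printf("CPU 2 received: %s\n", buf);
    return 0;
}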
A more efficient procedure is for the sending processor to alert the receiving processor directly
by means of an interrupt signal. This can be accomplished through a software-initiated
interprocessor interrupt by means of an instruction in the program of one processor which when
executed produces an external interrupt condition in a second processor. This alerts the
interrupted processor of the fact that a new message was inserted by the interrupting processor.
In addition to shared memory, a multiprocessor system may have other shared resources. For
example, a magnetic disk storage unit connected to an IOP may be available to all CPUs. This
provides a facility for sharing of system programs stored in the disk.
A communication path between two CPUs can be established through a link between two IOPs
associated with two different CPUs. This type of link allows each CPU to treat the other as an
IO device so that messages can be transferred through the IO path.
To prevent conflicting use of shared resources by several processors there must be a provision
for assigning resources to processors. This task is given to the operating system. There are three
organizations that have been used in the design of operating systems for multiprocessors: master-
slave configuration, separate operating system, and distributed operating system.
In a master-slave mode, one processor, designated the master, always executes the operating
system functions. The remaining processors, denoted as slaves, do not perform operating
system functions. If a slave processor needs an operating system service, it must request it by
interrupting the master and waiting until the current program can be interrupted.
In the separate operating system organization, each processor can execute the operating system
routines it needs. This organization is more suitable for loosely coupled systems where every
processor may have its own copy of the entire operating system.
In the distributed operating system organization, the operating system routines are distributed
among the available processors. However, each particular operating system function is assigned
to only one processor at a time. This type of organization is also referred to as a floating
operating system since the routines float from one processor to another and the execution of the
routines may be assigned to different processors at different times.
In a loosely coupled multiprocessor system the memory is distributed among the processors and
there is no shared memory for passing information.
The communication between processors is by means of message passing through IO channels.
The communication is initiated by one processor calling a procedure that resides in the memory
of the processor with which it wishes to communicate. When the sending processor and
receiving processor name each other as a source and destination, a channel of communication is
established.
A message is then sent with a header and various data objects used to communicate between
nodes. There may be a number of possible paths available to send the message between any two
nodes.
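An illustrative C layout for such a message; the particular header fields and sizes are assumptions, not a format defined in the source:

#include <stdint.h>
#include <stdio.h>

/* Illustrative message layout for node-to-node communication:
   a header describing the message followed by the data objects. */
typedef struct {
    uint16_t source_node;     /* sending node id                     */
    uint16_t dest_node;       /* receiving node id                   */
    uint16_t msg_type;        /* e.g. request, reply, data block     */
    uint16_t length;          /* number of payload bytes that follow */
} MessageHeader;

typedef struct {
    MessageHeader header;
    uint8_t       payload[256];   /* the data objects being passed   */
} Message;

int main(void)
{
    Message m = { { 1, 3, 0, 5 }, "hello" };   /* node 1 sends 5 bytes to node 3 */
    printf("message of %u bytes from node %u to node %u\n",
           m.header.length, m.header.source_node, m.header.dest_node);
    return 0;
}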
The operating system in each node contains routing information indicating the alternative paths
that can be used to send a message to other nodes. The communication efficiency of the
interprocessor network depends on the communication routing protocol, processor speed, data
link speed, and the topology of the network.
Interprocessor Synchronization
The instruction set of a multiprocessor contains basic instructions that are used to implement
communication and synchronization between cooperating processes.
Communication refers to the exchange of data between different processes. For example,
parameters passed to a procedure in a different processor constitute interprocessor
communication.
Synchronization refers to the special case where the data used to communicate between
processors is control information. Synchronization is needed to enforce the correct sequence of
processes and to ensure mutually exclusive access to shared writable data.
Multiprocessor systems usually include various mechanisms to deal with the synchronization of
resources.
Low-level primitives are implemented directly by the hardware. These primitives are the basic
mechanisms that enforce mutual exclusion for more complex mechanisms implemented in
software.
A number of hardware mechanisms for mutual exclusion have been developed.
One of the most popular methods is through the use of a binary semaphore.
Mutual Exclusion with a Semaphore
A properly functioning multiprocessor system must provide a mechanism that will guarantee
orderly access to shared memory and other shared resources.
This is necessary to protect data from being changed simultaneously by two or more processors.
This mechanism has been termed mutual exclusion. Mutual exclusion must be provided in a
multiprocessor system to enable one processor to exclude or lock out access to a shared
resource by other processors when it is in a critical section.
A critical section is a program sequence that, once begun, must complete execution before
another processor accesses the same shared resource.
A binary variable called a semaphore is often used to indicate whether or not a processor is
executing a critical section. A semaphore is a software controlled flag that is stored in a
memory location that all processors can access.
When the semaphore is equal to 1, it means that a processor is executing a critical program, so
that the shared memory is not available to other processors.
When the semaphore is equal to 0, the shared memory is available to any requesting processor.
Processors that share the same memory segment agree by convention not to use the memory
segment unless the semaphore is equal to 0, indicating that memory is available. They also
agree to set the semaphore to 1 when they are executing a critical section and to clear it to 0
when they are finished.
Testing and setting the semaphore is itself a critical operation and must be performed as a single
indivisible operation. If it is not, two or more processors may test the semaphore simultaneously
and then each set it, allowing them to enter a critical section at the same time. This action would
allow simultaneous execution of critical sections, which can result in erroneous initialization of
control parameters and a loss of essential information.
A semaphore can be initialized by means of a test and set instruction in conjunction with a
hardware lock mechanism.
A hardware lock is a processor generated signal that serves to prevent other processors from
using the system bus as long as the signal is active. The test-and-set instruction tests and sets a
semaphore and activates the lock mechanism during the time that the instruction is being
executed.
This prevents other processors from changing the semaphore between the time that the
processor is testing it and the time that it is setting it. Assume that the semaphore is a bit in the
least significant position of a memory word whose address is symbolized by SEM.
Let the mnemonic TSL designate the "test and set while locked" operation. The instruction
TSL SEM will be executed in two memory cycles (the first to read and the second to write)
without interference as follows:
R ← M[SEM]      Test semaphore
M[SEM] ← 1      Set semaphore
The semaphore is tested by transferring its value to a processor register R and then it is set to 1.
The value in R determines what to do next.
If the processor finds that R = 1, it knows that the semaphore was originally set. (The fact that it
is set again does not change the semaphore value.) That means that another processor is
executing a critical section, so the processor that checked the semaphore does not access the
shared memory.
If R = 0, it means that the common memory (or the shared resource that the semaphore
represents) is available. The semaphore is set to 1 to prevent other processors from accessing
memory. The processor can now execute the critical section.
The last instruction in the program must clear location SEM to zero to release the shared
resource to other processors. Note that the lock signal must be active during the execution of
the test-and-set instruction. It does not have to be active once the semaphore is set.
Thus the lock mechanism prevents other processors from accessing memory while the
semaphore is being set. The semaphore itself, when set, prevents other processors from
accessing shared memory while one processor is executing a critical section.
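The TSL-based protocol can be sketched in C using the C11 atomic_flag type, whose test-and-set operation is indivisible and therefore plays the role of the TSL instruction with its hardware lock; the function names are illustrative:

#include <stdatomic.h>
#include <stdio.h>

/* SEM: a software-controlled flag in memory that all processors can access.
   atomic_flag_test_and_set performs the read and the write as one
   indivisible operation, playing the role of the TSL instruction.          */
static atomic_flag sem = ATOMIC_FLAG_INIT;

static void enter_critical_section(void)
{
    /* TSL SEM: R <- M[SEM], M[SEM] <- 1.  If R = 1 the semaphore was
       already set, so another processor is in its critical section;
       branch back and test again.                                      */
    while (atomic_flag_test_and_set(&sem))
        ;                               /* busy-wait                    */
}

static void leave_critical_section(void)
{
    atomic_flag_clear(&sem);            /* clear SEM to release the resource */
}

int main(void)
{
    enter_critical_section();
    printf("inside critical section: shared data updated safely\n");
    leave_critical_section();
    return 0;
}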