05 - 02 Multi Processors
Multi Processors
Syllabus
Characteristics of Multiprocessors, Interconnection Structures, Interprocessor arbitration, Interprocessor
communication and synchronization, Cache Coherence
Characteristics of Multiprocessors:
The interconnection of two or more CPUs with memory and input-output equipment is called a multiprocessor
system.
However, a system with a single CPU and one or more IOPs is usually not a multiprocessor system, unless the
IOP has computational facilities comparable to a CPU.
So, a multiprocessor system implies the existence of multiple CPUs, with one or more IOPs.
A multicomputer system consists of several autonomous computers interconnected with each other by means of
communication lines.
The reliability of the system is improved with multiprocessing because a failure or error in one processor has a
limited effect on the rest of the system, as the second processor can be assigned to perform the functions of the
disabled processor.
Hence, the system will continue to function correctly with probably some loss in efficiency.
High performance can be achieved using a multiprocessor organization as the computations can proceed in
parallel in one of the following two ways.
1. Multiple independent jobs can be operated in parallel.
2. A single job can be partitioned into multiple parallel tasks.
For example, in a computer system:
One processor performs high-speed floating-point mathematical computations.
Other processor(s) take(s) care of routine data-processing tasks.
Based on the organization of memory, the multiprocessors are classified into two types:
1. Shared-memory or tightly coupled multiprocessor
2. Distributed-memory or loosely coupled multiprocessor
Tightly coupled systems can tolerate a higher degree of interaction between tasks.
Loosely coupled systems are most efficient for minimal interaction between tasks.
Interconnection Structures:
A multiprocessor system consists of following components:
- CPUs,
- IOPs connected to input-output devices,
- A memory unit that may be partitioned into a number of separate modules
Depending on the number of transfer paths available between the processors and memory in a shared memory
system or distributed memory system, the interconnection among the components can have different physical
configurations. They are:
1. Time-shared common bus
2. Multiport memory
3. Crossbar switch
4. Multistage switching network
5. Hypercube system
1. Time-Shared Common Bus:
In a common-bus multiprocessor system, a number of processors are connected through a common path to a
memory unit as shown below.
In this configuration:
- At any given time, only the processor having control of the bus is allowed to communicate with the
memory or another processor.
- Any processor wishing to initiate a transfer:
First determines the availability of the bus
Once the bus is available, the address of the destination is used to initiate the transfer, and a
command is issued to indicate the operation to be performed.
The destination recognizes its address on the bus and responds to the sender’s control signals
accordingly.
Conflicts are resolved by incorporating a bus controller that establishes priorities among the
requesting units.
Limitations:
- A single common-bus system is restricted to one transfer at a time. So, all other processors are either busy
with internal operations or must be idle waiting for the bus.
- Hence, the total overall transfer rate within the system is limited by the speed of the single bus.
To overcome these, the processors in the system can be kept busy through the implementation of two or more
independent buses to permit multiple simultaneous bus transfers. But, this increases the overall system cost and
complexity.
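The one-transfer-at-a-time discipline can be sketched in a few lines of Python (a minimal illustration, not
from the source; the class and unit names are invented): only a unit that finds the bus free may transfer,
and every other requester must wait and retry.

class CommonBus:
    def __init__(self):
        self.busy = False

    def transfer(self, src, dest, command, data=None):
        # Only one transfer may be in flight on the single shared bus.
        if self.busy:
            return False              # requester must wait and retry later
        self.busy = True              # acquire the bus
        print(f"{src} -> {dest}: {command}", data if data is not None else "")
        self.busy = False             # release after the transfer completes
        return True

bus = CommonBus()
bus.transfer("P1", "M0", "WRITE", 54)   # hypothetical processor/memory names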
2. Multiport Memory:
A multiport memory system employs separate buses between each memory module and each CPU as shown
below.
Disadvantages:
- Expensive memory control logic
- Large number of cables and connectors.
Because of these disadvantages, this interconnection structure is appropriate only for a small number of processors.
3. Crossbar Switch:
The crossbar switch organization shown below consists of a number of crosspoints that are placed at
intersections between processor buses and memory module paths.
- The switch point also resolves multiple requests for access to the same memory module on a predetermined
priority basis. The functional design of a crossbar switch connected to one memory module is illustrated
below.
A crossbar switch organization uses a separate path associated with each memory module to support
simultaneous transfers. However, the hardware required to implement the switch can become quite large and
complex.
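As a rough illustration of the crosspoint arbitration just described (a sketch with an invented function
name, assuming a fixed priority where a lower processor number wins), each memory module independently
grants one of the simultaneous requests per cycle:

def crossbar_grant(requests):
    """requests: list of (processor, module) pairs issued this cycle.
    Returns {module: processor} grants; ungranted requests retry next cycle."""
    grants = {}
    for proc, mod in sorted(requests):   # lower processor id = higher priority
        grants.setdefault(mod, proc)     # first (highest-priority) request wins
    return grants

# P0 and P2 contend for module 1; P0 wins on priority, P2 must retry.
print(crossbar_grant([(0, 1), (2, 1), (1, 3)]))   # -> {1: 0, 3: 1}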
4. Multistage Switching Network:
The 2 x 2 switch can be used as a building block to construct a multistage network that controls the
communication between a number of sources and destinations.
Let us consider the following binary tree:
- The two processors P1 and P2 are connected through switches to eight memory modules marked in binary
from 000 through 111.
- The path from source to a destination is determined from the binary bits of the destination number.
The first bit determines the switch output in the first level.
The second bit specifies the output of the switch in the second level.
The third bit specifies the output of the switch in the third level.
The above example illustrates that either P1 or P2 can be connected to any one of the eight memories.
But, certain request patterns cannot be satisfied simultaneously.
For example, if P1 is connected to one of the destinations 000 through 011, P2 can be connected to only one of
the destinations 100 through 111.
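The destination-bit routing rule can be sketched as follows (a minimal Python illustration with invented
names, assuming a three-level binary tree as in the figure: each destination bit, most significant first,
selects one of the two outputs of the switch at that level):

def tree_route(dest, levels=3):
    """Return the switch settings used to reach memory module `dest`."""
    path = []
    for level in range(levels):
        bit = (dest >> (levels - 1 - level)) & 1   # most significant bit first
        path.append(f"level {level + 1}: output {bit}")
    return path

for step in tree_route(0b101):    # route to memory module 101
    print(step)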
Many different topologies have been proposed for multistage switching networks to control
- Processor-memory communication in a tightly coupled multiprocessor system
- Communication between the processing elements in a loosely coupled system.
The omega switching network shown below is one of the popular topologies.
In a tightly coupled multiprocessor system, the source is a processor and the destination is a memory module.
- The first pass through the network sets up the path.
- Succeeding passes are used to transfer the address into the memory and then transfer the data in either
direction, based on read or write signals.
In a loosely coupled multiprocessor system, both the source and destination are processing elements.
- The first pass through the network sets up the path.
- Then the source transfers the message to the destination.
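For the omega network, the commonly used destination-tag routing can be sketched as below (an assumed
formulation, not taken from the source): between stages the line addresses are perfect-shuffled (a rotate-left),
and at each 2 x 2 switch the next destination bit, most significant first, selects the upper (0) or lower (1)
output.

def omega_route(src, dest, n_bits=3):
    """Trace the line addresses a message visits in an 8 x 8 omega network."""
    addr = src
    hops = [addr]
    for i in range(n_bits):
        bit = (dest >> (n_bits - 1 - i)) & 1
        # shuffle (rotate left), then replace the low bit with the tag bit
        addr = ((addr << 1) | bit) & ((1 << n_bits) - 1)
        hops.append(addr)
    return hops

print([format(h, "03b") for h in omega_route(0b010, 0b110)])
# -> ['010', '101', '011', '110']; the path always ends at the destination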
5. Hypercube Interconnection:
The hypercube or binary n-cube multiprocessor structure is a loosely coupled system composed of N = 2^n nodes
interconnected in an n-dimensional binary cube.
- Each node contains a CPU, local memory, and possibly an I/O interface.
- Each node has direct communication paths, called edges, to n other neighbor nodes.
- There are 2^n distinct n-bit binary addresses that can be assigned to the nodes.
- Each node address differs from that of each of its n neighbors by exactly one bit position.
For example, the three neighbors of the node with address 100 in a three-cube structure are 000, 110, and
101. Each of these binary numbers differs from address 100 by one bit value.
In general, an n-cube structure has 2^n nodes with a processor residing in each node.
In an n-cube structure, one to n links are required to route messages from a source node to a destination node.
For example, in a three-cube structure, node 000:
- Can communicate directly with node 001.
- Must cross at least two links to communicate with 011 (from 000 to 001 to 011 or from 000 to 010 to 011).
- Go through at least three links to communicate from node 000 to node 111.
A routing procedure is based on the result of exclusive-OR of the source node address with the destination node
address. The resulting binary value will have 1 bits corresponding to the axes on which the two nodes differ.
The message is then sent along any one of the axes.
For example, in a three-cube structure, a message from 010 to 001 produces an exclusive-OR result equal to
011. The message can be sent along the second axis to 000 and then through the third axis to 001.
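This exclusive-OR routing procedure can be sketched directly (the function name is invented; since the
message may be sent along any differing axis, the sketch may produce a different but equally valid path than
the one in the example):

def hypercube_route(src, dest, n=3):
    """Route a message in an n-cube, crossing one link (axis) per hop."""
    path = [src]
    node = src
    diff = src ^ dest                  # 1-bits mark the axes on which the nodes differ
    for axis in range(n):
        if diff & (1 << axis):
            node ^= (1 << axis)        # cross one link along this axis
            path.append(node)
    return path

# 010 -> 001: XOR = 011, so two links are needed.
print([format(p, "03b") for p in hypercube_route(0b010, 0b001)])
# -> ['010', '011', '001']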
Interprocessor Arbitration:
A number of buses at various levels are used to transfer the information between the components in computer
systems.
- Internal buses for transfer of information between processor registers and ALU.
- A Memory bus for transferring data, address and read/write information
- An I/O bus is used to transfer information to and from I/O devices
A system bus connects major components in a multiprocessor system, such as CPUs, IOPs and memory.
System Bus: The lines in a system bus are divided into three functional groups:
- Data
- Address
- Control
In addition, there are power distribution lines to supply the components.
Data Lines:
- Provides a path for the transfer of data between processors and common memory.
- Usually in multiples of 8
- Terminated with three-state buffers
- Bidirectional
Address Lines:
- Used to identify a memory address or any other source or destination (I/O ports)
- Determines the maximum possible memory capacity
In a synchronous bus,
- Each data item is transferred from the source to destination units during a pre-known time slice.
- Both source and destination are driven by a common clock.
- In the case of separate clocks, synchronization signals are transmitted periodically to keep all clocks in step
with each other.
In an asynchronous bus, the transfer of each data item is accomplished by using handshaking control signals.
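The handshake can be illustrated with a small Python sketch (the signal names ready and accept are assumed,
standing in for the actual handshaking control lines; threads stand in for the two units):

import threading

ready = threading.Event()    # source: "data is valid on the bus"
accept = threading.Event()   # destination: "data has been taken"
bus_data = []

def source(item):
    bus_data.append(item)    # place data on the bus
    ready.set()              # assert ready
    accept.wait()            # wait for the acknowledge
    ready.clear()            # complete the handshake

def destination():
    ready.wait()             # wait for valid data
    print("received:", bus_data.pop())
    accept.set()             # acknowledge the transfer

t = threading.Thread(target=destination)
t.start()
source(42)
t.join()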
The following figures list the 86 lines available in the IEEE standard 796 multibus.
Arbitration Procedures:
All bus requests from the processors are serviced by arbitration procedures based on priority. There are
three kinds of arbitration procedures:
- Serial
- Parallel
- Dynamic
If a higher-priority processor requests the bus, the lower-priority processor must complete its bus
operation before relinquishing control.
Polling Algorithm:
- The poll lines connected to all units are used by the bus controller to define an address for each device.
- The bus controller sequences through the addresses in a prescribed manner.
- A processor that requires access, recognizes its address, activates the bus busy line and accesses the bus.
- The polling process continues by choosing a different processor, after a number of bus cycles.
- This process is programmable and the selection priority can be altered under program control.
LRU Algorithm:
- The highest priority is given to the unit that has not used the bus for the longest interval.
- The priorities are adjusted after a number of bus cycles based on the usage.
- No unit is favored as the priorities are dynamically changed and every unit will get an opportunity.
FCFS Algorithm:
- All the requests are served in the order received.
- A queue is established by the bus controller to maintain the arrived bus requests.
- Each unit must wait for its turn to use the bus on a First-In, First-Out (FIFO) basis.
Rotating Daisy-Chain Algorithm:
- The priority-out of the last device is connected to the priority-in of the first device, forming a closed loop.
- Each arbiter’s priority for a given bus cycle is determined by its position relative to the arbiter currently
controlling the bus.
- The arbiter releasing the bus will have the lowest priority in the next cycle.
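Two of the dynamic arbitration policies above can be sketched as follows (a minimal illustration with
invented names; the last_used timestamps and the request queue stand in for the bus controller's bookkeeping):

from collections import deque

def lru_grant(requests, last_used):
    """LRU: grant the requester that has gone longest without the bus."""
    return min(requests, key=lambda u: last_used[u])

def fcfs_grant(queue):
    """FCFS: grant requests strictly in arrival (FIFO) order."""
    return queue.popleft()

# LRU example: unit 2 used the bus least recently, so it wins.
print(lru_grant({0, 1, 2}, last_used={0: 9, 1: 5, 2: 1}))   # -> 2

# FCFS example: the controller queues requests as they arrive.
q = deque([1, 0, 2])
print(fcfs_grant(q))   # -> 1 (first to arrive)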
To prevent conflicting use of shared resources by several processors there must be a provision in the operating
system for assigning resources to processors.
There are three organizations that have been used in the design of operating systems for multiprocessors:
- Master-slave configuration
- Separate operating system
- Distributed operating system.
Master-Slave Configuration:
- One processor, designated the master, always executes the operating system functions.
- The remaining processors, denoted as slaves, do not perform operating system functions, but request
service by interrupting the master.
Interprocessor Synchronization
The multiprocessor’s instruction set contains basic instructions to implement communication and
synchronization between cooperating processes.
- Communication refers to the exchange of data between different processes.
- Synchronization refers to the control information, needed to enforce the correct sequence of processes and
to ensure mutually exclusive access to shared writable data.
Multiprocessor systems include various mechanisms to deal with the synchronization of resources.
- Low-level primitives to enforce mutual exclusion are implemented directly by the hardware.
- A number of hardware mechanisms for mutual exclusion have been developed. One of the most popular
methods is through the use of a binary semaphore.
A critical section is a program sequence that, once begun, must complete execution before another processor
accesses the same shared resource.
A semaphore is:
- A binary variable used to indicate whether or not a processor is executing a critical section.
- A software-controlled flag, stored in a memory location accessible to all the processors.
- Equal to 1 while a processor is executing a critical program; the shared memory is then not available to
other processors.
- Equal to 0 when the shared memory is available to any requesting processor.
- Set to 1 when a processor enters a critical section, and cleared to 0 when it finishes.
Testing and setting the semaphore is also a critical operation and must be performed as a single indivisible
operation.
Otherwise, two or more processors may test the semaphore simultaneously and then each set it, entering the
critical section at the same time.
A semaphore’s test and set instruction should work in conjunction with a hardware lock mechanism.
A hardware lock:
- A processor-generated signal that, when active, prevents other processors from using the system bus.
- Activated during the execution of the test-and-set instruction.
- Prevents other processors from changing the semaphore between the time that the processor is testing it and
the time that it is setting it.
Hence, the instruction TSL SEM will be executed in two memory cycles (the first to read and the second to
write) as follows:
R ← M[SEM]    Test semaphore
M[SEM] ← 1    Set semaphore
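The test-and-set discipline can be emulated in Python as follows (a sketch, not the actual hardware: a
threading.Lock stands in for the hardware bus lock that makes the read-modify-write of SEM indivisible, and
threads stand in for processors):

import threading

_bus_lock = threading.Lock()   # stands in for the hardware lock signal
SEM = 0                        # shared semaphore word in memory

def test_and_set():
    """Indivisibly read the semaphore and set it to 1; return the old value."""
    global SEM
    with _bus_lock:            # no other "processor" may touch SEM here
        old = SEM              # R <- M[SEM]   (test)
        SEM = 1                # M[SEM] <- 1   (set)
    return old

def critical_section(pid):
    global SEM
    while test_and_set() == 1:     # busy-wait until the section is free
        pass
    print(f"processor {pid} in its critical section")
    SEM = 0                        # finished: clear the semaphore

threads = [threading.Thread(target=critical_section, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()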
Cache Coherence
In Uniprocessors:
- Usage of cache memory reduces the average access time.
- The main memory is not involved in the transfer, when a word is found in cache during a read operation.
- For write operation, there are two commonly used procedures to update memory:
Write-through policy: both cache and main memory are updated with every write operation.
Write-back policy: only the cache is updated, and the location is marked so that it can be copied later
into main memory.
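The two policies can be contrasted with a small sketch (the class name is invented; a dictionary stands in
for main memory):

class Cache:
    def __init__(self, memory, write_back=False):
        self.memory = memory          # dict: address -> value (main memory)
        self.lines = {}               # cached copies
        self.dirty = set()            # write-back: lines not yet in memory
        self.write_back = write_back

    def store(self, addr, value):
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)      # mark the line to be copied later
        else:
            self.memory[addr] = value # write-through: memory stays current

    def flush(self):
        for addr in self.dirty:       # write-back: copy dirty lines out
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

mem = {0x10: 54}
c = Cache(mem, write_back=True)
c.store(0x10, 120)
print(mem[0x10])   # still 54: memory is stale until the line is flushed
c.flush()
print(mem[0x10])   # 120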
In Multiprocessors:
- All the processors share a common memory.
- Each processor may have a local memory, part or all of which may be a cache, to reduce the average access time.
- The same information may reside as a number of copies in some caches and main memory.
- For the correct execution of memory operations, the multiple copies must be kept identical, leading to a cache
coherence problem.
A memory scheme is said to be coherent iff the value returned on a load instruction is always the value given by
the latest store instruction with the same address.
Caching cannot be used in bus-oriented multiprocessors with two or more processors, without having a proper
solution to the cache coherence problem.
Let us illustrate this by considering the three-processor configuration with private caches shown below:
In the above figure, during some operation, the value of an element X (say 54) is loaded from main memory
into the private caches of the three processors P1, P2, and P3.
Hence, there are consistent copies in the caches and main memory.
Now, one of the processors performs a store to X with a value of 120, making the copies of X in the caches
inconsistent. Hence, a load by the other processors will not return the latest value.
Depending on the memory update policy used in the cache, the main memory may also be inconsistent with
respect to the cache.
Using a write-through policy, consistency between memory and the originating cache is maintained as the new
value gets updated in main memory also. But the other two caches are inconsistent as they still hold the old
value, as shown below:
Using a write-back policy, the main memory is not updated at the time of the store. Hence, the copies in the
other two caches and main memory are inconsistent, as shown below:
DMA activity in conjunction with an IOP connected to the system bus may also cause consistency problems:
- In the case of input, the DMA may modify locations in main memory that also reside in cache without
updating the cache.
- In the case of output, DMA may read memory locations before they are updated from the cache when using a
write-back policy.
I/O-based memory incoherence can be overcome by making the IOP a participant in the cache coherence
solution adopted in the system.
The cache coherence problem can be solved by means of schemes that are:
- A combination of software and hardware
- Hardware-only.
Two software-based schemes are marking shared writable data as noncachable (the cachable/noncachable scheme)
and maintaining a centralized global table that records the status of memory blocks.
Hardware-only solutions are handled by the hardware automatically and have the advantage of higher speed and
program transparency.
Various schemes have been proposed to solve the cache coherence problem by means of a snoopy cache protocol.
The simplest method is to adopt a write-through policy by using the following procedure:
- All the snoopy controllers watch the bus for memory store operations.
- When a word in a cache is updated by writing into it, the corresponding location in main memory is also
updated, and the copy in every other cache is invalidated, as follows:
- The local snoopy controllers in all other caches check their memory to determine if they have a copy of the
word that has been overwritten.
- If a copy exists in a remote cache, that location is marked as invalid.
- During future processor accesses, the invalid item is treated as a cache miss, and the updated item is
transferred from main memory.
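The write-through, write-invalidate procedure above can be illustrated with a small simulation (a sketch
with invented class names, reusing the X = 54 / 120 example from earlier):

class SnoopyCache:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.lines = {}                       # address -> cached value
        bus.caches.append(self)

    def load(self, addr):
        if addr not in self.lines:            # miss (or invalidated copy)
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value         # write-through to main memory
        for cache in self.bus.caches:         # snoop: invalidate other copies
            if cache is not self:
                cache.lines.pop(addr, None)

class Bus:
    def __init__(self, memory):
        self.memory, self.caches = memory, []

bus = Bus({"X": 54})
p1, p2, p3 = (SnoopyCache(n, bus) for n in ("P1", "P2", "P3"))
for p in (p1, p2, p3):
    p.load("X")          # all three caches now hold X = 54
p1.store("X", 120)       # P1 writes; the copies in P2 and P3 are invalidated
print(p2.load("X"))      # treated as a miss -> fetched from memory -> 120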