
Multi Processors
Syllabus
Characteristics of Multiprocessors, Interconnection Structures, Interprocessor arbitration, Interprocessor
communication and synchronization, Cache Coherence
Characteristics of Multiprocessors:
The interconnection of two or more CPUs with memory and input-output equipment is called a multiprocessor
system.

The term "processor" here refers to


- A central processing unit (CPU) or
- An input-output processor (IOP).

However, a system with a single CPU and one or more IOPs is usually not classified as a multiprocessor system,
unless the IOP has computational facilities comparable to those of a CPU.
So, a multiprocessor system implies the existence of multiple CPUs, possibly with one or more IOPs.

A multicomputer system consists of several autonomous computers interconnected with each other by means of
communication lines.

Both multiprocessor and multicomputer systems support concurrent operations.


However, there are a few differences between them:
1. A multicomputer is easier and more cost-effective to construct than a multiprocessor.
2. Programming is easier in a multiprocessor system than in a multicomputer system.
3. A multiprocessor supports parallel computing, whereas a multicomputer supports distributed computing.

The reliability of the system is improved with multiprocessing because a failure or error in one processor has a
limited effect on the rest of the system: another processor can be assigned to perform the functions of the
disabled processor.
Hence, the system will continue to function correctly, though perhaps with some loss in efficiency.
High performance can be achieved using a multiprocessor organization as the computations can proceed in
parallel in one of the following two ways.
1. Multiple independent jobs can operate in parallel.
2. A single job can be partitioned into multiple parallel tasks.

1. Multiple independent jobs:


- The overall function can be partitioned into a number of independent tasks.
- Each task is assigned to a separate processor, sometimes to a special-purpose processor.
- Examples:
a. In a computer system
 One processor performs the computations for an industrial process control.
 Other processors monitor and control the various parameters, such as temperature and flow rate.


b. In a computer system
 One processor performs high-speed floating-point mathematical computations.
 Other processor(s) take(s) care of routine data-processing tasks.

2. Multiple Parallel Tasks:


Here a program is divided into parallel executable tasks and this can be achieved in one of two ways:
a. The user can explicitly declare certain tasks in the program for parallel execution. An operating system
with programming language constructs suitable for specifying parallel processing is required.
b. A compiler with multiprocessor software can automatically detect parallelism in a user's program by
checking for data dependency in the entire program using the following criteria:
 If one part of a program depends on data generated in another part, the part yielding the needed data
must be executed first.
 Two parts of a program that do not use data generated by each other can run concurrently.
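
As an illustration of these criteria, consider the following C sketch (an illustrative fragment, not the output
of any particular compiler): the first loop's iterations use no data generated by each other and could run
concurrently, while the second loop's iterations each consume a value produced by the previous one and must
run in order.

#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[N], c[N];

    /* Independent parts: b[i] depends only on a[i], so the iterations
       could be distributed across processors. */
    for (int i = 0; i < N; i++)
        b[i] = a[i] * 2;

    /* Dependent parts: c[i] uses c[i-1], data generated by the previous
       iteration, so the producing part must execute first. */
    c[0] = a[0];
    for (int i = 1; i < N; i++)
        c[i] = c[i - 1] + a[i];

    printf("b[%d] = %d, c[%d] = %d\n", N - 1, b[N - 1], N - 1, c[N - 1]);
    return 0;
}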

Based on the organization of memory, multiprocessors are classified into two types:
1. Shared-memory or tightly coupled multiprocessors
2. Distributed-memory or loosely coupled multiprocessors

1. Shared-memory or tightly coupled multiprocessor:


- Shares a common memory
- May have a cache memory for each processor
- Information is shared by placing it in the common global memory.

2. Distributed-memory or loosely coupled multiprocessor:


- Each processor has its own local memory
- A switching scheme is used to connect the processors
- Information is exchanged in the form of packets using a routing mechanism. A packet consists of an address,
the data content, and some error detection code.
- Packets can be addressed to a specific processor, a subset of processors, or all the processors.

Tightly coupled systems can tolerate a higher degree of interaction between tasks.
Loosely coupled systems are most efficient when interaction between tasks is minimal.

Interconnection Structures:
A multiprocessor system consists of the following components:
- CPUs,
- IOPs connected to input-output devices,
- A memory unit that may be partitioned into a number of separate modules

Depending on the number of transfer paths available between the processors and memory in a shared-memory
system, or among the processing elements in a distributed-memory system, the interconnection among the
components can have different physical configurations. They are:
1. Time-shared common bus
2. Multiport memory


3. Crossbar switch
4. Multistage switching network
5. Hypercube system
1. Time-Shared Common Bus:
In a common-bus multiprocessor system, a number of processors are connected through a common path to a
memory unit as shown below.

In this configuration:
- At any given time, only the processor that has control over the bus is allowed to communicate with the
memory or another processor.
- Any processor wishing to initiate a transfer:
 First determines the availability of the bus.
 Once the bus is available, places the address of the destination on the bus and issues a command to
indicate the operation to be performed.
 The destination recognizes its address on the bus and responds to the sender’s control signals
accordingly.

The system resolves conflicts by incorporating a bus controller that establishes priorities among the
requesting units.
Limitations:
- A single common-bus system is restricted to one transfer at a time. All other processors are either busy
with internal operations or must remain idle waiting for the bus.
- Hence, the total transfer rate within the system is limited by the speed of the single bus.

To overcome these limitations, the processors in the system can be kept busy by implementing two or more
independent buses to permit multiple simultaneous bus transfers. However, this increases the overall system
cost and complexity.


A more economical implementation of a dual bus structure is as shown below.

In the above configuration:


- A number of local buses are used, each connecting a local memory to one or more processors.
- Each local bus may be connected to a CPU, an IOP, or any combination of processors, along with local
memory. Part of the local memory may be a cache memory.
- A system bus controller links each local bus to a common system bus.
- The memory or IOP (not in the figure) connected to the common system bus is shared by all processors.
- At a given time, only one processor is allowed to communicate with the shared memory and other common
resources through the system bus.
- All other processors are kept busy communicating with their local memory and IO devices.

2. Multiport Memory:
A multiport memory system employs separate buses between each memory module and each CPU as shown
below.


In the above configuration:


- Each processor bus, consisting of address, data, and control lines, is connected to each memory module.
- Each memory module has four ports to accommodate the buses.
- The internal control logic of the memory module determines which port has access at any given
time.
- Conflicts are resolved by assigning fixed priorities to each memory port, and the processors are
connected accordingly.
Advantage:
A high transfer rate can be achieved using the multiple paths between processors and memory.

Disadvantages:
- Expensive memory control logic
- Large number of cables and connectors.

In view of the above, this interconnection structure is appropriate only for systems with a small number of
processors.

3. Crossbar Switch:
The crossbar switch organization shown below consists of a number of crosspoints that are placed at
intersections between processor buses and memory module paths.

In the above organization:


- The small square at each crosspoint is a switch that determines the path from a processor to a memory module.
- Each switch point has control logic to set up the transfer path between a processor and memory based on the
address placed on the bus.


- The switch point also resolves multiple requests for access to the same memory module on a predetermined
priority basis. The functional design of a crossbar switch connected to one memory module is illustrated
below.

The multiplexers and arbitration logic:

- Select the data, address, and control lines from one CPU for communication with the memory module.
- Establish priority levels, implemented with a priority encoder, to select one CPU when several CPUs
attempt to access the same memory module simultaneously.

A crossbar switch organization uses a separate path associated with each memory module to support
simultaneous transfers. However, the hardware required to implement the switch can become quite large and
complex.

4. Multistage Switching Network:


The basic component of a multistage network is a two-input, two-output interchange switch as shown below.

Each 2 x 2 switch has


- Two inputs labeled A and B, and two outputs, labeled 0 and 1.
- The control signals (not shown) to establish the interconnection between the input and output terminals.
- The capability to connect either input A or B to either of the outputs.
- The capability to arbitrate between conflicting requests.


The 2 x 2 switch can be used as a building block to construct a multistage network that controls the communication
between a number of sources and destinations.
Let us consider the following binary tree:

- The two processors P1 and P2 are connected through switches to eight memory modules marked in binary
from 000 through 111.
- The path from a source to a destination is determined from the binary bits of the destination number.
 The first bit determines the switch output in the first level.
 The second bit specifies the output of the switch in the second level
 The third bit specifies the output of the switch in the third level.

For example, to connect P1 to memory 101, a path is formed with:


- output 1 in the first-level switch,
- output 0 in the second-level switch
- output 1 in the third-level switch

The above example illustrates that either P1 or P2 can be connected to any one of the eight memories.
But, certain request patterns cannot be satisfied simultaneously.
For example, if P1 is connected to one of the destinations 000 through 011, P2 can be connected to only one of
the destinations 100 through 111.

Many different topologies have been proposed for multistage switching networks to control
- Processor-memory communication in a tightly coupled multiprocessor system
- Communication between the processing elements in a loosely coupled system.


The omega switching network shown below is one popular topology.

In the above configuration:


- There is exactly one path from each source to any particular destination.
- Some request patterns cannot be connected simultaneously. For example, any two sources cannot be
connected simultaneously to destinations 000 and 001.

A path through the network is established as follows:

- The source initiates the request by sending a 3-bit destination number into the switching network.
- Each level examines a different bit in the 3-bit number to determine the 2 x 2 switch setting.
 Level 1 inspects the most significant bit.
 Level 2 inspects the middle bit, and
 Level 3 inspects the least significant bit.
- At each 2 x 2 switch, the request is routed to the upper output if the examined bit is 0, or to the lower
output if the bit is 1.
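
This routing rule can be sketched in C as follows (a minimal illustration for the 3-level, 8-destination omega
network above; destination 101 is the example used earlier):

#include <stdio.h>

int main(void) {
    int dest = 5;                              /* destination 101 */
    for (int level = 1; level <= 3; level++) {
        int bit = (dest >> (3 - level)) & 1;   /* level 1 sees the MSB */
        printf("level %d: bit %d -> %s output\n",
               level, bit, bit ? "lower" : "upper");
    }
    return 0;
}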

In a tightly coupled multiprocessor system, the source is a processor and the destination is a memory module.
- The first pass through the network sets up the path.
- Succeeding passes are used to transfer the address into the memory and then transfer the data in either
direction based on Read or Write signals

In a loosely coupled multiprocessor system, both the source and destination are processing elements.
- The first pass through the network sets up the path.
- Then the source transfers the message to the destination.

5. Hypercube Interconnection:
The hypercube or binary n-cube multiprocessor structure is a loosely coupled system composed of N = 2^n nodes
interconnected in an n-dimensional binary cube.
- Each node can be a CPU, local memory, or an I/O interface.
- Each node has direct communication paths, called edges, to n other neighbor nodes.
- There are 2^n distinct n-bit binary addresses that can be assigned to the nodes.
- Each node address differs from that of each of its n neighbors by exactly one bit position.
For example, the three neighbors of the node with address 100 in a three-cube structure are 000, 110, and
101. Each of these binary numbers differs from address 100 in one bit position.


The hypercube structure for n = 1, 2, and 3 is as given below.

In general, an n-cube structure has 2^n nodes with a processor residing in each node.

In an n-cube structure, one to n links are required to route a message from a source node to a destination node.
For example, in a three-cube structure, node 000:
- Can communicate directly with node 001.
- Must cross at least two links to communicate with 011 (from 000 to 001 to 011, or from 000 to 010 to 011).
- Must go through at least three links to communicate with node 111.

A routing procedure is based on the exclusive-OR of the source node address with the destination node
address. The resulting binary value has 1 bits in the positions corresponding to the axes on which the two
nodes differ. The message is then sent along any one of those axes.
For example, in a three-cube structure, a message from 010 to 001 produces an exclusive-OR result equal to
011. The message can be sent along the second axis to 000 and then through the third axis to 001.
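
A minimal C sketch of this exclusive-OR routing rule for the three-cube follows (here the lowest differing axis
is chosen at each step, so the route taken may differ from the one in the example; both are valid):

#include <stdio.h>

int main(void) {
    int node = 2;                     /* source 010 */
    int dest = 1;                     /* destination 001 */

    while (node != dest) {
        int diff = node ^ dest;       /* 1 bits mark the differing axes */
        int axis = 0;
        while (!((diff >> axis) & 1)) /* pick the lowest differing axis */
            axis++;
        node ^= 1 << axis;            /* move one link along that axis */
        printf("now at node %d%d%d\n",
               (node >> 2) & 1, (node >> 1) & 1, node & 1);
    }
    return 0;
}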

Interprocessor Arbitration:
A number of buses at various levels are used to transfer information between the components of a computer
system:
- Internal buses for the transfer of information between processor registers and the ALU.
- A memory bus for transferring data, address, and read/write information.
- An I/O bus for transferring information to and from I/O devices.

A system bus connects major components in a multiprocessor system, such as CPUs, IOPs and memory.
System Bus: The lines in a system bus are divided into three functional groups:
- Data
- Address
- Control
In addition, there are power distribution lines that supply power to the components.
Data Lines:
- Provides a path for the transfer of data between processors and common memory.
- Usually provided in multiples of 8.
- Terminated with three-state buffers
- Bidirectional
Address Lines:
- Used to identify a memory address or any other source or destination (I/O ports)
- Determines the maximum possible memory capacity
- Terminated with three-state buffers
- Unidirectional, from processor to memory or I/O ports
Control Lines:
- Provide signals for controlling the information transfer between components.
- These include:
 Timing signals that indicate the validity of data and address information.
 Command signals that specify operations to be performed.
 Transfer Signals such as
 Memory read and write.
 Acknowledge of a transfer.
 Interrupt requests.
 Bus control signals – Bus request, Bus grant & Arbitration.

Data transfers over the system bus may be:


- Synchronous
- Asynchronous

In a synchronous bus,
- Each data item is transferred from the source to the destination unit during a time slice known in advance.
- Both source and destination units are driven by a common clock.
- In the case of separate clocks, synchronization signals are transmitted periodically to keep all clocks in step
with each other.

In an asynchronous bus, the transfer of each data item is accomplished by using handshaking control signals.

The following figures list the 86 lines available in the IEEE standard 796 multibus.

Arbitration Procedures:
All requests from the processors are serviced by arbitration procedures based on priorities. There are
three kinds of arbitration procedures:
- Serial
- Parallel
- Dynamic


Serial Arbitration Procedure:


- This is a hardware bus priority resolving technique.
- The units requesting the control of the system bus are connected in series as a daisy-chain connection.
- The processors/units are assigned priority based on their position along the priority control line.
The daisy-chain connection of four arbiters is shown below:

In the above figure:


- Each processor has its own bus arbiter logic with priority-in (PI) and priority-out (PO) lines.
- The PO of each arbiter is connected to the PI of the next-lower-priority arbiter.
- The PI of the highest-priority unit is maintained at logic 1.
- The PO output of a particular arbiter is 0 if its PI is 1 and it is requesting the system bus; otherwise the
PO output is 1.
- In this way the priority is passed down the chain.
- If PI is 0, then PO will also be 0.
- The device with PI = 1 and PO = 0 receives control of the system bus and activates the bus busy
line.

If a higher-priority processor requests the bus, the lower-priority processor must complete its current bus
operation before relinquishing control.
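
The daisy-chain priority logic can be summarized with the following C sketch (a behavioral illustration of one
arbitration cycle, not a hardware description):

#include <stdio.h>

#define UNITS 4

int main(void) {
    int request[UNITS] = {0, 1, 0, 1};  /* units 1 and 3 want the bus */
    int pi = 1;                         /* PI of the highest-priority arbiter */

    for (int i = 0; i < UNITS; i++) {
        int po = pi && !request[i];     /* PO = 1 only if PI = 1 and not requesting */
        if (pi && request[i])
            printf("unit %d (PI=1, PO=0) takes the bus\n", i);
        pi = po;                        /* PO feeds the next arbiter's PI */
    }
    return 0;
}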

Parallel Arbitration Procedure:


- This is a hardware bus priority resolving technique.
- It uses an external priority encoder and decoder.


In the above figure:


- Each bus arbiter has a bus request output line and a bus acknowledge input line.
- The request line is enabled when its corresponding processor requests access to the system bus.
- A processor takes control of the bus when its corresponding acknowledge input line is enabled.
- An orderly transfer of control is achieved with the help of the bus busy line.
- All four request lines are connected to a 4 x 2 priority encoder.
- The encoder generates a 2-bit code representing the highest-priority device among the requests.
- This 2-bit output drives a 2 x 4 decoder, which enables the appropriate acknowledge line.
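
A behavioral C sketch of this encoder/decoder pair follows (unit 0 is assumed to have the highest priority;
the bus busy handshake is omitted):

#include <stdio.h>

int main(void) {
    int request[4] = {0, 1, 1, 0};  /* units 1 and 2 request the bus */
    int ack[4] = {0, 0, 0, 0};
    int code = -1;

    /* 4 x 2 priority encoder: 2-bit code of the highest-priority request */
    for (int i = 0; i < 4; i++)
        if (request[i]) { code = i; break; }

    /* 2 x 4 decoder: enable the acknowledge line selected by the code */
    if (code >= 0) {
        ack[code] = 1;
        printf("code = %d%d, acknowledge line %d enabled\n",
               (code >> 1) & 1, code & 1, code);
    }
    return 0;
}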

Dynamic Arbitration Algorithms:


- Both serial and parallel arbitration procedures use a static priority algorithm, as the priority of each device is
fixed.
- A dynamic priority algorithm gives the system the capability to change the priority of the devices
while in operation.

A few arbitration procedures that use dynamic priority algorithms are:


 Time slice
 Polling
 Least Recently Used (LRU)
 First-Come First-Serve (FCFS)
 Rotating daisy-chain

Time slice Algorithm:


- A fixed-length time slice of bus time is allocated to each processor serially in round-robin fashion.
- The service provided to each processor is independent of its location along the bus.
- No preference is given to any processor as the same amount of time is allocated.

Polling Algorithm:
- The poll lines connected to all units are used by the bus controller to define an address for each device.
- The bus controller sequences through the addresses in a prescribed manner.
- A processor that requires access recognizes its address, activates the bus busy line, and accesses the bus.
- The polling process continues by choosing a different processor, after a number of bus cycles.
- This process is programmable and the selection priority can be altered under program control.

LRU Algorithm:
- The highest priority is given to the unit that has not used the bus for the longest interval.
- The priorities are adjusted after a number of bus cycles based on the usage.
- No unit is favored as the priorities are dynamically changed and every unit will get an opportunity.
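
A minimal C sketch of LRU arbitration follows (the per-unit timestamps and their initial values are
illustrative assumptions):

#include <stdio.h>

#define UNITS 4

int last_used[UNITS] = {5, 2, 7, 0};   /* larger = used more recently */

int lru_grant(const int request[]) {
    int winner = -1;
    for (int i = 0; i < UNITS; i++)
        if (request[i] && (winner < 0 || last_used[i] < last_used[winner]))
            winner = i;                /* least recently used requester */
    return winner;
}

int main(void) {
    int request[UNITS] = {1, 1, 0, 1};
    int winner = lru_grant(request);   /* unit 3: idle for the longest */
    printf("bus granted to unit %d\n", winner);
    last_used[winner] = 8;             /* becomes the most recently used */
    return 0;
}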

FCFS Algorithm:
- All the requests are served in the order received.
- A queue is established by the bus controller to maintain the arrived bus requests.
- Each unit must wait for its turn to use the bus on a First-In, First-Out (FIFO) basis.

Rotating daisy-chain Algorithm:


- It is a dynamic extension of the serial daisy-chain algorithm.
- There is no central bus controller.


- The priority-out of the last device is connected to the priority-in of the first device, forming a closed loop.
- Each arbiter's priority for a given bus cycle is determined by its position relative to the arbiter currently
controlling the bus.
- The arbiter releasing the bus gets the lowest priority.

Interprocessor Communication and Synchronization


Interprocessor Communication
In a multiprocessor system, a facility must be provided for communication among the various processors.
- A communication path can be established through common input-output channels.
- In a shared memory multiprocessor system,
 A portion of memory, acting as a message center (mailbox), is available for all the processors to access;
a sketch of this mechanism follows this list.
 Each processor can leave messages for other processors and pick up messages intended for it.
 The sending processor structures a request, a message, or a procedure and places it in the memory
mailbox, setting the appropriate status bits.
 The receiving processor checks the mailbox periodically to determine whether there are valid messages
for it.
 The response time of this procedure is poor.
 The procedure can be made more efficient by having the sending processor alert the receiving processor
by means of an interrupt signal.
 The interrupt tells the receiving processor that a new message has been inserted by the interrupting
processor.
- A communication path between two CPUs can be established through a link between two IOPs associated
with two different CPUs. This type of link allows each CPU to treat the other as an I/O device, so that
messages can be transferred through the I/O path.
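
A minimal C sketch of the mailbox mechanism described above (a single one-slot mailbox with a status flag;
the names and layout are illustrative assumptions, and the interrupt variant is only noted in a comment):

#include <stdio.h>
#include <string.h>

struct mailbox {
    int full;                  /* status bit: 1 = valid message waiting */
    char message[64];
};

struct mailbox box = {0, ""};  /* resides in the shared global memory */

void send(const char *msg) {
    strncpy(box.message, msg, sizeof box.message - 1);
    box.full = 1;              /* set the status bit; an interrupt could be
                                  raised here to alert the receiver */
}

void poll(void) {
    if (box.full) {            /* periodic check by the receiving processor */
        printf("received: %s\n", box.message);
        box.full = 0;          /* mailbox is free again */
    }
}

int main(void) {
    send("task 7 complete");
    poll();
    return 0;
}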

To prevent conflicting use of shared resources by several processors, there must be a provision in the operating
system for assigning resources to processors.
Three organizations have been used in the design of operating systems for multiprocessors:
- Master-slave configuration
- Separate operating system
- Distributed operating system.

Master-Slave Configuration:
- One processor, designated the master, always executes the operating system functions.
- The remaining processors, denoted as slaves, do not perform operating system functions but request
service by interrupting the master.

Separate operating system:


- Every processor may have its own copy of the entire operating system, as in loosely coupled systems.
- Each processor can execute the operating system routines it needs.

Distributed operating system:


- The operating system routines are distributed among the available processors.
- But, each particular operating system function is assigned to only one processor at a time.
- Also referred to as a floating operating system as the routines float from one processor to another and the
execution of the routines may be assigned to different processors at different times.


In a loosely coupled multiprocessor system,


- There is no shared memory for passing information.
- The memory is distributed among the processors.
- The communication between processors is by message passing through IO channels.
- The communication is initiated by the sender, calling a procedure that resides in the memory of the
destination to establish a channel.
- A message is then sent with a header and various data objects used to communicate between nodes.
- In the case of availability of multiple paths, the operating system in each node contains routing information.
- The communication efficiency of the interprocessor network depends on:
 Communication routing protocol.
 Processor speed.
 Data link speed.
 Topology of the network.

Interprocessor Synchronization
The multiprocessor’s instruction set contains basic instructions to implement communication and
synchronization between cooperating processes.
- Communication refers to the exchange of data between different processes.
- Synchronization refers to the control information needed to enforce the correct sequence of processes and
to ensure mutually exclusive access to shared writable data.

Multiprocessor systems include various mechanisms to deal with the synchronization of resources.
- Low-level primitives to enforce mutual exclusion are implemented directly by the hardware.
- A number of hardware mechanisms for mutual exclusion have been developed. One of the most popular
methods is through the use of a binary semaphore.

Mutual Exclusion with a Semaphore:


Mutual exclusion is a mechanism that:
- Guarantees orderly access to shared memory and other shared resources.
- Protects data from being changed simultaneously by two or more processors.
- Enables one processor to exclude, or lock out, access to a shared resource by other processors while it is in a
critical section.

A critical section is a program sequence that, once begun, must complete execution before another processor
accesses the same shared resource.
A semaphore is:
- A binary variable used to indicate whether a processor is executing a critical section.
- A software-controlled flag, stored in a memory location accessible to all the processors.
- Equal to 1 while a processor is executing a critical program, during which the shared memory is not available
to other processors.
- Equal to 0 while the shared memory is available to any requesting processor.
- Set to 1 by a processor entering a critical section and cleared to 0 when it finishes.

Testing and setting the semaphore is itself a critical operation and must be performed as a single indivisible
operation.
Otherwise, two or more processors may test the semaphore simultaneously and then each set it, entering a
critical section at the same time and resulting in erroneous initialization.


A semaphore's test-and-set instruction should work in conjunction with a hardware lock mechanism.

A hardware lock:
- Is a processor-generated signal that, when active, prevents other processors from using the system bus.
- Is activated during execution of the test-and-set instruction.
- Prevents other processors from changing the semaphore between the time the processor tests it and
the time it sets it.

Let us assume that:

- The semaphore is the least significant bit of a memory word, symbolized by SEM.
- The mnemonic TSL designates the "test and set while locked" operation.

Hence, the instruction TSL SEM will be executed in two memory cycles (the first to read and the second to
write) as follows:
R ← M[SEM]    Test semaphore
M[SEM] ← 1    Set semaphore
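
The following C sketch mimics TSL using C11 atomics in place of the hardware lock: atomic_flag_test_and_set
reads the old value and writes 1 as one indivisible operation, corresponding to the two locked memory cycles
above.

#include <stdatomic.h>
#include <stdio.h>

atomic_flag sem = ATOMIC_FLAG_INIT;    /* the semaphore SEM, initially 0 */

void critical_section(int id) {
    /* TSL SEM: R <- M[SEM]; M[SEM] <- 1, performed indivisibly.
       Spin while R = 1, i.e., while another processor is inside. */
    while (atomic_flag_test_and_set(&sem))
        ;                              /* busy-wait */

    printf("processor %d in critical section\n", id);

    atomic_flag_clear(&sem);           /* M[SEM] <- 0 on leaving */
}

int main(void) {
    critical_section(1);
    critical_section(2);
    return 0;
}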

Cache Coherence
In Uniprocessors:
- The use of cache memory reduces the average access time.
- The main memory is not involved in the transfer when a word is found in the cache during a read operation.
- For write operations, there are two commonly used procedures to update memory:
 Write-through policy: both cache and main memory are updated with every write operation.
 Write-back policy: only the cache is updated, and the location is marked so that it can be copied later into
main memory.
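
The two policies can be contrasted with a short C sketch (a single cached word; the one-word "cache" and
"memory" variables are illustrative):

#include <stdio.h>

int memory = 54;   /* main-memory copy of a word X */
int cache = 54;    /* cached copy of X */
int dirty = 0;     /* write-back marker: cache newer than memory */

void write_through(int value) {
    cache = value;
    memory = value;          /* memory updated on every write */
}

void write_back(int value) {
    cache = value;
    dirty = 1;               /* memory updated later, on copy-back */
}

void copy_back(void) {
    if (dirty) { memory = cache; dirty = 0; }
}

int main(void) {
    write_back(120);
    printf("after write-back: cache = %d, memory = %d\n", cache, memory);
    copy_back();
    printf("after copy-back:  cache = %d, memory = %d\n", cache, memory);
    return 0;
}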

In Multiprocessors:
- All the processors share a common memory.
- Each processor may have a local memory, part or all of which may be a cache, to reduce the average access time.
- The same information may reside as a number of copies in some caches and in main memory.
- For the correct execution of memory operations, the multiple copies must be kept identical; ensuring this is the
cache coherence problem.

A memory scheme is said to be coherent if and only if the value returned on a load instruction is always the value
given by the latest store instruction with the same address.
Without a proper solution to the cache coherence problem, caching cannot be used in bus-oriented
multiprocessors with two or more processors.

Conditions for Incoherence


In multiprocessors with private caches, cache coherence problems exist because of the need to share writable
data.
Read-only data can safely be replicated without cache coherence enforcement mechanisms.


Let us illustrate this by considering the three-processor configuration with private caches shown below:

In the above figure, suppose that during some operation the value of an element X (say 54) is loaded from main
memory into the three processors P1, P2, and P3, i.e., copied into their private caches.
Hence, there are consistent copies in the caches and main memory.

Now, suppose one of the processors performs a store to X with a value of 120, making the copies of X in the
caches inconsistent. A load by either of the other processors will then not return the latest value.
Depending on the memory update policy used in the cache, the main memory may also be inconsistent with
respect to the cache.

Using a write-through policy, consistency between memory and the originating cache is maintained, as the new
value is also written to main memory. But the other two caches are inconsistent, since they still hold the old
value, as shown below:

Using a write-back policy, the main memory is not updated at the time of the store. Hence, the copies in the
other two caches and main memory are inconsistent, as shown below:


DMA activity in conjunction with an IOP connected to the system bus may also cause consistency problems:
- In the case of input, the DMA may modify locations in main memory that also reside in a cache, without
updating the cache.
- In the case of output, the DMA may read memory locations before they have been updated from the cache
when a write-back policy is used.
IO-based memory incoherence can be overcome by making the IOP a participant in the cache coherence
solution adopted in the system.

Solutions to the Cache Coherence Problem


Various schemes have been proposed to solve the cache coherence problem in shared memory multiprocessors.
1. Simple scheme:
- Disallows private caches for each processor and uses a shared cache memory associated with main memory.
- Every data access is made to the shared cache.
- However, this violates the principle of keeping the cache close to the CPU, thereby increasing the average
memory access time.

2. Cachable and Noncachable Scheme:


- It is desirable to attach a private cache to each processor to improve performance.
- Only nonshared and read-only data, referred to as cachable, are allowed to be stored in the caches.
- Shared writable data are noncachable and are available only in main memory.
- The compiler must tag data as either cachable or noncachable.
- Performance may degrade because of the extra software overhead.

3. Centralized Global Table Scheme:


- Allows writable data to exist in at least one cache.
- Each block is identified as read-only (RO) or read-and-write (RW).
- The status of memory blocks is stored in a central global table.
- All caches can have copies of blocks identified as RO, but only one cache can have a copy of an RW block.
- An update to the cache holding an RW block therefore cannot affect the other caches.

The cache coherence problem can be solved by means of schemes that are:
- A combination of software and hardware
- Hardware-only.

Two of these schemes, cachable/noncachable and the centralized global table, use software-based procedures.

Hardware-only solutions are handled by the hardware automatically and have the advantage of higher speed and
program transparency.

In the hardware solution:


- The cache controller monitors all bus requests from CPUs and IOPs.
- All caches attached to the bus constantly monitor the network for possible write operations.
- Depending on the method, caches either update or invalidate their own copies when a match is found.
- A bus controller that monitors this action is referred to as a snoopy cache controller.


Various schemes have been proposed to solve the cache coherence problem by means of a snoopy cache protocol.
The simplest method is to adopt a write-through policy using the following procedure:
- All the snoopy controllers watch the bus for memory store operations.
- When a word in a cache is updated by a write, the corresponding location in main memory is also updated,
and the word is removed from all other caches.
- The local snoopy controllers in all other caches check their memory to determine whether they have a copy of
the word that has been overwritten.
- If a copy exists in a remote cache, that location is marked invalid.
- On future processor accesses, the invalid item is treated as a cache miss, and the updated item is transferred
from main memory.
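
A behavioral C sketch of this write-through invalidation procedure for one shared word X in three private
caches follows (the data structures are illustrative, not a hardware description):

#include <stdio.h>

#define CPUS 3

int memory = 54;                     /* main-memory copy of X */
int cache[CPUS] = {54, 54, 54};      /* private cached copies of X */
int valid[CPUS] = {1, 1, 1};

void snoopy_store(int writer, int value) {
    cache[writer] = value;
    memory = value;                  /* write-through to main memory */
    for (int i = 0; i < CPUS; i++)   /* remote snoopy controllers ... */
        if (i != writer)
            valid[i] = 0;            /* ... mark their copies invalid */
}

int load(int cpu) {
    if (!valid[cpu]) {               /* invalid entry: treat as a miss */
        cache[cpu] = memory;
        valid[cpu] = 1;
    }
    return cache[cpu];
}

int main(void) {
    snoopy_store(0, 120);                     /* P1 stores X = 120 */
    printf("P2 loads X = %d\n", load(1));     /* miss, refetched: 120 */
    return 0;
}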
