Notes Multiprocessor
● The interstage connection pattern is the perfect shuffle over 8 objects. The perfect
shuffle is obtained by shifting the binary address of each object 1 bit to the left and
wrapping the most significant bit around to the least significant position.
● The Omega network can also be used to broadcast data from one source to many
destinations, as exemplified in Fig. a, using the upper broadcast or lower broadcast
switch settings.
● In Fig. a, the message at input 001 is being broadcast to all eight outputs through a
binary tree connection.
● The two-way shuffle interstage connections can be replaced by four-way shuffle
interstage connections when 4 x 4 switch boxes are used as building blocks, as
exemplified in Fig. 7.9b for a 16-input Omega network with log_4 16 = 2 stages.
● Note that a four-way shuffle corresponds to dividing the 16 inputs into four equal subsets
and then shuffling them evenly among the four subsets.
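Both shuffle patterns amount to rotating the address of each terminal. A minimal sketch in
Python, assuming the usual address-rotation formulation; the function names are ours, not
from the text:

def perfect_shuffle(index, n_bits):
    # Cyclic left shift of an n-bit address: the MSB wraps around to the LSB position.
    msb = (index >> (n_bits - 1)) & 1
    return ((index << 1) | msb) & ((1 << n_bits) - 1)

def four_way_shuffle(index, n_digits):
    # Same idea with base-4 digits (two bits at a time), for 4 x 4 switch boxes.
    top = (index >> (2 * (n_digits - 1))) & 0b11
    return ((index << 2) | top) & ((1 << (2 * n_digits)) - 1)

# Perfect shuffle over 8 objects: 0..7 -> 0, 2, 4, 6, 1, 3, 5, 7
print([perfect_shuffle(i, 3) for i in range(8)])
# Four-way shuffle of 16 inputs: the subsets {0..3}, {4..7}, {8..11}, {12..15} are interleaved
print([four_way_shuffle(i, 2) for i in range(16)])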
● The write-update protocol (Fig. c) requires that the new block content X’ be broadcast to
all cache copies via the bus. The memory copy is also updated if write-through caches
are used. With write-back caches, the memory copy is updated later, at block
replacement time.
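A minimal sketch of the write-update behaviour described above, assuming a simple
dictionary model of caches and memory (the function name and arguments are illustrative):

def write_update(caches, memory, block, new_value, write_through=True):
    # Broadcast the new block content X' to every cache that holds a copy.
    for cache in caches:
        if block in cache:
            cache[block] = new_value
    if write_through:
        # Write-through caches: the memory copy is updated on every write.
        memory[block] = new_value
    # With write-back caches the memory copy is refreshed only later,
    # when the owning block is replaced.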
Write-Through Caches
● The states of a cache block copy change with respect to read, write, and replacement
operations in the cache.
● Figure 7.15 shows the state transitions for two basic write-invalidate Snoopy protocols
developed for write-through and write-back caches, respectively.
● A block copy of a write-through cache i attached to processor i can assume one of two
possible cache states: valid or invalid (Fig. 7.15a).
● A remote processor is denoted j, where j ≠ i. For each of the two cache states, six
possible events may take place. Note that all cache copies of the same block use the
same transition graph in making state changes.
● In a valid state (Fig. 7.15a), all processors can read (R(i),R(j)) safely. Local processor i
can also write (W(i)) safely in a valid state.
● The invalid state corresponds to the case of the block either being invalidated or being
replaced (Z(i) or Z(j)).
● Whenever a remote processor writes (W(j)) into its cache copy, all other cache copies
become invalidated.
● The cache block in cache i becomes valid whenever a successful read (R(i)) or write
(W(i)) is carried out by a local processor i.
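The two-state graph of Fig. 7.15a can be written compactly as a transition table. A minimal
sketch, assuming remote replacements Z(j) leave a valid local copy untouched (the dictionary
encoding is ours; the events are those named above):

VALID, INVALID = "valid", "invalid"

transitions = {
    # Local reads and writes always leave (or put) the copy in the valid state.
    (VALID, "R(i)"): VALID,   (VALID, "W(i)"): VALID,
    (INVALID, "R(i)"): VALID, (INVALID, "W(i)"): VALID,
    # Remote reads and remote replacements do not disturb a valid copy.
    (VALID, "R(j)"): VALID,   (VALID, "Z(j)"): VALID,
    # A remote write or a local replacement invalidates the copy.
    (VALID, "W(j)"): INVALID, (VALID, "Z(i)"): INVALID,
    # Once invalid, only a local access brings the copy back.
    (INVALID, "R(j)"): INVALID, (INVALID, "W(j)"): INVALID,
    (INVALID, "Z(i)"): INVALID, (INVALID, "Z(j)"): INVALID,
}

state = INVALID
for event in ["R(i)", "W(j)", "W(i)", "Z(i)"]:
    state = transitions[(state, event)]
    print(event, "->", state)   # valid, invalid, valid, invalid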
Write-Back Caches
● The valid state of a write-back cache can be further split into two cache states, labeled
RW (read-write) and RO (read-only) as shown in Fig. 7.15b.
● The INV (invalidated or not-in-cache) cache state is equivalent to the invalid state
mentioned before.
● This three-state coherence scheme corresponds to an ownership protocol.
● When the memory owns a block, caches can contain only the RO copies of the block. In
other words, multiple copies may exist in the RO state and every processor having a
copy (called a keeper of the copy) can read (R(i), R(j)) the copy safely.
● The INV state is entered whenever a remote processor writes (W(j)) its local copy or the
local processor replaces (Z(i)) its own block copy.
● The RW state corresponds to only one cache copy existing in the entire system owned
by the local processor i. Read (R(i)) and write (W(i)) can be safely performed in the RW
state.
● From either the RO state or the INV state, the cache block becomes uniquely owned
when a local write (W(i)) takes place.
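The three-state graph can likewise be written as a table. A minimal sketch, assuming the
usual MSI-style edges where a remote read R(j) downgrades the single RW copy to RO and
remote replacements Z(j) leave the local copy untouched; consult Fig. 7.15b for the exact arcs:

RW, RO, INV = "RW", "RO", "INV"

transitions = {
    # RW: a single, exclusively owned copy; local reads and writes stay here.
    (RW, "R(i)"): RW, (RW, "W(i)"): RW,
    (RW, "R(j)"): RO,                       # assumption: downgrade on a remote read
    (RW, "W(j)"): INV, (RW, "Z(i)"): INV, (RW, "Z(j)"): RW,
    # RO: possibly many read-only copies, with memory owning the block.
    (RO, "R(i)"): RO, (RO, "R(j)"): RO, (RO, "Z(j)"): RO,
    (RO, "W(i)"): RW,                       # a local write claims unique ownership
    (RO, "W(j)"): INV, (RO, "Z(i)"): INV,
    # INV: only a local access brings the block back into the cache.
    (INV, "R(i)"): RO, (INV, "W(i)"): RW,
    (INV, "R(j)"): INV, (INV, "W(j)"): INV, (INV, "Z(i)"): INV, (INV, "Z(j)"): INV,
}

state = INV
for event in ["R(i)", "W(i)", "R(j)", "W(j)"]:
    state = transitions[(state, event)]
    print(event, "->", state)   # RO, RW, RO, INV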
Write-Once Protocol
● James Goodman (1983) proposed a cache coherence protocol for bus-based
multiprocessors.
● This scheme combines the advantages of both write-through and write-back
invalidations.
● In order to reduce bus traffic, the very first write of a cache block uses a write-through
policy.
● This will result in a consistent memory copy while all other cache copies are invalidated.
● After the first write, shared memory is updated using a write-back policy. This scheme
can be described by the four-state transition graph shown in Fig. 7.16.
● The four cache states are defined below:
1. Valid: The cache block, which is consistent with the memory copy, has been read
from shared memory and has not been modified.
2. Invalid: The block is not found in the cache or is inconsistent with the memory
copy.
3. Reserved: Data has been written exactly once since being read from shared
memory. The cache copy is consistent with the memory copy, which is the only
other copy.
4. Dirty: The cache block has been modified (written) more than once, and the cache
copy is the only one in the system (thus inconsistent with all other copies).
● The solid lines in Fig. 7.16 correspond to access commands issued by a local processor
labeled read-miss, write-hit, and write-miss.
● Whenever a read-miss occurs, the valid state is entered.
● The first write-hit leads to the reserved state.
● The second write-hit leads to the dirty state, and all future write-hits stay in the dirty
state.
● Whenever a write-miss occurs, the cache block enters the dirty state.
● The dashed lines correspond to invalidation commands issued by remote processors via
the snoopy bus.
● The read-invalidate command reads a block and invalidates all other copies.
● The write-invalidate command invalidates all other copies of a block. The bus-read
command corresponds to a normal memory read by a remote processor via the bus.
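A minimal sketch of the transitions just listed: the solid-line (local) commands plus the
dashed-line remote invalidations. The states and commands come from the text; the function
encoding is ours, and the remote bus-read edge is omitted since only the invalidation
commands are spelled out above.

def write_once_next_state(state, command):
    # Solid-line (local) commands from Fig. 7.16.
    if command == "read-miss":
        return "valid"
    if command == "write-miss":
        return "dirty"
    if command == "write-hit":
        # The first write-hit (from valid) reserves the block; later ones keep it dirty.
        return "reserved" if state == "valid" else "dirty"
    if command == "read-hit":
        return state
    # Dashed-line (remote) invalidation commands.
    if command in ("read-invalidate", "write-invalidate"):
        return "invalid"
    raise ValueError(command)

print(write_once_next_state("valid", "write-hit"))     # reserved (written through)
print(write_once_next_state("reserved", "write-hit"))  # dirty (write-back from now on)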
Cache Events and Actions
The memory-access and invalidation commands trigger the following events and actions:
● Read-miss: When a processor wants to read a block that is not in the cache, a
read-miss occurs. A bus-read operation will be initiated. If no dirty copy exists, then main
memory has a consistent copy and supplies a copy to the requesting cache. If a dirty
copy does exist in a remote cache, that cache will inhibit the main memory and send a
copy to the requesting cache. In all cases, the cache copy will enter the valid state after
a read-miss.
● Write-hit: If the copy is in the dirty or reserved state, the write can be carried out locally
and the new state is dirty. If the copy is in the valid state, a write-invalidate command is
broadcast to all caches, invalidating their copies. The shared memory is written through,
and the resulting state is reserved after this first write.
● Write-miss: When a processor fails to write in a local cache, the copy must come either
from the main memory or from a remote cache with a dirty block. This is accomplished
by sending a read-invalidate command which will invalidate all cache copies. The local
copy is thus updated and ends up in a dirty state.
● Read-hit: Read-hits can always be performed in a local cache without causing a state
transition or using the snoopy bus for invalidation.
● Block Replacement: If a copy is dirty, it has to be written back to main memory when the
block is replaced. If the copy is clean (i.e., in the valid, reserved, or invalid state), it can
simply be discarded; no write-back is needed.
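Putting the five events together, here is a minimal sketch of one cache controller for the
write-once protocol. The bus object and its method names (bus_read, write_invalidate,
read_invalidate, write_back) are our stand-ins for the bus transactions named in the text:

class WriteOnceCache:
    def __init__(self, bus):
        self.bus = bus
        self.blocks = {}                      # block -> state

    def read(self, block):
        if self.blocks.get(block, "invalid") == "invalid":
            self.bus.bus_read(block)          # read-miss: a dirty remote copy (if any)
            self.blocks[block] = "valid"      # supplies the data; we end up valid
        # read-hit: no state change and no bus traffic

    def write(self, block):
        state = self.blocks.get(block, "invalid")
        if state in ("dirty", "reserved"):    # write-hit on an owned copy
            self.blocks[block] = "dirty"
        elif state == "valid":                # first write-hit: write through and
            self.bus.write_invalidate(block)  # invalidate all other copies
            self.blocks[block] = "reserved"
        else:                                 # write-miss: fetch with invalidation
            self.bus.read_invalidate(block)
            self.blocks[block] = "dirty"

    def replace(self, block):
        if self.blocks.get(block) == "dirty": # only a dirty copy is written back
            self.bus.write_back(block)
        self.blocks.pop(block, None)

class BusStub:
    def __getattr__(self, name):              # print each bus transaction
        return lambda block: print(name, block)

cache = WriteOnceCache(BusStub())
cache.read("X")       # read-miss: bus_read, ends valid
cache.write("X")      # first write-hit: write_invalidate, ends reserved
cache.write("X")      # second write-hit: ends dirty, no bus traffic
cache.replace("X")    # dirty copy is written back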
3.4.3 Directory-Based Protocols
When a multistage or packet switched network is used to build a large multiprocessor
with hundreds of processors, the snoopy cache protocols must be modified to suit the network
capabilities. Since broadcasting is expensive to perform in such a network, consistency
commands will be sent only to those caches that keep a copy of the block. This leads to
directory based protocols for network-connected multiprocessors.
Directory Structures
● In a multistage or packet switched network, cache coherence is supported by using
cache directories to store information on where copies of cache blocks reside.
● The first directory scheme used a central directory containing duplicates of all cache
directories. This central directory, which provides all the information needed to enforce
consistency, is usually very large and must be searched associatively, like the individual
cache directories. Contention and long search times are two drawbacks of using a
central directory in a large multiprocessor.
● In a distributed-directory scheme each memory module maintains a separate directory
which records the state and presence information for each memory block.
● A cache-coherence protocol that does not use broadcasts must store the locations of all
cached copies of each block of shared data. This list of cached locations, whether
centralized or distributed, is called a cache directory.
● A directory entry for each block of data contains a number of pointers to specify the
locations of copies of the block.
● Each directory entry also contains a dirty bit to specify whether a particular cache has
permission to write the associated block of data.
● Directory protocols fall into three primary categories: full-map directories, limited
directories, and chained directories.
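A minimal sketch of a directory entry as just described: a set of pointers to the caches that
hold a copy, plus a dirty bit recording whether one cache has write permission. The class is
illustrative, not a specific machine's entry format.

class DirectoryEntry:
    def __init__(self):
        self.pointers = set()   # caches (processors) holding a copy of the block
        self.dirty = False      # True => exactly one pointer, and that cache may write

    def add_sharer(self, cache_id):
        self.pointers.add(cache_id)

    def grant_write(self, cache_id):
        self.pointers = {cache_id}   # a single owner remains once the dirty bit is set
        self.dirty = True

entry = DirectoryEntry()
for cache in ("C1", "C2", "C3"):
    entry.add_sharer(cache)     # second state in the text: three clean sharers
entry.grant_write("C3")         # final state: dirty bit set, single pointer to C3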
Full-Map Directories
● Full-map directories store enough data associated with each block in global memory so
that every cache in the system can simultaneously store a copy of any block of data.
That is, each directory entry contains N pointers, where N is the number of processors in
the system.
● The full-map protocol implements directory entries with one bit per processor and a dirty
bit. Each bit represents the status of the block in the corresponding processor's cache
(present or absent).
● If the dirty bit is set, then one and only one processor's bit is set and that processor can
write into the block.
● In the first state, location X is missing from all of the caches in the system.
● The second state results from three caches (Cl, C2, and C3) requesting copies of
location X.
● Three pointers (processor bits) are set in the entry to indicate the caches that have
copies of the block of data.
● In the first two states, the dirty bit on the left side of the directory entry is set to clean (C),
indicating that no processor has permission to write to the block of data.
● The third state results from cache C3 requesting write permission for the block.
● In the final state, the dirty bit is set to dirty (D), and there is a single pointer to the block
of data in cache C3.
Let us examine the transition from the second state to the third state in more detail. Once
processor P3 issues the write to cache C3, the following events will take place:
1. Cache C3 detects that the block containing location X is valid but that the processor
does not have permission to write to the block, as indicated by the block’s
write-permission bit in the cache.
2. Cache C3 issues a write request to the memory module containing location X and stalls
processor P3.
3. The memory module issues invalidate requests to caches C1 and C2.
4. Caches C1 and C2 receive the invalidate requests, set the appropriate bit to indicate that
the block containing location X is invalid and send acknowledgements back to the
memory module.
5. The memory module receives the acknowledgements, sets the dirty bit, clears the
pointers to caches C1 and C2, and sends write permission to cache C3.
6. Cache C3 receives the write permission message, updates the state in the cache, and
reactivates processor P3.
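The six steps above, seen from the memory module's side, can be sketched as follows. The
entry object is assumed to have the pointers/dirty fields of the earlier DirectoryEntry sketch,
and send_invalidate, wait_for_ack and grant_write are hypothetical stand-ins for the network
messages:

def handle_write_request(entry, requester, send_invalidate, wait_for_ack, grant_write):
    others = [c for c in entry.pointers if c != requester]
    for cache in others:
        send_invalidate(cache)      # step 3: invalidate the other sharers
    for cache in others:
        wait_for_ack(cache)         # steps 4-5: collect their acknowledgements
    entry.pointers = {requester}    # step 5: clear the other pointers
    entry.dirty = True              #         and set the dirty bit
    grant_write(requester)          # steps 5-6: send write permission to the requester

class Entry:                        # same shape as the DirectoryEntry sketch above
    def __init__(self, sharers):
        self.pointers = set(sharers)
        self.dirty = False

entry = Entry({"C1", "C2", "C3"})
handle_write_request(entry, "C3",
                     send_invalidate=lambda c: print("invalidate", c),
                     wait_for_ack=lambda c: print("ack from", c),
                     grant_write=lambda c: print("write permission granted to", c))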
Limited Directories
● Limited directory protocols are designed to solve the directory size problem.
● A directory protocol can be classified as Dir_i X.
● The symbol i stands for the number of pointers, and X is NB for a scheme with no
broadcast.
● A full-map scheme without broadcast is represented as Dir_N NB.
● A limited directory protocol that uses i < N pointers is denoted Dir_i NB.
● Figure b shows the situation when three caches request read copies in a memory
system with a Dir_2 NB protocol.
● In this case, we can view the two-pointer directory as a two-way set-associative cache
of pointers to shared copies.
● When cache C3 requests a copy of location X, the memory module must invalidate the
copy in either cache C1 or cache C2. This process of pointer replacement is called
eviction.
● Since the directory acts as a set-associative cache, it must have a pointer replacement
policy.
● Dir_i B protocols allow more than i copies of each block of data to exist, but they resort to
a broadcast mechanism when more than i cached copies of a block need to be
invalidated.
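A minimal sketch of a Dir_i NB entry with eviction. The text only says that a pointer
replacement policy is needed; the random choice of victim here is our own placeholder:

import random

class LimitedDirectoryEntry:
    def __init__(self, i):
        self.i = i                 # number of pointer slots (i < N)
        self.pointers = []         # caches currently recorded as sharers

    def add_sharer(self, cache_id, invalidate):
        if cache_id in self.pointers:
            return
        if len(self.pointers) == self.i:           # no free pointer: evict one sharer
            victim = random.choice(self.pointers)
            self.pointers.remove(victim)
            invalidate(victim)                     # invalidate the evicted cache's copy
        self.pointers.append(cache_id)

# e.g. Dir_2 NB: C1 and C2 share X; when C3 asks for a copy, C1 or C2 is evicted
entry = LimitedDirectoryEntry(2)
for cache in ("C1", "C2", "C3"):
    entry.add_sharer(cache, invalidate=lambda victim: print("invalidate", victim))
print(entry.pointers)    # two of the three caches remain as recorded sharers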
Chained Directories
● Chained directories realize the scalability of limited directories without restricting the
number of shared copies of data blocks.
● This type of cache coherence scheme is called a chained scheme because it keeps
track of shared copies of data by maintaining a chain of directory pointers.
● The simpler of the two schemes implements a singly linked chain, which is best
described by example (Fig.c).
● Suppose there are no shared copies of location X. If processor P1 reads location X, the
memory sends a copy to cache C1, along with a chain termination (CT) pointer. The
memory also keeps a pointer to cache C1.
● Subsequently, when processor P2 reads location X, the memory sends a copy to cache
C2, along with the pointer to cache C1. The memory then keeps a pointer to cache C2.
● By repeating the above step, all of the caches can hold a copy of location X.
● If processor P3 writes to location X, it is necessary to send a data invalidation message
down the chain.
● To ensure sequential consistency, the memory module denies processor P3 write
permission until the processor with the chain termination pointer acknowledges the
invalidation of the chain.
● Perhaps this scheme should be called a gossip protocol (as opposed to a snoopy
protocol) because information is passed from individual to individual rather than being
spread by covert observation.
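A minimal sketch of the singly linked chain just described. Memory keeps a head pointer to
the most recent reader, each cache points to the previous reader, and the last cache holds the
chain-termination (CT) pointer; an invalidation for a write walks the chain before write
permission is granted. The class and method names are illustrative.

CT = None                                   # chain-termination pointer

class ChainedDirectory:
    def __init__(self):
        self.head = CT                      # memory's pointer to the newest sharer
        self.next_of = {}                   # cache -> next cache down the chain

    def read(self, cache_id):
        # The new reader is handed the old head pointer and becomes the new head.
        self.next_of[cache_id] = self.head
        self.head = cache_id

    def write(self, writer):
        # Send an invalidation down the chain; write permission is granted only
        # after the cache holding the CT pointer has been reached.
        node = self.head
        while node is not CT:
            if node != writer:
                print("invalidate", node)   # stand-in for an invalidation message
            node = self.next_of[node]
        self.head = writer                  # the writer now holds the only (dirty) copy
        self.next_of = {writer: CT}

d = ChainedDirectory()
for cache in ("C1", "C2", "C3"):
    d.read(cache)                           # chain: C3 -> C2 -> C1 -> CT
d.write("C3")                               # invalidates C2 and C1, then grants write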