Module-4 Notes

The document discusses multiprocessor and multicomputer systems, focusing on efficient interconnects for communication among processors, memory, and I/O devices. It covers various network architectures, including hierarchical bus systems, crossbar switches, and multistage networks, detailing their characteristics and operational mechanisms. Additionally, it addresses cache coherence issues and synchronization mechanisms necessary for maintaining data consistency in multiprocessor environments.

Uploaded by

prasidhgowdaa

MODULE-IV

Chapter-7 Multiprocessors and Multicomputers

7.1 Multiprocessor system interconnect

• Parallel processing demands efficient system interconnects for fast communication among multiple processors and shared memory, I/O, and peripheral devices.
• Hierarchical buses, crossbar switches, and multistage networks are often used for this purpose.
• A generalized multiprocessor system is depicted in Fig. 7.1. This architecture combines features from the UMA, NUMA, and COMA models.
• Each processor Pi is attached to its own local memory and private cache.
• The processors are connected to the shared memory through an interprocessor-memory network (IPMN).
• Processors share access to I/O and peripheral devices through a processor-I/O network (PION). Both the IPMN and the PION are necessary in a shared-resource multiprocessor.
• An optional interprocessor communication network (IPCN) permits processor communication without using shared memory.

Network Characteristics
Networks are designed with many choices of timing, switching, and control strategy. In a dynamic network, for example, the multiprocessor interconnections are under program control.

Timing
• Synchronous – controlled by a global clock that synchronizes all network activity.
• Asynchronous – uses handshaking or interlock mechanisms for communication; especially suitable for coordinating devices with different speeds.

Switching Method
• Circuit switching – a pair of communicating devices controls the path for the entire duration of the data transfer.
• Packet switching – large data transfers are broken into smaller pieces, each of which can compete for use of the path.

Network Control
• Centralized – a global controller receives and acts on requests.
• Distributed – requests are handled by local devices independently.

7.1.1 Hierarchical Bus Systems

• A bus system consists of a hierarchy of buses connecting various system and subsystem components in a computer.
• Each bus is formed with a number of signal, control, and power lines. Different buses are used to perform different interconnection functions.
• In general, the hierarchy of bus systems is packaged at different levels as depicted in Fig. 7.2, including local buses on boards, backplane buses, and I/O buses.

Local Bus
• Buses implemented on printed-circuit boards are called local buses.
• On a processor board one often finds a local bus which provides a common communication path among major components (chips) mounted on the board.
• A memory board uses a memory bus to connect the memory with the interface logic.
• An I/O board or network interface board uses a data bus. Each of these board buses consists of signal and utility lines.

Backplane Bus
A backplane is a printed circuit on which many connectors are used to plug in functional boards. A
system bus, consisting of shared signal paths and utility lines, is built on the backplane. This system bus
provides a common communication path among all plug-in boards.

I/O Bus
Input/output devices are connected to a computer system through an I/O bus such as the SCSI (Small Computer Systems Interface) bus.
This bus is made of coaxial cables with taps connecting disks, printers, and other devices to a processor through an I/O controller.
Special interface logic is used to connect various board types to the backplane bus.

Hierarchical Buses and Caches

This is a multilevel tree structure in which the leaf nodes are processors and their private caches
(denoted Pj and C1j in Fig. 7.3). These are divided into several clusters, each of which is connected
through a cluster bus.
An intercluster bus is used to provide communications among the clusters. Second level caches
(denoted as C2i) are used between each cluster bus and the intercluster bus. Each second level cache
must have a capacity that is at least an order of magnitude larger than the sum of the capacities of all
first-level caches connected beneath it.
 Each single cluster operates on a single-bus system. Snoopy bus coherence protocols can be used to
establish consistency among first level caches belonging to the same cluster.
 Second level caches are used to extend consistency from each local cluster to the upper level.
 The upper level caches form another level of shared memory between each cluster and the main
memory modules connected to the intercluster bus.
 Most memory requests should be satisfied at the lower level caches.
 Intercluster cache coherence is controlled among the second-level caches and the resulting effects
are passed to the lower level.

7.1.2 Crossbar Switch and Multiport Memory

Single-stage networks are sometimes called recirculating networks because data items may have to pass through the single stage many times. The crossbar switch and the multiported memory organization are both single-stage networks.

In either organization, if two processors attempt to access the same memory module (or I/O device) at the same time, only one of the requests is serviced at a time.

Multistage Networks
Multistage networks consist of multiple stages of switch boxes, and should be able to connect any input to any output.
A multistage network is called blocking if the simultaneous connections of some multiple input/output
pairs may result in conflicts in the use of switches or communication links.
A nonblocking multistage network can perform all possible connections between inputs and outputs by
rearranging its connections.

Crossbar Networks
Crossbar networks connect every input to every output through a crosspoint switch. A crossbar network is a
single stage, non-blocking permutation network.
In an n-processor, m-memory system, the crossbar needs n x m crosspoint switches. Each crosspoint is a unary switch which can be open or closed, providing a point-to-point connection path between the processor and a memory module.

Crosspoint Switch Design

Out of the n crosspoint switches in each column of an n x m crossbar mesh, only one can be connected at a time.
Crosspoint switches must be designed to handle the potential contention for each memory module. A crossbar switch avoids competition for bandwidth by using O(N^2) switches to connect N inputs to N outputs.
Although highly non-scalable, crossbar switches are a popular mechanism for connecting a small number
of workstations, typically 20 or fewer.

Each processor provides a request line, a read/write line, a set of address lines, and a set of data lines to a
crosspoint switch for a single column. The crosspoint switch eventually responds with an
acknowledgement when the access has been completed.

Multiport Memory
Since crossbar switches are expensive and not suitable for systems with many processors or memory
modules, multiport memory modules may be used instead.
A multiport memory module has multiple connection points for processors (or I/O devices), and the
memory controller in the module handles the arbitration and switching that might otherwise have been
accomplished by a crosspoint switch.
A two-function switch can assume only two possible states, namely the straight or exchange state. A four-function switch box, however, can be in any of four possible states. A multistage network is capable of connecting any input terminal to any output terminal. Multistage networks are basically constructed from so-called shuffle-exchange switching elements, each of which is basically a 2 x 2 crossbar. Multiple layers of these elements are connected to form the network.

7.1.3 Multistage and Combining Networks

Multistage networks are used to build larger multiprocessor systems. We describe two multistage
networks, the Omega network and the Butterfly network, that have been built into commercial
machines.
Routing in Omega Networks
An 8-input Omega network is shown in Fig. 7.8.
In general, an n-input Omega network has log2 n stages. The stages are labeled from 0 to log2 n - 1 from the input end to the output end.
Data routing is controlled by inspecting the destination code in binary. When the i-th high-order bit of the destination code is 0, the 2 x 2 switch at stage i connects the input to the upper output. Otherwise, the input is directed to the lower output.

• Two switch settings are shown in Figs. 7.8a and b for the permutations Π1 = (0,7,6,4,2)(1,3)(5) and Π2 = (0,6,4,7,3)(1,5)(2), respectively.
• The switch settings in Fig. 7.8a implement Π1, which maps 0 → 7, 7 → 6, 6 → 4, 4 → 2, 2 → 0, 1 → 3, 3 → 1, 5 → 5.
• Consider the routing of a message from input 001 to output 011. This involves switches A, B, and C. Since the most significant bit of the destination 011 is a "zero," switch A must be set straight so that input 001 is connected to the upper output (labeled 2).
• The middle bit in 011 is a "one," so input 4 to switch B is connected to the lower output with a "crossover" connection.
• The least significant bit in 011 is a "one," implying a flat connection in switch C.
• Similarly, switches A, E, and D are set for routing a message from input 101 to output 101. No conflicts arise among the switch settings needed to implement the permutation Π1 in Fig. 7.8a.

• Now consider implementing the permutation Π2 in the 8-input Omega network (Fig. 7.8b). Conflicts in switch settings occur in three switches, identified as F, G, and H. The conflicts at F are caused by the desired routings 000 → 110 and 100 → 111.
• Since both destination addresses have a leading bit 1, both inputs to switch F must be connected to the lower output.
• To resolve the conflict, one request must be blocked.
• Similarly, there are conflicts at switch G between 011 → 000 and 111 → 011, and at switch H between 101 → 001 and 011 → 000. At switches I and J, broadcast is used from one input to two outputs, which is allowed if the hardware is built to have four legitimate states, as shown in Fig. 2.24a.
• The above example shows that not all permutations can be implemented in one pass through the Omega network.
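
The destination-tag rule and the conflicts described above can be checked with a small simulation. This is a minimal sketch, assuming the usual Omega construction in which a perfect shuffle precedes each stage of 2 x 2 switches (Fig. 7.8 is not reproduced here). The names shuffle, route, and conflicts are illustrative and do not come from the notes.

```python
def shuffle(line, bits):
    """Perfect shuffle: rotate the `bits`-bit line number left by one position."""
    msb = (line >> (bits - 1)) & 1
    return ((line << 1) & ((1 << bits) - 1)) | msb


def route(src, dst, bits=3):
    """Return the output line used after each stage on the way from src to dst."""
    line, path = src, []
    for stage in range(bits):
        # Stage i inspects the i-th high-order bit of the destination:
        # 0 selects the upper switch output, 1 selects the lower one.
        dbit = (dst >> (bits - 1 - stage)) & 1
        line = (shuffle(line, bits) & ~1) | dbit
        path.append(line)
    return path


def conflicts(perm, bits=3):
    """List (stage, src_a, src_b) triples that need the same switch output."""
    clashes = []
    for stage in range(bits):
        seen = {}
        for src, dst in sorted(perm.items()):
            line = route(src, dst, bits)[stage]
            if line in seen:
                clashes.append((stage, seen[line], src))
            else:
                seen[line] = src
    return clashes


# The two permutations from the text, written out as source -> destination maps.
PI1 = {0: 7, 7: 6, 6: 4, 4: 2, 2: 0, 1: 3, 3: 1, 5: 5}
PI2 = {0: 6, 6: 4, 4: 7, 7: 3, 3: 0, 1: 5, 5: 1, 2: 2}

if __name__ == "__main__":
    print(route(0b001, 0b011))
    print(conflicts(PI1))
    print(conflicts(PI2))
```

Under these assumptions, the message from 001 to 011 leaves stage 0 on line 2, matching the upper output labeled 2 in the text. Π1 produces no clashes, while Π2 produces clashes between inputs 000 and 100, between 011 and 111, and between 011 and 101.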

Routing in Butterfly Networks

• This class of networks is constructed with crossbar switches as building blocks. Fig. 7.10 shows two Butterfly networks of different sizes.
• Fig. 7.10a shows a 64-input Butterfly network built with two stages (2 = log8 64) of 8 x 8 crossbar switches.
• The eight-way shuffle function is used to establish the interstage connections between stage 0 and stage 1.
• In Fig. 7.10b, a three-stage Butterfly network is constructed for 512 inputs, again with 8 x 8 crossbar switches.
• Each of the 64 x 64 boxes in Fig. 7.10b is identical to the two-stage Butterfly network in Fig. 7.10a.
• In total, sixteen 8 x 8 crossbar switches are used in Fig. 7.10a, and 16 x 8 + 8 x 8 = 192 are used in Fig. 7.10b. Larger Butterfly networks can be modularly constructed using more stages.
• Note that no broadcast connections are allowed in a Butterfly network, making these networks a restricted subclass of Omega networks.
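
The switch counts quoted above follow from a simple rule: a network with N inputs built from k x k crossbars has log_k N stages, each containing N / k switches. The following sketch checks that arithmetic; the function name butterfly_size is hypothetical.

```python
def butterfly_size(inputs, radix):
    """Return (stages, total_switches) for a Butterfly network of radix x radix crossbars."""
    stages, span = 0, 1
    while span < inputs:
        span *= radix
        stages += 1
    if span != inputs:
        raise ValueError("inputs must be an exact power of radix")
    # Every stage needs inputs / radix crossbar switches.
    return stages, stages * (inputs // radix)


if __name__ == "__main__":
    print(butterfly_size(64, 8))
    print(butterfly_size(512, 8))
```

For the two networks of Fig. 7.10 this gives 2 stages with 16 switches and 3 stages with 192 switches, in agreement with the text.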

The Hot-Spot Problem
• When the network traffic is non-uniform, a hot spot may appear, corresponding to a certain memory module being excessively accessed by many processors at the same time.
• For example, a semaphore variable being used as a synchronization barrier may become a hot spot since it is shared by many processors.
• Hot spots may degrade the network performance significantly. In the NYU Ultracomputer and the IBM RP3 multiprocessor, a combining mechanism has been added to the Omega network.
• The purpose was to combine multiple requests heading for the same destination at switch points where conflicts are taking place.
• An atomic read-modify-write primitive, Fetch & Add(x, e), has been developed to perform parallel memory updates using the combining network.

Fetch & Add

• This atomic memory operation is effective in implementing an N-way synchronization with a complexity independent of N.
• In a Fetch & Add(x, e) operation, x is an integer variable in shared memory and e is an integer increment.
• When a single processor executes this operation, the semantics is

Fetch & Add(x, e)

{ temp ← x;
x ← temp + e; (7.1)
return temp }

• When N processes attempt to Fetch & Add(x, e) the same memory word simultaneously, the memory is updated only once, following a serialization principle.
• The sum of the N increments, e1 + e2 + ... + eN, is produced in any arbitrary serialization of the N requests.
• This sum is added to the memory word x, resulting in a new value x + e1 + e2 + ... + eN.
• The values returned to the N requests are all unique, depending on the serialization order followed.
• The net result is similar to a sequential execution of N Fetch & Adds but is performed in one indivisible operation.
• Two simultaneous requests are combined in a switch as illustrated in Fig. 7.11.
One of the following operations will be performed if processor P1 executes Ans1 ← Fetch & Add(x, e1) and P2 executes Ans2 ← Fetch & Add(x, e2) simultaneously on the shared variable x.
If the request from P1 is executed ahead of that from P2, the following values are returned:
Ans1 ← x
Ans2 ← x + e1 (7.2)

If the execution order is reversed, the following values are returned:

Ans1 ← x + e2
Ans2 ← x
Regardless of the execution order, the value x + e1 + e2 is stored in memory.
It is the responsibility of the switch box to form the sum e1 + e2, transmit the combined request Fetch & Add(x, e1 + e2), store the value e1 (or e2) in a wait buffer of the switch, and return the values x and x + e1 to satisfy the original requests Fetch & Add(x, e1) and Fetch & Add(x, e2), respectively, as shown in Fig. 7.11 in four steps.
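
The four combining steps can be sketched in software. This is only an illustrative model, assuming the order in which P1 is served ahead of P2; the class SharedMemory and the function combining_switch are hypothetical names, and a real combining switch does this in hardware.

```python
class SharedMemory:
    """Toy shared memory supporting the Fetch & Add semantics of (7.1)."""

    def __init__(self):
        self.words = {}

    def fetch_and_add(self, addr, e):
        temp = self.words.get(addr, 0)   # temp <- x
        self.words[addr] = temp + e      # x <- temp + e
        return temp                      # return temp


def combining_switch(memory, addr, e1, e2):
    """Combine two simultaneous requests into a single memory access."""
    wait_buffer = e1                                  # step 1: save e1 in the switch
    old = memory.fetch_and_add(addr, e1 + e2)         # step 2: forward Fetch & Add(x, e1 + e2)
    ans1 = old                                        # step 3: P1 receives x
    ans2 = old + wait_buffer                          # step 4: P2 receives x + e1
    return ans1, ans2


if __name__ == "__main__":
    mem = SharedMemory()
    mem.words["x"] = 10
    print(combining_switch(mem, "x", 3, 5), mem.words["x"])
```

With x = 10, e1 = 3, and e2 = 5, the two requesters receive 10 and 13, and memory ends up holding 18, as equation (7.2) predicts. The memory module sees only one request.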
7.2 Cache Coherence and Synchronization Mechanisms
7.2.1 Cache Coherence Problem:

 In a memory hierarchy for a multiprocessor system, data inconsistency may occur between
adjacent levels or within the same level.
 For example, the cache and main memory may contain inconsistent copies of the same data object.
 Multiple caches may possess different copies of the same memory block because multiple
processors operate asynchronously and independently.
 Caches in a multiprocessing environment introduce the cache coherence problem. When multiple
processors maintain locally cached copies of a unique shared-memory location, any local
modification of the location can result in a globally inconsistent view of memory.
 Cache coherence schemes prevent this problem by maintaining a uniform state for each cached
block of data.
 Cache inconsistencies caused by data sharing, process migration or I/O are explained below.

Inconsistency in Data sharing:

The cache inconsistency problem occurs only when multiple private caches are used.
In general, three sources of the problem are identified:
 sharing of writable data,
 process migration
 I/O activity.

• Consider a multiprocessor with two processors, each using a private cache and both sharing the
main memory.
• Let X be a shared data element which has been referenced by both processors. Before update, the
three copies of X are consistent.
• If processor P writes new data X' into its cache, the same copy is written immediately into the shared memory under a write-through policy.
• In this case, inconsistency occurs between the two copies (X and X') in the two caches.
• On the other hand, inconsistency may also occur when a write-back policy is used, as shown on the right.
• The main memory is eventually updated when the modified data in the cache are replaced or invalidated.
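
The two cases above can be reproduced with a toy model of two private caches and one shared memory. This is a minimal sketch with no coherence protocol at all, which is exactly why the copies diverge; the function name write_shared is hypothetical.

```python
def write_shared(policy, writer=0, new_value="X'"):
    """Return (memory copy, cache 0 copy, cache 1 copy) after one processor writes X."""
    memory = {"X": "X"}
    caches = [{"X": "X"}, {"X": "X"}]    # both processors have already referenced X
    caches[writer]["X"] = new_value      # the local write always updates the writer's cache
    if policy == "write-through":
        memory["X"] = new_value          # memory is updated immediately
    # Under write-back, memory is updated only when the block is later
    # replaced or invalidated, so it still holds the old value here.
    return memory["X"], caches[0]["X"], caches[1]["X"]


if __name__ == "__main__":
    print(write_shared("write-through"))
    print(write_shared("write-back"))
```

Under write-through the other cache holds a stale X while memory holds X'. Under write-back both memory and the other cache are stale.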
Process Migration and I/O
The figure shows the occurrence of inconsistency after a process containing a shared variable X migrates from processor 1 to processor 2 using the write-back cache on the right. In the middle, a process migrates from processor 2 to processor 1 when using write-through caches.

In both cases, inconsistency appears between the two cache copies, labeled X and X'. Special precautions must be exercised to avoid such inconsistencies. A coherence protocol must be established before processes can safely migrate from one processor to another.

Two Protocol Approaches for Cache Coherence

• Many of the early commercially available multiprocessors used bus-based memory systems.
• A bus is a convenient device for ensuring cache coherence because it allows all processors in the
system to observe ongoing memory transactions.
• If a bus transaction threatens the consistent state of a locally cached object, the cache controller can
take appropriate actions to invalidate the local copy.
• Protocols using this mechanism to ensure coherence are called snoopy protocols because each
cache snoops on the transactions of other caches.
• On the other hand, scalable multiprocessor systems interconnect processors using short point-to-
point links in direct or multistage networks.
• Unlike the situation in buses, the bandwidth of these networks increases as more processors are
added to the system.
• However, such networks do not have a convenient snooping mechanism and do not provide an
efficient broadcast capability. In such systems, the cache coherence problem can be solved using
some variant of directory schemes.

Protocol Approaches for Cache Coherence:

1. Snoopy Bus Protocol
2. Directory Based Protocol

7.2.2 Snoopy Bus Protocol

• Snoopy protocols achieve data consistency among the caches and shared memory through a bus-watching mechanism.
• In Fig. 7.14, two snoopy bus protocols create different results. Consider three processors (P1, P2, Pn) maintaining consistent copies of block X in their local caches (Fig. 7.14a) and in the shared memory module marked X.
• Using a write-invalidate protocol, processor P1 modifies (writes) its cache from X to X', and all other copies are invalidated via the bus (denoted I in Fig. 7.14b). Invalidated blocks are sometimes called dirty, meaning they should not be used.
• The write-update protocol (Fig. 7.14c) demands that the new block content X' be broadcast to all cache copies via the bus.
• The memory copy is also updated if write-through caches are used. With write-back caches, the memory copy is updated later, at block replacement time.

Write Through Caches:

• The states of a cache block copy change with respect to read, write, and replacement operations in the cache. Fig. 7.15 shows the state transitions for two basic write-invalidate snoopy protocols developed for write-through and write-back caches, respectively.
• A block copy of a write-through cache i attached to processor i can assume one of two possible cache states: valid or invalid.
• A remote processor is denoted j, where j ≠ i. For each of the two cache states, six possible events may take place.
• Note that all cache copies of the same block use the same transition graph in making state changes.
• In a valid state (Fig. 7.15a), all processors can read (R(i), R(j)) safely. Local processor i can also write (W(i)) safely in a valid state. The invalid state corresponds to the case of the block either being invalidated or being replaced (Z(i) or Z(j)).

Write Back Caches:

• The valid state of a write-back cache can be further split into two cache states, labeled RW (read-write) and RO (read-only), as shown in Fig. 7.15b.
• The INV (invalidated or not-in-cache) cache state is equivalent to the invalid state mentioned
before. This three-state coherence scheme corresponds to an ownership protocol.
• When the memory owns a block, caches can contain only the RO copies of the block. In other
words, multiple copies may exist in the RO state and every processor having a copy (called a
keeper of the copy) can read (R(i),R(j)) safely.
• The INV state is entered whenever a remote processor writes (W(j)) its local copy or the local processor replaces (Z(i)) its own block copy.
• The RW state corresponds to only one cache copy existing in the entire system owned by the local
processor i.
• Read (R(i)) and write(W(i)) can be safely performed in the RW state. From either the RO state or
the INV state, the cache block becomes uniquely owned when a local write (W(i)) takes place.
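
The three-state write-back scheme can be written down as a transition table. This is a sketch under stated assumptions, since Fig. 7.15b is not reproduced here. The transitions into INV on W(j) or Z(i), and into RW on W(i), come from the text. The remaining entries (a remote read R(j) demoting RW to RO, a local read R(i) moving INV to RO, and the self-loops) are assumptions made to complete the table.

```python
# States: RW (read-write, sole owner), RO (read-only), INV (invalid or not in cache).
# Events: R = read, W = write, Z = replace; (i) = local processor, (j) = remote processor.
TRANSITIONS = {
    "RW": {"R(i)": "RW", "W(i)": "RW", "Z(i)": "INV",
           "R(j)": "RO", "W(j)": "INV", "Z(j)": "RW"},    # R(j) -> RO is an assumption
    "RO": {"R(i)": "RO", "W(i)": "RW", "Z(i)": "INV",
           "R(j)": "RO", "W(j)": "INV", "Z(j)": "RO"},
    "INV": {"R(i)": "RO", "W(i)": "RW", "Z(i)": "INV",    # R(i) -> RO is an assumption
            "R(j)": "INV", "W(j)": "INV", "Z(j)": "INV"},
}


def next_state(state, event):
    """Look up the state of cache i's block copy after one of the six events."""
    return TRANSITIONS[state][event]


def trace(events, state="INV"):
    """Apply a sequence of events and return the state after each one."""
    states = []
    for event in events:
        state = next_state(state, event)
        states.append(state)
    return states


if __name__ == "__main__":
    print(trace(["R(i)", "W(i)", "R(j)", "W(j)"]))
```

Every cache copy of the same block would run this same table, each from its own point of view of which processor is local.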
7.2.3 Directory Based Protocol
7.4 Message-Passing Mechanisms
The two message-passing mechanisms are:
1. Store and Forward Routing
2. Wormhole Routing
1. Store and Forward Routing

 Packets are the basic unit of information flow in a store-and-forward network.

 Each node is required to use a packet buffer.
 A packet is transmitted from a source node to a destination node through a sequence of
intermediate nodes.
 When a packet reaches an intermediate node, it is first stored in the buffer.
 Then it is forwarded to the next node if the desired output channel and a packet buffer in the
receiving node are both available.

2. Wormhole Routing

 Packets are subdivided into smaller flits. Flit buffers are used in the hardware routers attached to
nodes.
 The transmission from the source node to the destination node is done through a sequence of
routers.
 All the flits in the same packet are transmitted in order as inseparable companions in a pipelined
fashion.
 Only the header flit knows where the packet is going.
 All the data flits must follow the header flit.
 Flits from different packets cannot be mixed up. Otherwise they may be towed to the wrong
destination.

Asynchronous Pipelining

• The pipelining of successive flits in a packet is done asynchronously using a handshaking protocol, as shown in Fig. 7.28. Along the path, a 1-bit ready/request (R/A) line is used between adjacent routers.
• When the receiving router (D) is ready (Fig. 7.28a) to receive a flit (i.e., a flit buffer is available), it pulls the R/A line low. When the sending router (S) is ready (Fig. 7.28b), it raises the line high and transmits flit i through the channel.
• While the flit is being received by D (Fig. 7.28c), the R/A line is kept high. After flit i is removed from D's buffer (i.e., transmitted to the next node) (Fig. 7.28d), the cycle repeats itself for the transmission of the next flit i+1 until the entire packet is transmitted.

Advantages:

• Asynchronous pipelining can be very efficient.
• The clock used can be faster than that used in a synchronous pipeline.

Latency Analysis:
• The communication latency in store-and-forward networks is directly proportional to the distance (the number of hops) between the source and the destination:
T_SF = (L / W) (D + 1)

• Wormhole routing has a latency almost independent of the distance between the source and the destination:
T_WH = L / W + (F / W) D

The distance D has a negligible effect on T_WH when L is much larger than F x D.

where L: packet length (in bits)
W: channel bandwidth (in bits per second)
D: distance (number of nodes traversed minus 1)
F: flit length (in bits)
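
The two latency formulas can be compared numerically. The sketch below simply evaluates them; the function names and the sample values (a 1024-bit packet, 32-bit flits, a 1024 bit/s channel, distance 4) are illustrative assumptions, not figures from the notes.

```python
def store_and_forward_latency(L, W, D):
    """T_SF = (L / W) (D + 1): the whole packet is retransmitted at every hop."""
    return (L / W) * (D + 1)


def wormhole_latency(L, W, D, F):
    """T_WH = L / W + (F / W) D: only the header flit pays the per-hop cost."""
    return L / W + (F / W) * D


if __name__ == "__main__":
    L, W, F = 1024, 1024, 32          # assumed sample values
    for D in (1, 4, 16):
        print(D, store_and_forward_latency(L, W, D), wormhole_latency(L, W, D, F))
```

With these sample values and D = 4, store-and-forward takes 5.0 seconds while wormhole routing takes 1.125 seconds. Increasing D grows the first linearly but barely changes the second.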

7.4.2 Deadlock and Virtual channels

The communication channels between nodes in a wormhole-routed multicomputer network are actually shared by many possible source and destination pairs. The sharing of a physical channel leads to the concept of virtual channels.

Virtual channels

• A virtual channel is a logical link between two nodes. It is formed by a flit buffer in the source node, a physical channel between the two nodes, and a flit buffer in the receiver node.
• Four flit buffers are used at the source node and at the receiver node, respectively. One source buffer is paired with one receiver buffer to form a virtual channel when the physical channel is allocated for the pair.
• Thus the physical channel is time-shared by all the virtual channels. By adding virtual channels, the channel-dependence graph can be modified and the deadlock cycle can be broken.
• The cycle can then be converted to a spiral, thus avoiding a deadlock. Virtual channels can be implemented with either unidirectional or bidirectional channels.
• However, a special arbitration line is needed between adjacent nodes interconnected by a bidirectional channel. This line determines the direction of information flow.
• Virtual channels may reduce the effective channel bandwidth available to each request.
• There exists a tradeoff between network throughput and communication latency in determining the degree of using virtual channels.

Deadlock Avoidance

By adding two virtual channels, V3 and V4 in Fig. 7.32c, one can break the deadlock cycle. A modified
channel-dependence graph is obtained by using the virtual channels V3 and V4, after the use of channel
C2, instead of reusing C3 and C4.

The cycle in Fig. 7.32b is thereby converted to a spiral, thus avoiding a deadlock. Channel multiplexing can be done at the flit level, or at the packet level if the packet length is sufficiently short.

Virtual channels can be implemented with either unidirectional channels or bidirectional channels.
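
Deadlock freedom can be tested by checking the channel-dependence graph for cycles. The sketch below uses a depth-first search. The two graphs are an assumed reading of Fig. 7.32 (not reproduced here): a four-channel ring C1 to C4, and the modified graph in which V3 and V4 are used after C2 instead of reusing C3 and C4. The function name has_cycle is hypothetical.

```python
def has_cycle(graph):
    """Return True if the directed graph (dict: node -> list of successors) has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY                       # node is on the current DFS path
        for succ in graph.get(node, ()):
            state = color.get(succ, WHITE)
            if state == GRAY:                    # back edge: a cycle exists
                return True
            if state == WHITE and visit(succ):
                return True
        color[node] = BLACK                      # fully explored, no cycle through here
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Assumed channel-dependence graphs: a ring, and the spiral obtained with V3 and V4.
RING = {"C1": ["C2"], "C2": ["C3"], "C3": ["C4"], "C4": ["C1"]}
SPIRAL = {"C3": ["C4"], "C4": ["C1"], "C1": ["C2"],
          "C2": ["V3"], "V3": ["V4"], "V4": []}

if __name__ == "__main__":
    print(has_cycle(RING), has_cycle(SPIRAL))
```

Under these assumptions the ring contains a cycle and can deadlock, while the spiral does not.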
Chapter-8 Multivector and SIMD Computers

8.1 Vector Processing Principles

Vector Processing Definitions

Vector: A vector is a set of scalar data items, all of the same type, stored in memory. Usually, the
vector elements are ordered to have a fixed addressing increment between successive elements called
the stride.

Vector Processor: A vector processor is an ensemble of hardware resources, including vector registers, functional pipelines, processing elements, and register counters, for performing vector operations.

Vector Processing: Vector processing occurs when arithmetic or logical operations are applied to vectors. It is distinguished from scalar processing, which operates on one datum or one pair of data. Vector processing is generally faster and more efficient than scalar processing.

Vectorization: The conversion from scalar code to vector code is called vectorization.

Vectorizing Compiler: A compiler capable of vectorization is called a vectorizing compiler (vectorizer).

Vector Instruction Types

There are six types of vector instructions. These are defined by mathematical mappings between their
working registers or memory where vector operands are stored.

1. Vector - Vector instructions

2. Vector - Scalar instructions
3. Vector - Memory instructions
4. Vector reduction instructions
5. Gather and scatter instructions
6. Masking instructions
1. Vector - Vector instructions: One or two vector operands are fetched from the respective vector registers, pass through a functional pipeline unit, and produce results in another vector register.

F1: Vi → Vj
F2: Vi x Vj → Vk

Examples: V1 = sin(V2), V3 = V1 + V2

2. Vector - Scalar instructions: A scalar s is combined with each element of a vector register to produce a result vector.

F3: s x Vi → Vj

Examples: V2 = 6 + V1

3. Vector - Memory instructions: These correspond to loads and stores between the vector registers (V) and the memory (M).

F4: M → V (Vector Load)

F5: V → M (Vector Store)

Examples: X = V1, V2 = Y

4. Vector reduction instructions: These include maximum, minimum, sum, and mean value.

F6: Vi → s

F7: Vi x Vj → s

5. Gather and scatter instructions: Two vector registers are used to gather or scatter vector elements randomly throughout the memory, corresponding to the following mappings:
F8: M → Vi x Vj (Gather)
F9: Vi x Vj → M (Scatter)
Gather is an operation that fetches from memory the nonzero elements of a sparse vector using indices.
Scatter does the opposite, storing into memory a vector in a sparse vector whose nonzero entries are indexed.

6. Masking instructions: The mask vector is used to compress or to expand a vector to a shorter or longer index vector (bit per index correspondence).
F10: Vi x Vm → Vj (Vm is a binary vector)

• The gather, scatter, and masking instructions are very useful in handling sparse vectors or sparse matrices often encountered in practical vector processing applications.
• Sparse matrices are those in which most of the entries are zeros.
• Advanced vector processors implement these instructions directly in hardware.
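
The gather, scatter, and masking mappings can be illustrated on ordinary Python lists. This is a minimal software sketch of what the hardware instructions compute, not a model of any particular machine; the function names gather, scatter, and compress and the sample data are assumptions.

```python
def gather(memory, base, index_vector):
    """F8: fetch memory[base + i] for each index i into a dense value vector."""
    return [memory[base + i] for i in index_vector]


def scatter(memory, base, index_vector, values):
    """F9: store each value back to memory[base + i] for its index i."""
    for i, v in zip(index_vector, values):
        memory[base + i] = v


def compress(vector, mask):
    """F10: keep only the elements whose mask bit is 1."""
    return [v for v, m in zip(vector, mask) if m]


if __name__ == "__main__":
    sparse = [0, 7, 0, 0, 3, 0, 9, 0]          # sparse vector stored in memory
    index = [1, 4, 6]                           # positions of its nonzero entries
    dense = gather(sparse, 0, index)
    result = [0] * len(sparse)
    scatter(result, 0, index, [2 * v for v in dense])
    print(dense, result)
```

Here the three nonzero entries are gathered into a short dense vector, doubled, and scattered back to their original positions. Compressing the sparse vector with a mask of its nonzero positions yields the same dense vector.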
8.1.2 Vector-Access Memory Schemes
The flow of vector operands between the main memory and vector registers is usually pipelined with
multiple access paths.
Vector Operand Specifications
• Vector operands may have arbitrary length.
• Vector elements are not necessarily stored in contiguous memory locations.
• To access a vector in memory, one must specify its base, stride, and length.
• Since each vector register has a fixed length, only a segment of the vector can be loaded into a vector register.
• Vector operands should be stored in memory to allow pipelined and parallel access. Access itself should be pipelined.

Three types of Vector-access memory organization schemes

C-Access memory organization
The m-way low-order interleaved memory structure allows m words to be accessed concurrently in an overlapped manner.
The access cycles in different memory modules are staggered. The low-order a bits select the module, and the high-order b bits select the word within each module, where m = 2^a and a + b = n is the address length.
• To access a vector with a stride of 1, successive addresses are latched in the address buffer at the rate of one per cycle.
• Effectively it takes m minor cycles to fetch m words, which equals one (major) memory cycle, as stated in Fig. 5.16b.
• If the stride is 2, the successive accesses must be separated by two minor cycles in order to avoid access conflicts. This reduces the memory throughput by one-half.
• If the stride is 3, there is no module conflict and the maximum throughput (m words) results.
• In general, C-access will yield the maximum throughput of m words per memory cycle if the stride is relatively prime to m, the number of interleaved memory modules.
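
The effect of the stride can be seen by listing which module each access lands in. The sketch below assumes low-order interleaving, so that address A maps to module A mod m. Under that assumption, a strided access pattern visits m / gcd(stride, m) distinct modules. The function names are hypothetical.

```python
from math import gcd


def module_sequence(base, stride, m, count):
    """Modules touched by `count` successive strided accesses under low-order interleaving."""
    return [(base + k * stride) % m for k in range(count)]


def distinct_modules(stride, m):
    """Number of different modules a strided access pattern ever uses."""
    return m // gcd(stride, m)


if __name__ == "__main__":
    m = 8
    for stride in (1, 2, 3):
        print(stride, module_sequence(0, stride, m, m), distinct_modules(stride, m))
```

With m = 8, strides 1 and 3 cycle through all eight modules, while stride 2 uses only four, which is consistent with the halved throughput noted above.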

S-Access memory organization

All memory modules are accessed simultaneously in a synchronized manner. The high-order (n - a) bits select the same offset word from each module.
At the end of each memory cycle (Fig. 8.3b), m = 2^a consecutive words are latched. If the stride is greater than 1, the throughput decreases, roughly in proportion to the stride.

C/S-Access memory organization

• Here C-access and S-access are combined.
• n access buses are used, with m interleaved memory modules attached to each bus.
• The m modules on each bus are m-way interleaved to allow C-access.
• In each memory cycle, at most m x n words are fetched if the n buses are fully used with pipelined memory accesses.
• The C/S-access memory is suitable for use in vector multiprocessor configurations.
• It provides parallel pipelined access of a vector data set with high bandwidth.
• A special vector cache design is needed within each processor in order to guarantee smooth data movement between the memory and multiple vector processors.

8.3 Compound Vector Processing

8.3.1 Compound vector operation
A compound vector function (CVF) is defined as a composite function of vector operations
converted from a looping structure of linked scalar operations.

Do 10 I=1,N
Load R1, X(I)
Load R2, Y(I)
Multiply R1, S
Add R2, R1
Store Y(I), R2
10 Continue
where X(I) and Y(I), I=1, 2,…. N, are two source vectors originally residing in the memory. After the
computation, the resulting vector is stored back to the memory. S is an immediate constant supplied to
the multiply instruction.
After vectorization, the above scalar SAXPY code is converted to a sequence of five vector
instructions:
M( x : x + N-1)  V1 Vector Load
M( y : y + N-1)  V2 Vector Load
S X V1  V1 Vector Multiply
V2 X V1  V2 Vector Add
V2  M( y : y + N-1) Vector Store

Here x and y are the starting memory addresses of the X and Y vectors, respectively; V1 and V2 are two
N-element vector registers in the vector processor.

CVF: Y(1:N) = S × X(1:N) + Y(1:N) or Y(I) = S × X(I) + Y(I)

where Index I implies that all vector operations involve N elements.
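
The correspondence between the scalar loop and the five vector instructions can be sketched as follows. This is a minimal illustration in Python, with plain lists standing in for the vector registers V1 and V2; the function names are assumptions made for this example.

```python
def saxpy_scalar(s, x, y):
    """Scalar loop: one load/multiply/add/store sequence per element."""
    for i in range(len(x)):
        r1 = x[i]       # Load R1, X(I)
        r2 = y[i]       # Load R2, Y(I)
        r1 = r1 * s     # Multiply R1, S
        r2 = r2 + r1    # Add R2, R1
        y[i] = r2       # Store Y(I), R2
    return y


def saxpy_vector(s, x, y):
    """Vectorized form: each statement models one whole-vector instruction."""
    v1 = list(x)                            # Vector Load
    v2 = list(y)                            # Vector Load
    v1 = [s * a for a in v1]                # Vector Multiply
    v2 = [b + a for a, b in zip(v1, v2)]    # Vector Add
    y[:] = v2                               # Vector Store
    return y
```

Both functions compute Y(I) = S × X(I) + Y(I); the vector version simply issues five operations over all N elements instead of five operations per element.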

 Typical CVFs for one-dimensional arrays are built from load, store, multiply, divide, logical and
shifting operations.
 The number of available vector registers and functional pipelines impose some restrictions on how
many CVFs can be executed simultaneously.
 Chaining:
Chaining is an extension of the technique of internal data forwarding practiced in scalar processors.
Chaining is limited by the small number of functional pipelines available in a vector processor.

 Strip-mining:
When a vector has a length greater than that of the vector registers, segmentation of the long vector
into fixed-length segments is necessary. One vector segment is processed at a time (in Cray
computers, a segment is 64 elements).
 Recurrence:
Recurrence is the special case of vector loops in which the output of a functional pipeline may feed
back into one of its own source vector registers.
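
Strip-mining, described in the list above, can be sketched as a loop over fixed-length segments. The segment length of 64 follows the Cray figure quoted in the notes; the helper names are hypothetical, and the last segment is allowed to be shorter than the register length.

```python
def strip_mine(n, vl=64):
    """Yield (start, length) pairs that cover indices 0..n-1 in segments of at most vl elements."""
    for start in range(0, n, vl):
        yield start, min(vl, n - start)


def saxpy_strip_mined(s, x, y, vl=64):
    """Apply Y = S*X + Y one vector-register-sized segment at a time."""
    for start, length in strip_mine(len(x), vl):
        seg = slice(start, start + length)
        y[seg] = [s * a + b for a, b in zip(x[seg], y[seg])]
    return y
```

A 150-element vector, for example, is processed as two full 64-element segments followed by a 22-element remainder.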
Chapter 9- Scalable, Multithreaded, and Dataflow Architectures

9.1 Latency Hiding Techniques.

9.1.1 Shared Virtual Memory
Single-address-space multiprocessors/multicomputers must use shared virtual memory.
The Architecture Environment
 The Dash architecture was a large-scale, cache-coherent, NUMA multiprocessor system as
depicted in Fig. 9.1.

 It consisted of multiple multiprocessor clusters connected through a scalable, low latency
interconnection network.
 Physical memory was distributed among the processing nodes in various clusters. The distributed
memory formed a global address space.
 Cache coherence was maintained using an invalidating, distributed directory-based protocol. For
each memory block, the directory kept track of remote nodes caching it.
 When a write occurred, point-to-point messages were sent to invalidate remote copies of the block.
 Acknowledgement messages were used to inform the originating node when an invalidation was
completed.

 Two levels of local cache were used per processing node. Loads and writes were separated with the
use of write buffers for implementing weaker memory consistency models.
 The main memory was shared by all processing nodes in the same cluster. To facilitate prefetching
and the directory-based coherence protocol, directory memory and remote-access caches were used
for each cluster.
 The remote-access cache was shared by all processors in the same cluster.
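
The invalidation step of a directory-based protocol can be sketched as below. This is a simplified illustration and not the actual Dash protocol: it assumes a single directory object, ignores block states and message timing, and the class and method names are invented for the example.

```python
class Directory:
    """Tracks, for each memory block, the set of nodes currently caching it."""

    def __init__(self):
        self.sharers = {}  # block -> set of node ids holding a copy

    def read(self, node, block):
        """Record that `node` now caches `block`."""
        self.sharers.setdefault(block, set()).add(node)

    def write(self, node, block):
        """Invalidate all other copies and return the nodes that must acknowledge."""
        others = self.sharers.get(block, set()) - {node}
        # Point-to-point invalidations would be sent only to the nodes in `others`;
        # the writer proceeds once each of them has acknowledged.
        self.sharers[block] = {node}
        return sorted(others)
```

Because the directory knows exactly which nodes hold a copy, invalidations go only to those nodes rather than being broadcast.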

The SVM Concept

 Figure 9.2 shows the structure of a distributed shared memory. A global virtual address space is
shared among processors residing at a large number of loosely coupled processing nodes.
 The idea of Shared virtual memory (SVM) is to implement coherent shared memory on a network
of processors without physically shared memory.
 The coherent mapping of SVM on a message-passing multicomputer architecture is shown in Fig.
9.2b.
 The system uses virtual addresses instead of physical addresses for memory references.

 Each virtual address space can be as large as a single node can provide and is shared by all nodes in
the system.
 The SVM address space is organized in pages which can be accessed by any node in the system. A
memory-mapping manager on each node views its local memory as a large cache of pages for its
associated processor.
Page Swapping
 A memory reference causes a page fault when the page containing the memory location is not in a
processor’s local memory.
 When a page fault occurs, the memory manager retrieves the missing page from the memory of
another processor.
 If there is a page frame available on the receiving node, the page is moved in.
 Otherwise, the SVM system uses page replacement policies to find an available page frame,
swapping its contents to the sending node.
 A hardware MMU can set the access rights (nil, read-only, writable) so that a memory access
violating memory coherence will cause a page fault.
 The memory coherence problem is solved in IVY through distributed fault handlers and their
servers. To client programs, this mechanism is completely transparent.
 The large virtual address space allows programs to be larger in code and data space than the
physical memory on a single node.
 This SVM approach offers the ease of shared-variable programming in a message-passing
environment.
 In addition, it improves software portability and enhances system scalability through modular
memory growth.
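
The page-swapping steps above can be sketched as a small simulation. It makes several simplifying assumptions that are not in the notes: each page has exactly one copy in the system (no read replication), the victim is chosen in FIFO order, and the class and method names are hypothetical.

```python
class Node:
    def __init__(self, name, frames):
        self.name = name
        self.frames = frames  # number of local page frames
        self.pages = {}       # page number -> contents, in arrival order


class SVM:
    def __init__(self, nodes):
        self.nodes = nodes
        self.faults = 0

    def holder_of(self, page):
        """Find the node whose local memory currently holds the page."""
        for node in self.nodes:
            if page in node.pages:
                return node
        raise KeyError(f"page {page} is not resident on any node")

    def access(self, node, page):
        """Return the page contents, migrating the page on a fault."""
        if page in node.pages:
            return node.pages[page]          # local hit, no fault
        self.faults += 1
        sender = self.holder_of(page)
        data = sender.pages.pop(page)        # retrieve the missing page
        if len(node.pages) >= node.frames:
            # No free frame: pick a victim (FIFO, an assumption) and
            # swap its contents to the sending node.
            victim = next(iter(node.pages))
            sender.pages[victim] = node.pages.pop(victim)
        node.pages[page] = data
        return data
```

The sending node always has room for the swapped-out victim because it has just given up the frame that held the requested page.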
Latency hiding can be accomplished through four complementary approaches:
i) Pre-fetching techniques which bring instructions or data close to the processor
before they are actually needed
ii) Coherent caches supported by hardware to reduce cache misses
iii) Relaxed memory consistency models by allowing buffering and pipelining of memory
references
iv) Multiple-contexts support to allow a processor to switch from one context to another
when a long latency operation is encountered.

9.1.2 Pre-fetching Techniques

Pre-fetching uses knowledge about the expected misses in a program to move the
corresponding data close to the processor before it is actually needed.
Pre-fetching can be classified based on whether it is
 Binding
 Non binding
or whether it is controlled by
 hardware
 software
 Binding pre-fetching: the value of a later reference (e.g., a register load) is bound at the time
when the pre-fetch completes.
 Non-binding pre-fetching: brings data close to the processor, but the data remains visible to
the cache coherence protocol and is thus kept consistent until the processor actually reads the value.
 Hardware-controlled pre-fetching: includes schemes such as long cache lines and
instruction look-ahead.
 Software-controlled pre-fetching: explicit pre-fetch instructions are issued. This allows the
pre-fetching to be done selectively and extends the possible interval between pre-fetch issue and
actual reference.

9.1.3 Distributed Coherent Caches

 While the cache coherence problem is easily solved for small bus-based multiprocessors through
the use of snoopy cache coherence protocols, the problem is much more complicated for large scale
multiprocessors that use general interconnection networks.
 As a result, some large scale multiprocessors did not provide caches, others provided caches that
must be kept coherent by software, and still others provided full hardware support for coherent
caches.
 Caching of shared read-write data provided substantial gains in performance. The largest benefit
came from a reduction of cycles wasted due to read misses. The cycles wasted due to write misses
were also reduced.
 Hardware cache coherence is an effective technique for substantially increasing the performance
with no assistance from the compiler or programmer.

9.1.5 Relaxed memory consistency models

Relaxed consistency models are defined by relaxing one or more of the requirements of
sequential consistency. These models do not enforce sequential consistency at the hardware level;
instead, programmers are responsible for obtaining the required ordering by applying
synchronization techniques.
Relaxed consistency models can be compared along four dimensions:
1. Relaxation
2. Synchronizing vs non-synchronizing
3. Issue vs view-based
4. Relative model strength
9.2 Principles of Multithreading
9.2.1 Multithreading Issues and Solutions
Multithreading demands that the processor be designed to handle multiple contexts simultaneously on
a context-switching basis.

Architecture Environment
A multithreaded MPP system is modeled by a network of processor (P) and memory (M) nodes as
shown in Fig. 9.11a. The distributed memories form a global address space.
Four machine parameters are defined below to analyze the performance of this network:
1. The Latency (L): This is the communication latency on a remote memory access. The value of
L includes the network delays, cache-miss penalty and delays caused by contentions in split
transactions.
2. The number of Threads (N): This is the number of threads that can be interleaved in each
processor. A thread is represented by a context consisting of a program counter, a register set
and the required context status words.
3. The context-switching overhead (C): This refers to the cycles lost in performing context
switching in a processor. This time depends on the switch mechanism and the amount of
processor states devoted to maintaining active threads.
4. The interval between switches (R): This refers to the cycles between switches triggered by
remote reference. The inverse p=1/R is called the rate of requests for remote accesses. This
reflects a combination of program behavior and memory system design.
In order to increase efficiency, one approach is to reduce the rate of requests by using distributed
coherent caches. Another is to eliminate processor waiting through multithreading.

Multithreaded Computations
Fig 9.11b shows the structure of the multithreaded parallel computations model.
The computation starts with a sequential thread (1), followed by supervisory scheduling (2), where
the processors begin threads of computation (3), by intercomputer messages that update variables
among the nodes when the computer has distributed memory (4), and finally by synchronization
prior to beginning the next unit of parallel work (5).
The communication overhead period (4) inherent in distributed memory structures is usually
distributed throughout the computation and is possibly completely overlapped.
Message passing overhead in multicomputers can be reduced by specialized hardware operating in
parallel with computation.
Communication bandwidth limits granularity, since a certain amount of data has to be transferred
with other nodes in order to complete a computational grain. Message passing calls (4) and
synchronization (5) are nonproductive.
Fast mechanisms to reduce or to hide these delays are therefore needed. Multithreading is not
capable of speedup in the execution of single threads, while weak ordering or relaxed consistency
models are capable of doing this.

Problems of Asynchrony
Massively parallel processors operate asynchronously in a network environment. The asynchrony
triggers two fundamental latency problems:
1. Remote loads
2. Synchronizing loads
Solutions to Asynchrony Problem
1. Multithreading Solutions
2. Distributed Caching

1. Multithreading Solutions – Multiplex among many threads

When one thread issues a remote-load request, the processor begins work on another thread, and so on
(Fig. 9.13a).
 Clearly the cost of thread switching should be much smaller than that of the latency of the remote
load, or else the processor might as well wait for the remote load’s response.
 As the inter node latency increases, more threads are needed to hide it effectively. Another concern
is to make sure that messages carry continuations. Suppose, after issuing a remote load from thread
T1 (Fig 9.13a), we switch to thread T2, which also issues a remote load.
 The responses may not return in the same order. This may be caused by requests traveling different
distances, through varying degrees of congestion, to destination nodes whose loads differ greatly,
etc.
 One way to cope with the problem is to associate each remote load and response with an identifier
for the appropriate thread, so that it can be re-enabled on the arrival of a response.
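
Tagging each remote load with an identifier, as suggested in the last point above, can be sketched as follows. The class, its fields and the message layout are hypothetical; the sketch only shows how a tag lets a response that arrives out of order re-enable the correct thread.

```python
class MultithreadedProcessor:
    def __init__(self):
        self.next_tag = 0
        self.suspended = {}  # tag -> (thread id, destination register)
        self.ready = []      # thread ids that may resume, in wake-up order
        self.registers = {}  # (thread id, register) -> loaded value

    def remote_load(self, thread, register, address):
        """Suspend `thread` and build a request message that carries its tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.suspended[tag] = (thread, register)
        return tag, address

    def on_response(self, tag, value):
        """Deliver a response to the thread named by the tag and re-enable it."""
        thread, register = self.suspended.pop(tag)
        self.registers[(thread, register)] = value
        self.ready.append(thread)
```

If T1 and T2 both issue remote loads and the response for T2 returns first, T2 is re-enabled first and each value still lands in the right thread's register.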

2. Distributed Caching
 The concept of Distributed Caching is shown in Fig. 9.13b. Every memory location has an owner
node. For example, N1 owns B and N2 owns A.
 The directories are used to contain import-export lists and state whether the data is shared (for
reads, many caches may hold copies) or exclusive (for writes, one cache holds the current value).
 The directories multiplex among a small number of contexts to cover the cache loading effects.

 The Distributed Caching offers a solution for the remote-loads problem, but not for the
synchronizing-loads problem.
 Multithreading offers a solution for remote loads and possibly for synchronizing loads.
 The two approaches can be combined to solve both types of remote access problems.

9.2.2 Multiple-Context Processors

Multithreaded systems are constructed with multiple-context (multithreaded) processors.
Enhanced Processor Model
 A conventional single-thread processor waits during a remote reference, so it is idle for a period of
time L.
 A multithreaded processor, as modeled in Fig. 9.14a, will suspend the current context and switch to
another, so after some fixed number of cycles it will again be busy doing useful work, even though
the remote reference is outstanding.
 Only if all the contexts are suspended (blocked) will the processor be idle.
The objective is to maximize the fraction of time that the processor is busy, so we use the efficiency
of the processor as our performance index, given by:

Efficiency = busy / (busy + switching + idle)

where busy, switching and idle represent the amount of time, measured over some large interval, that
the processor is in the corresponding state.
The basic idea behind a multithreaded machine is to interleave the execution of several contexts in
order to dramatically reduce the value of idle, but without overly increasing the magnitude of
switching.
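
The trade-off between idle and switching time can be explored with a small sketch built on the parameters N, R, C and L defined in Section 9.2.1. It assumes a simple deterministic model that these notes do not state: every context runs for exactly R cycles, each switch costs C cycles, and each remote reference takes L cycles. Under that assumption efficiency grows linearly with N until the processor saturates.

```python
def efficiency(n, r, c, l):
    """Busy fraction of a processor with n contexts under the deterministic model.

    n -- number of contexts (threads)
    r -- run length between switches, in cycles
    c -- context-switching overhead, in cycles
    l -- remote-reference latency, in cycles
    """
    linear = n * r / (r + c + l)   # too few contexts: the processor still idles
    saturated = r / (r + c)        # enough contexts: only switching time is lost
    return min(linear, saturated)


def saturation_threads(r, c, l):
    """Number of contexts at which the latency is fully hidden in this model."""
    return 1 + l / (r + c)
```

Once the number of contexts reaches the saturation point, adding more contexts does not help; the only remaining loss is the switching overhead C.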

Context-Switching Policies
Different multithreaded architectures are distinguished by the context-switching policies adopted.
Four switching policies are:
1. Switch on Cache miss – This policy corresponds to the case where a context is preempted
when it causes a cache miss.
In this case, R is taken to be the average interval between misses (in cycles) and L the time
required to satisfy the miss.
Here, the processor switches contexts only when it is certain that the current one will be
delayed for a significant number of cycles.
2. Switch on every load - This policy allows switching on every load, independent of whether it
will cause a miss or not.
In this case, R represents the average interval between loads. A general multithreading model
assumes that a context is blocked for L cycles after every switch; but in the case of a switch-on-
load processor, this happens only if the load causes a cache miss.
3. Switch on every instruction – This policy allows switching on every instruction, independent
of whether it is a load or not. Successive instructions become independent, which will benefit
pipelined execution.
4. Switch on block of instructions – Blocks of instructions from different threads are interleaved.
This will improve the cache-hit ratio due to locality. It will also benefit single-context
performance.
