Parallel Processing
Topic Overview
Implicit Parallelism: Trends in Microprocessor Architectures
Limitations of Memory System Performance
Dichotomy of Parallel Computing Platforms
Communication Model of Parallel Platforms
Physical Organization of Parallel Platforms
Communication Costs in Parallel Machines
Messaging Cost Models and Routing Mechanisms
Mapping Techniques
Case Studies
Scope of Parallelism
Conventional architectures coarsely comprise a processor, a memory system, and a datapath. Each of these components presents significant performance bottlenecks, and parallelism addresses each of them in significant ways. Different applications utilize different aspects of parallelism: data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance. It is important to understand each of these performance bottlenecks.
Superscalar Execution
Scheduling of instructions is determined by a number of factors:
True data dependency: the result of one operation is an input to the next.
Resource dependency: two operations require the same resource.
Branch dependency: instructions cannot be scheduled across a conditional branch deterministically a priori.
The scheduler, a piece of hardware, examines a large number of instructions in an instruction queue and selects an appropriate number of them to execute concurrently based on these factors. The complexity of this hardware is an important constraint on superscalar processors. The sketch following this list illustrates the difference between dependent and independent operations.
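A minimal, hypothetical illustration (not from the original slides): the first loop below forms a chain of true data dependencies, while the second performs independent accumulations that a superscalar processor could issue concurrently.

#include <stdio.h>

int main(void) {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int sum = 0, evens = 0, odds = 0;

    /* True data dependency: each addition consumes the previous sum,
       so successive iterations cannot issue in the same cycle. */
    for (int i = 0; i < 8; i++)
        sum += a[i];

    /* Independent operations: the two accumulators touch different
       data, so the hardware scheduler may issue both additions
       concurrently. */
    for (int i = 0; i < 8; i += 2) {
        evens += a[i];
        odds  += a[i + 1];
    }

    printf("%d %d %d\n", sum, evens, odds);
    return 0;
}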
Impact of Caches
Repeated references to the same data item correspond to temporal locality. In the matrix-multiplication example, we had O(n^2) data accesses and O(n^3) computation. This asymptotic difference makes the example particularly desirable for caches. Data reuse is critical for cache performance.
The code fragment sums columns of the matrix b into a vector column_sum.
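The fragment itself did not survive in this copy; a minimal reconstruction, assuming an n × n row-major matrix b:

for (i = 0; i < n; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < n; j++)
        column_sum[i] += b[j][i];  /* stride-n accesses walk down column i */
}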
Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.
In this case, the matrix is traversed in row order, and performance can be expected to be significantly better.
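The loop-interchanged version being referred to (again a reconstruction under the same assumptions) sweeps b row by row, giving unit-stride access:

for (i = 0; i < n; i++)
    column_sum[i] = 0.0;
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        column_sum[i] += b[j][i];  /* inner loop walks along row j */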
The three approaches to hiding memory latency are: issuing requests for data before they are needed (prefetching); switching among multiple threads of execution so that computation overlaps outstanding memory accesses (multithreading); and fetching whole blocks of contiguous words in a single access, which corresponds to spatial locality in accessing memory words.
Each dot product is independent of the others, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:
for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);
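Here create_thread, dot_product, and get_row are pseudocode. A self-contained sketch of the same idea using POSIX threads might look as follows (the fixed matrix size and all names are illustrative assumptions):

#include <pthread.h>
#include <stdio.h>

#define N 4

static double a[N][N] = {{1, 2, 3, 4}, {5, 6, 7, 8},
                         {9, 10, 11, 12}, {13, 14, 15, 16}};
static double b[N] = {1, 1, 1, 1};
static double c[N];

/* Each thread computes one dot product: c[i] = a[i] . b */
static void *dot_product(void *arg) {
    long i = (long)arg;
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        sum += a[i][j] * b[j];
    c[i] = sum;
    return NULL;
}

int main(void) {
    pthread_t threads[N];
    for (long i = 0; i < N; i++)
        pthread_create(&threads[i], NULL, dot_product, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(threads[i], NULL);
    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}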
SIMD Processors
Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1, belonged to this class of machines. Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc. SIMD relies on the regular structure of computations (such as those in image processing). It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines whether a processor participates in a computation.
Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
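A scalar sketch of this two-step execution (hypothetical values; the conditional follows the figure's form if (B == 0) C = A; else C = A / B;):

#include <stdio.h>

int main(void) {
    /* Four SIMD processors, one element per processor. */
    int A[4] = {5, 0, 4, 0}, B[4] = {0, 2, 2, 1}, C[4];
    int mask[4];

    /* Evaluate the condition everywhere to form the activity mask. */
    for (int i = 0; i < 4; i++)
        mask[i] = (B[i] == 0);

    /* Step 1: the 'then' branch runs only where the mask is set. */
    for (int i = 0; i < 4; i++)
        if (mask[i]) C[i] = A[i];

    /* Step 2: the mask is inverted and the 'else' branch runs. */
    for (int i = 0; i < 4; i++)
        if (!mask[i]) C[i] = A[i] / B[i];

    for (int i = 0; i < 4; i++)
        printf("C[%d] = %d\n", i, C[i]);
    return 0;
}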
MIMD Processors
In contrast to SIMD processors, MIMD processors can execute different programs on different processors. A variant of this, called single program multiple data (SPMD), executes the same program on different processors. It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. Examples of such platforms include current-generation Sun Ultra Servers, SGI Origin servers, multiprocessor PCs, workstation clusters, and the IBM SP.
SIMD-MIMD Comparison
SIMD computers require less hardware than MIMD computers (a single control unit). However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles. Not all applications are naturally suited to SIMD processors. In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.
Shared-Address-Space Platforms
Part (or all) of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared address space. If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
Message-Passing Platforms
These platforms comprise a set of processors, each with its own (exclusive) memory. Instances of such a view arise naturally from clustered workstations and non-shared-address-space multicomputers. These platforms are programmed using (variants of) send and receive primitives. Libraries such as MPI and PVM provide such primitives.
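A minimal sketch of this send/receive style using MPI (the program structure is illustrative; run with at least two processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send one int to process 1 with message tag 0. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive of the matching message from process 0. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}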
Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Interconnection Networks
Switches map a fixed number of inputs to outputs. The total number of ports on a switch is its degree. The cost of a switch grows as the square of its degree, the cost of the peripheral hardware grows linearly with the degree, and the packaging cost grows linearly with the number of pins.
Network Topologies
A variety of network topologies have been proposed and implemented. These topologies trade off performance against cost. Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.
Bus-based interconnects (a) with no local caches; (b) with local memory/caches.
Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.
An omega network has (p/2) log p switching nodes (log p stages of p/2 switches each), and the cost of such a network grows as Θ(p log p).
An example of blocking in omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.
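Such blocking arises because each message's path is fully determined by its destination tag. A sketch of destination-tag routing through an omega network (assuming p is a power of two, and that switches route to their upper output on a 0 bit and lower output on a 1 bit):

#include <stdio.h>

/* Trace a message through an omega network with p = 2^d inputs.
   Each stage perfect-shuffles the address (left rotate of d bits),
   then the switch output replaces the low bit with the next
   destination bit, most significant first. */
void omega_route(unsigned src, unsigned dst, int d) {
    unsigned node = src, mask = (1u << d) - 1;
    printf("%u", node);
    for (int s = d - 1; s >= 0; s--) {
        node = ((node << 1) | (node >> (d - 1))) & mask; /* shuffle */
        node = (node & ~1u) | ((dst >> s) & 1u);         /* switch  */
        printf(" -> %u", node);
    }
    printf("\n");
}

int main(void) {
    omega_route(2, 7, 3);  /* 010 to 111, as in the figure */
    omega_route(6, 4, 3);  /* 110 to 100: both messages reach node 5
                              after the first stage, so one blocks */
    return 0;
}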
(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.
Linear arrays: (a) with no wraparound links; (b) with wraparound link.
Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
Characteristics of common static network topologies with p nodes:

Network                        Diameter              Bisection Width    Arc Connectivity
Completely-connected           1                     p^2/4              p - 1
Star                           2                     1                  1
Complete binary tree           2 log((p + 1)/2)      1                  1
Linear array                   p - 1                 1                  1
2-D mesh (no wraparound)       2(sqrt(p) - 1)        sqrt(p)            2
2-D torus (wraparound)         2 floor(sqrt(p)/2)    2 sqrt(p)          4
Hypercube                      log p                 p/2                log p
Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.
Example of parallel program execution with the simple three-state coherence protocol.
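A sketch of such a three-state protocol as a per-cache-line state machine (states invalid, shared, and dirty; this illustrates the transitions, not a complete coherence implementation):

#include <stdio.h>

typedef enum { INVALID, SHARED, DIRTY } line_state;

/* Transitions for one cache line under an invalidate protocol. */
line_state on_local_read(line_state s)   { return s == INVALID ? SHARED : s; }
line_state on_local_write(line_state s)  { (void)s; return DIRTY; }  /* remote copies invalidated */
line_state on_remote_read(line_state s)  { return s == DIRTY ? SHARED : s; }  /* dirty line is flushed */
line_state on_remote_write(line_state s) { (void)s; return INVALID; }

int main(void) {
    line_state s = INVALID;
    s = on_local_read(s);    /* INVALID -> SHARED */
    s = on_local_write(s);   /* SHARED  -> DIRTY  */
    s = on_remote_read(s);   /* DIRTY   -> SHARED (after flush) */
    s = on_remote_write(s);  /* SHARED  -> INVALID */
    printf("final state: %d\n", s);
    return 0;
}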
Architecture of typical directory based systems: (a) a centralized directory; and (b) a distributed directory.
Store-and-Forward Routing
A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop. For a message of size m words traversing l communication links, with startup time t_s, per-hop time t_h, and per-word transfer time t_w, the total communication cost is t_comm = t_s + (m t_w + t_h) l. Since t_h is typically small compared to m t_w, this is often approximated as t_comm = t_s + m l t_w.
Routing Techniques
Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.
Packet Routing
Store-and-forward makes poor use of communication resources. Packet routing breaks messages into packets and pipelines them through the network. Since packets may take different paths, each packet must carry routing, error-checking, sequencing, and other header information. The total communication time for packet routing is approximated by t_comm = t_s + t_h l + t_w m, where the factor t_w accounts for overheads in packet headers.
Cut-Through Routing
Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits (flow control digits). Since flits are typically small, the header information must be minimized. This is done by forcing all flits to take the same path, in sequence: a tracer message first programs all intermediate routers, and all flits then take the same route. Error checks are performed on the entire message, as opposed to individual flits, and no sequence numbers are needed.
The total communication time for cut-through routing is approximated by t_comm = t_s + l t_h + t_w m. This takes the same form as the packet-routing expression, but the per-word time t_w carries no packet-header overhead.
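As a worked comparison with assumed values t_s = 100, t_h = 1, and t_w = 1 time units for an m = 1000-word message crossing l = 10 links: store-and-forward costs 100 + (1000 × 1 + 1) × 10 = 10,110 units, while cut-through costs 100 + 10 × 1 + 1 × 1000 = 1,110 units, roughly an order of magnitude better because the message is pipelined across the links.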
Routing a message from node Ps (010) to node Pd (111) in a threedimensional hypercube using E-cube routing.
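E-cube routing corrects the address one dimension at a time: the route is determined by Ps XOR Pd, flipping the lowest-order differing bit at each hop. A small sketch (node labels as in the figure):

#include <stdio.h>

/* E-cube routing in a hypercube: at each step, flip the least
   significant bit in which the current node differs from the
   destination. */
void ecube_route(unsigned src, unsigned dst) {
    unsigned node = src;
    printf("%u", node);
    while (node != dst) {
        unsigned diff = node ^ dst;
        node ^= diff & -diff;  /* flip lowest differing bit */
        printf(" -> %u", node);
    }
    printf("\n");
}

int main(void) {
    ecube_route(2, 7);  /* Ps = 010 to Pd = 111: 010 -> 011 -> 111 */
    return 0;
}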
(a) A three-bit reflected Gray code ring; and (b) its embedding into a three-dimensional hypercube.
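The mapping behind this embedding is easy to compute: ring node i maps to hypercube node G(i) = i XOR (i >> 1), the i-th reflected Gray code value, so consecutive ring nodes differ in exactly one bit and are therefore hypercube neighbors. A sketch:

#include <stdio.h>

/* i-th value of the reflected Gray code. */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
    /* Map an 8-node ring onto a 3-D hypercube. */
    for (unsigned i = 0; i < 8; i++)
        printf("ring node %u -> hypercube node %u\n", i, gray(i));
    return 0;
}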
(a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube; and (b) a 2 × 4 mesh embedded into a three-dimensional hypercube.
(a) Embedding a 16-node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.
Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.