1.1 Parallelism and Computing
1.1.1 Trends in Applications
Parallelism has sometimes been viewed as a rare and exotic subarea of computing,
interesting but of little relevance to the average programmer. A study of trends in
applications, computer architecture, and networking shows that this view is no longer
tenable. Parallelism is becoming ubiquitous, and parallel programming is becoming
central to the programming enterprise.
Although commercial applications may define the architecture of most future parallel
computers, traditional scientific applications will remain important users of parallel
computing technology. Indeed, as nonlinear effects place limits on the insights offered
by purely theoretical investigations and as experimentation becomes more costly or
impractical, computational studies of complex systems are becoming ever more
important. Computational costs typically increase as the fourth power or more of the
``resolution'' that determines accuracy, so these studies have a seemingly insatiable
demand for more computer power. They are also often characterized by large memory
and input/output requirements. For example, a ten-year simulation of the earth's
climate using a state-of-the-art model may involve on the order of $10^{16}$
floating-point operations: about ten days at an execution speed of $10^{10}$
floating-point operations per second (10 gigaflops). This same simulation can easily generate a hundred gigabytes ($10^{11}$ bytes)
or more of data. Yet as Table 1.1 shows, scientists can easily imagine refinements to
these models that would increase these computational requirements 10,000 times.
Table 1.1: Various refinements proposed to climate models, and the increased
computational requirements associated with these refinements. Altogether, these
refinements could increase computational requirements by a factor of $10^4$ or more.
In summary, the need for faster computers is driven by the demands of both data-
intensive applications in commerce and computation-intensive applications in science
and engineering. Increasingly, the requirements of these fields are merging, as
scientific and engineering applications become more data intensive and commercial
applications perform more sophisticated computations.
The performance of the fastest computers has grown exponentially from 1945 to the
present, averaging a factor of 10 every five years. While the first computers
performed a few tens of floating-point operations per second, the parallel computers
of the mid-1990s achieve tens of billions of operations per second (Figure 1.1).
Similar trends can be observed in the low-end computers of different eras: the
calculators, personal computers, and workstations. There is little to suggest that this
growth will not continue. However, the computer architectures used to sustain this
growth are changing radically---from sequential to parallel.
Figure 1.1: Peak performance of some of the fastest supercomputers, 1945--1995.
The exponential growth flattened off somewhat in the 1980s but is accelerating again
as massively parallel supercomputers become available. Here, ``o'' denotes
uniprocessors, ``+'' denotes modestly parallel vector computers with 4--16
processors, and ``x'' denotes massively parallel computers with hundreds or
thousands of processors. Typically, massively parallel computers achieve a lower
proportion of their peak performance on realistic applications than do vector
computers.
Figure 1.2: Trends in computer clock cycle times. Conventional vector
supercomputer cycle times (denoted ``o'') have decreased only by a factor of 3 in
sixteen years, from the CRAY-1 (12.5 nanoseconds) to the C90 (4.0 nanoseconds). RISC
microprocessors (denoted ``+'') are fast approaching the same performance. Both
architectures appear to be approaching physical limits.
This result means that not only is it difficult to build individual components that
operate faster, it may not even be desirable to do so. It may be cheaper to use more,
slower components. For example, if we have an area $n^2 A$ of silicon to use in a
computer, we can either build $n^2$ components, each of size $A$ and able to perform an
operation in time $T$, or build a single component able to perform the same operation
in time $T/n$. The multicomponent system is potentially $n$ times faster.
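To spell out the arithmetic behind that claim (a short worked calculation using the same symbols as above):

```latex
% n^2 components, each completing one operation in time T,
% versus one component completing an operation in time T/n:
\[
  \frac{\text{multicomponent rate}}{\text{single-component rate}}
  \;=\; \frac{n^{2}/T}{\,n/T\,} \;=\; n .
\]
```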
Figure 1.4: The von Neumann computer. A central processing unit (CPU) executes a
program that performs a sequence of read and write operations on an attached
memory.
Figure 1.5: The multicomputer, an idealized parallel computer model. Each node
consists of a von Neumann machine: a CPU and memory. A node can communicate
with other nodes by sending and receiving messages over an interconnection
network.
Figure 1.6: Classes of parallel computer architecture. From top to bottom: a
distributed-memory MIMD computer with a mesh interconnect, a shared-memory
multiprocessor, and a local area network (in this case, an Ethernet). In each case, P
denotes an independent processor.
Two classes of computer system that are sometimes used as parallel computers are
the local area network (LAN), in which computers in close physical proximity (e.g.,
the same building) are connected by a fast network, and the wide area network
(WAN), in which geographically distributed computers are connected. Although
systems of this sort introduce additional concerns such as reliability and security,
they can be viewed for many purposes as multicomputers, albeit with high remote-
access costs. Ethernet and asynchronous transfer mode (ATM) are commonly used
network technologies.
1.3 A Parallel Programming Model
The von Neumann machine model assumes a processor able to execute sequences of
instructions. An instruction can specify, in addition to various arithmetic operations,
the address of a datum to be read or written in memory and/or the address of the next
instruction to be executed. While it is possible to program a computer in terms of this
basic model by writing machine language, this method is for most purposes
prohibitively complex, because we must keep track of millions of memory locations
and organize the execution of thousands of machine instructions. Hence, modular
design techniques are applied, whereby complex programs are constructed from
simple components, and components are structured in terms of higher-level
abstractions such as data structures, iterative loops, and procedures. Abstractions such
as procedures make the exploitation of modularity easier by allowing objects to be
manipulated without concern for their internal structure. So do high-level languages
such as Fortran, Pascal, C, and Ada, which allow designs expressed in terms of these
abstractions to be translated automatically into executable code.
We consider next the question of which abstractions are appropriate and useful in a
parallel programming model. Clearly, mechanisms are needed that allow explicit
discussion about concurrency and locality and that facilitate development of scalable
and modular programs. Also needed are abstractions that are simple to work with and
that match the architectural model, the multicomputer. While numerous possible
abstractions could be considered for this purpose, two fit these requirements
particularly well: the task and the channel, illustrated in Figure 1.7. A task
encapsulates a program and local memory, together with the ports on which it sends and
receives messages; a channel is a message queue connecting an outport of one task to an
inport of another.
Figure 1.8: The four basic task actions. In addition to reading and writing local
memory, a task can send a message, receive a message, create new tasks (suspending
until they terminate), and terminate.
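These four actions can be illustrated loosely in Go, whose goroutines and channels mirror tasks and channels; the produce and consume functions below are our own illustrative names, not part of the model's notation. A task is a function executing concurrently, it sends and receives on typed channels, it creates a new task by launching another goroutine, and it terminates by returning.

```go
package main

import "fmt"

// produce models a task that sends values on its single outport and then
// terminates, closing the channel to signal that no more messages follow.
func produce(out chan<- int, n int) {
	for i := 0; i < n; i++ {
		out <- i // send a message on the outport
	}
	close(out)
}

// consume models a task that receives messages on its single inport.
func consume(in <-chan int, done chan<- struct{}) {
	for v := range in { // receive until the channel is closed
		fmt.Println("received", v)
	}
	done <- struct{}{} // report termination to the parent
}

func main() {
	data := make(chan int)      // a channel connecting an outport to an inport
	done := make(chan struct{}) // lets the parent wait for its child

	go produce(data, 5) // create new tasks
	go consume(data, done)

	<-done // the parent suspends until the child terminates
}
```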
Figure 1.9: Two solutions to the bridge construction problem. Both represent the
foundry and the bridge assembly site as separate tasks, foundry and bridge. The first
uses a single channel on which girders generated by foundry are transported as fast
as they are generated. If foundry generates girders faster than they are consumed
by bridge, then girders accumulate at the construction site. The second solution uses
a second channel to pass flow control messages from bridge to foundry so as to
avoid overflow.
A disadvantage of this scheme is that the foundry may produce girders much faster
than the assembly crew can use them. To prevent the bridge site from overflowing
with girders, the assembly crew instead can explicitly request more girders when
stocks run low. This refined approach is illustrated in Figure 1.9(b), with the stream of
requests represented as a second channel. The second channel can also be used to shut
down the flow of girders when the bridge is complete.
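To make the flow-control idea concrete, here is a minimal sketch of the second solution in Go; the girder type, the empty-struct request messages, and the fixed count of ten girders are our own simplifications, not part of the original example.

```go
package main

import "fmt"

type girder struct{ id int }

// foundry produces a girder only when the assembly crew asks for one,
// so girders can never accumulate at the construction site.
func foundry(requests <-chan struct{}, girders chan<- girder) {
	id := 0
	for range requests { // each flow-control message triggers one girder
		girders <- girder{id}
		id++
	}
	close(girders) // the request channel was closed: shut down production
}

// bridge requests girders as it needs them and assembles them.
func bridge(requests chan<- struct{}, girders <-chan girder, needed int) {
	for i := 0; i < needed; i++ {
		requests <- struct{}{} // "send one more girder"
		g := <-girders
		fmt.Println("assembled girder", g.id)
	}
	close(requests) // the bridge is complete: stop the flow of girders
}

func main() {
	requests := make(chan struct{})
	girders := make(chan girder)
	go foundry(requests, girders)
	bridge(requests, girders, 10)
}
```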
Figure 1.10: The task as building block. (a) The foundry and bridge tasks are
building blocks with complementary interfaces. (b) Hence, the two tasks can be
plugged together to form a complete program. (c) Tasks are interchangeable: another
task with a compatible interface can be substituted to obtain a different program.
The task is a natural building block for modular design. As illustrated in Figure 1.10, a
task encapsulates both data and the code that operates on those data; the ports on
which it sends and receives messages constitute its interface. Hence, the advantages of
modular design summarized in the previous paragraph are directly accessible in the
task/channel model.
Strong similarities exist between the task/channel model and the popular object-
oriented programming paradigm. Tasks, like objects, encapsulate data and the code
that operates on those data. Distinguishing features of the task/channel model are its
concurrency, its use of channels rather than method calls to specify interactions, and
its lack of support for inheritance.
In the bridge construction example, determinism means that the same bridge will be
constructed regardless of the rates at which the foundry builds girders and the
assembly crew puts girders together. If the assembly crew runs ahead of the foundry,
it will block, waiting for girders to arrive. Hence, it simply suspends its operations
until more girders are available, rather than attempting to continue construction with
half-completed girders. Similarly, if the foundry produces girders faster than the
assembly crew can use them, these girders simply accumulate until they are needed.
Determinism would be guaranteed even if several bridges were constructed
simultaneously: As long as girders destined for different bridges travel on distinct
channels, they cannot be confused.
The message-passing model does not preclude the dynamic creation of tasks, the
execution of multiple tasks per processor, or the execution of different programs by
different tasks. However, in practice most message-passing systems create a fixed
number of identical tasks at program startup and do not allow tasks to be created or
destroyed during program execution. These systems are said to implement a single
program multiple data (SPMD) programming model because each task executes the
same program but operates on different data. As explained in subsequent chapters, the
SPMD model is sufficient for a wide range of parallel programming problems but
can hinder the development of some parallel algorithms.
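The SPMD style can be imitated on a small scale with a fixed set of identical Go workers that differ only in their rank and the portion of the data they receive; this sketch illustrates the programming style only and does not model any particular message-passing library.

```go
package main

import "fmt"

// worker is the single program executed by every task; only the rank and
// the data slice differ from task to task (the SPMD style).
func worker(rank int, data []float64, results chan<- float64) {
	sum := 0.0
	for _, x := range data {
		sum += x
	}
	fmt.Printf("task %d summed %d values\n", rank, len(data))
	results <- sum
}

func main() {
	const P = 4 // fixed number of identical tasks, created at startup
	data := make([]float64, 100)
	for i := range data {
		data[i] = float64(i)
	}

	results := make(chan float64, P)
	chunk := len(data) / P
	for rank := 0; rank < P; rank++ {
		lo, hi := rank*chunk, (rank+1)*chunk
		if rank == P-1 {
			hi = len(data) // last task takes any remainder
		}
		go worker(rank, data[lo:hi], results)
	}

	total := 0.0
	for i := 0; i < P; i++ {
		total += <-results
	}
	fmt.Println("total:", total)
}
```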
The first two algorithms described have an SPMD structure, the third creates tasks
dynamically during program execution, and the fourth uses a fixed number of tasks
but has different tasks perform different functions.
Figure 1.11: A parallel algorithm for the one-dimensional finite difference problem.
From top to bottom: the one-dimensional vector X , where N=8 ; the task structure,
showing the 8 tasks, each encapsulating a single data value and connected to left and
right neighbors via channels; and the structure of a single task, showing its two
inports and outports.
A parallel algorithm for this problem creates N tasks, one for each point in X . The i th
task is given the i th value of X and is responsible for computing, in T steps, the T
successive updates of that value; at each step it exchanges its current value with its
left and right neighbors over the channels shown.
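A possible rendering of this task structure in Go is sketched below. Each goroutine holds one value, exchanges it with its two neighbors at every step, and applies a three-point averaging update; that update rule and the periodic (wrap-around) boundary treatment are assumptions made only to keep the sketch self-contained, not the finite difference formula defined with the problem.

```go
package main

import "fmt"

const (
	N = 8  // number of tasks, one per point of X
	T = 10 // number of time steps
)

func main() {
	// toRight[i] carries task i's value to task (i+1)%N;
	// toLeft[i] carries task i's value to task (i-1+N)%N.
	toRight := make([]chan float64, N)
	toLeft := make([]chan float64, N)
	for i := range toRight {
		toRight[i] = make(chan float64, 1)
		toLeft[i] = make(chan float64, 1)
	}

	done := make(chan struct{})
	for i := 0; i < N; i++ {
		go func(i int, x float64) {
			fromLeft := toRight[(i-1+N)%N] // inport connected to the left neighbor
			fromRight := toLeft[(i+1)%N]   // inport connected to the right neighbor
			for t := 0; t < T; t++ {
				toRight[i] <- x // send the current value to both neighbors
				toLeft[i] <- x
				l, r := <-fromLeft, <-fromRight
				x = (l + 2*x + r) / 4 // assumed three-point averaging update
			}
			fmt.Printf("task %d: final value %.4f\n", i, x)
			done <- struct{}{}
		}(i, float64(i)) // the i-th task starts with the i-th value of X
	}
	for i := 0; i < N; i++ {
		<-done
	}
}
```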
Figure 1.12: Task structures for computing pairwise interactions for N=5 . (a) The
unidirectional ring used in the simple, nonsymmetric algorithm. (b) The
unidirectional ring with additional channels used to return accumulated values in the
symmetric algorithm; the path taken by the accumulator used for task 0 is shown as a
solid line.
Our second example uses a similar channel structure but requires a more complex
communication algorithm. Many problems require the computation of all N(N-1)
pairwise interactions $I(X_i, X_j)$, $i \neq j$, between N data, $X_0, X_1, \ldots, X_{N-1}$.
Interactions may be symmetric, in which case $I(X_i, X_j) = I(X_j, X_i)$ and only N(N-1)/2
interactions need be computed. For example, in molecular dynamics we may
require the total force vector $f_i$ acting on each atom $X_i$, defined as follows:
\[
  f_i \;=\; \sum_{j=0,\; j \neq i}^{N-1} F(X_i, X_j) .
\]
Each atom is represented by its mass and Cartesian coordinates. $F(X_i, X_j)$ denotes the
mutual attraction or repulsion between atoms $X_i$ and $X_j$; in this
example, $F(X_i, X_j) = -F(X_j, X_i)$, so interactions are symmetric.
A simple parallel algorithm for the general pairwise interactions problem might
create N tasks. Task i is given the datum $X_i$ and is responsible for computing the
interactions $I(X_i, X_j)$ for all $j \neq i$. One might think that, as each task needs a datum from
every other task, N(N-1) channels would be needed to perform the necessary
communications. However, a more economical structure is possible that uses
only N channels. These channels connect the N tasks in a unidirectional
ring (Figure 1.12(a)). Hence, each task has one inport and one outport. Each task first
initializes both a buffer (with the value of its local datum) and an accumulator that
will maintain the result of the computation. It then repeatedly sends the contents of
its buffer on its outport, receives a datum from its inport into the buffer, and adds
that datum's interaction with the local datum to its accumulator. After N-1 such
send-receive-compute steps, every task has seen every other datum, and each
accumulator holds the desired result.
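The ring algorithm can be sketched in Go as follows; the interact function (here a simple product) and the initial data values are placeholders of our own, since the actual interaction $I(X_i, X_j)$ depends on the application.

```go
package main

import "fmt"

const N = 5 // number of tasks and data, as in Figure 1.12

// interact stands in for the pairwise interaction I(Xi, Xj); a simple
// product is used here only to keep the sketch self-contained.
func interact(a, b float64) float64 { return a * b }

func main() {
	// ring[i] is task i's inport; task i's outport is ring[(i+1)%N],
	// so data circulate around a unidirectional ring.
	ring := make([]chan float64, N)
	for i := range ring {
		ring[i] = make(chan float64, 1)
	}

	results := make(chan float64, N)
	for i := 0; i < N; i++ {
		go func(i int, datum float64) {
			in, out := ring[i], ring[(i+1)%N]
			buffer := datum // buffer initialized with the local datum
			acc := 0.0      // accumulator for this task's interactions
			for step := 0; step < N-1; step++ {
				out <- buffer                  // send the buffer on the outport
				buffer = <-in                  // receive a datum into the buffer
				acc += interact(datum, buffer) // accumulate the interaction
			}
			results <- acc
		}(i, float64(i+1)) // the local datum, here simply i+1
	}

	for i := 0; i < N; i++ {
		fmt.Println("accumulated interactions:", <-results)
	}
}
```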
1.4.3 Search
The next example illustrates the dynamic creation of tasks and channels during
program execution. Algorithm 1.1 explores a search tree looking for nodes that
correspond to ``solutions.'' A parallel algorithm for this problem can be structured as
follows. Initially, a single task is created for the root of the tree. A task evaluates its
node and then, if that node is not a solution, creates a new task for each search call
(subtree). A channel created for each new task is used to return to the new task's
parent any solutions located in its subtree. Hence, new tasks and channels are created
in a wavefront as the search progresses down the search tree (Figure 1.13).
Figure 1.13: Task structure for the search example. Each circle represents a node in
the search tree and hence a call to the search procedure. A task is created for each
node in the tree as it is explored. At any one time, some tasks are actively engaged in
expanding the tree further (these are shaded in the figure); others have reached
solution nodes and are terminating, or are waiting for their offspring to report back
with solutions. The lines represent the channels used to return solutions.
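A compact Go sketch of this structure is shown below; the toy tree and the rule that even-valued nodes count as ``solutions'' are invented purely for illustration. Each call to search that does not find a solution creates one goroutine and one channel per subtree, and forwards any solutions it receives to its parent.

```go
package main

import "fmt"

// node is a toy search-tree node: a value and a list of subtrees.
type node struct {
	value    int
	children []*node
}

// search evaluates a node; if it is not a solution, it creates a new
// task (goroutine) and a new channel for each subtree, and forwards
// any solutions received from its children to its own parent.
func search(n *node, out chan<- int) {
	if n.value%2 == 0 { // toy test: even values are "solutions"
		out <- n.value
	} else {
		childOut := make([]chan int, len(n.children))
		for i, c := range n.children {
			childOut[i] = make(chan int) // a channel back to this parent
			go search(c, childOut[i])    // dynamic task creation
		}
		for _, ch := range childOut { // gather solutions from subtrees
			for s := range ch {
				out <- s
			}
		}
	}
	close(out)
}

func main() {
	tree := &node{1, []*node{
		{2, nil},
		{3, []*node{{4, nil}, {5, nil}}},
		{6, nil},
	}}
	solutions := make(chan int)
	go search(tree, solutions)
	for s := range solutions {
		fmt.Println("solution:", s)
	}
}
```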
If the execution time per problem is constant and each processor has the same
computational power, then it suffices to partition available problems into equal-sized
sets and allocate one such set to each processor. In other situations, we may choose to
use the task structure illustrated in Figure 1.14. The input and output tasks are
responsible for reading and writing the input and output files, respectively. Each
worker task (typically one per processor) repeatedly requests parameter values from
the input task, computes using these values, and sends results to the output task.
Because execution times vary, the input and output tasks cannot expect to receive
messages from the various workers in any particular order. Instead, a many-to-one
communication structure is used that allows them to receive messages from the
various workers in arrival order.
The input task responds to a worker request by sending a parameter to that worker.
Hence, a worker that has sent a request to the input task simply waits for the
parameter to arrive on its reply channel. In some cases, efficiency can be improved
by prefetching, that is, requesting the next parameter before it is needed. The
worker can then perform computation while its request is being processed by the input
task.
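The structure described in this section can be sketched in Go as follows; the request type, the squaring ``computation'', and the sentinel value used to count finished workers are our own simplifications. Each worker carries its private reply channel inside its request, giving the many-to-one structure, and it issues the next request before computing with the current parameter, which is the prefetching optimization.

```go
package main

import "fmt"

// request carries a worker's private reply channel to the input task,
// giving the many-to-one communication structure described above.
type request struct{ reply chan float64 }

func inputTask(requests <-chan request, params []float64) {
	next := 0
	for r := range requests {
		if next < len(params) {
			r.reply <- params[next]
			next++
		} else {
			close(r.reply) // no parameters left: tell the worker to stop
		}
	}
}

func worker(requests chan<- request, results chan<- float64) {
	reply := make(chan float64, 1)
	requests <- request{reply} // request (prefetch) the first parameter
	for {
		p, ok := <-reply
		if !ok {
			break
		}
		reply = make(chan float64, 1)
		requests <- request{reply} // prefetch the next parameter...
		results <- p * p           // ...while computing with this one
	}
	results <- -1 // sentinel: this worker has finished (illustration only)
}

func main() {
	params := []float64{1, 2, 3, 4, 5, 6}
	requests := make(chan request)
	results := make(chan float64)

	go inputTask(requests, params)
	const W = 3
	for i := 0; i < W; i++ {
		go worker(requests, results)
	}

	// The "output task": receive results in arrival order.
	finished := 0
	for finished < W {
		if r := <-results; r < 0 {
			finished++
		} else {
			fmt.Println("result:", r)
		}
	}
}
```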