
1.1 Parallelism and Computing


  A parallel computer is a set of processors that are able to work cooperatively to
solve a computational problem. This definition is broad enough to include parallel
supercomputers that have hundreds or thousands of processors, networks of
workstations, multiple-processor workstations, and embedded systems. Parallel
computers are interesting because they offer the potential to concentrate
computational resources---whether processors, memory, or I/O bandwidth---on
important computational problems.

Parallelism has sometimes been viewed as a rare and exotic subarea of computing,
interesting but of little relevance to the average programmer. A study of trends in
applications, computer architecture, and networking shows that this view is no longer
tenable. Parallelism is becoming ubiquitous, and parallel programming is becoming
central to the programming enterprise.

1.1.1 Trends in Applications


 

As computers become ever faster, it can be tempting to suppose that they will
eventually become ``fast enough'' and that appetite for increased computing power
will be sated. However, history suggests that as a particular technology satisfies
known applications, new applications will arise that are enabled by that technology
and that will demand the development of new technology. As an amusing illustration
of this phenomenon, a report prepared for the British government in the late 1940s
concluded that Great Britain's computational requirements could be met by two or
perhaps three computers. In those days, computers were used primarily for computing
ballistics tables. The authors of the report did not consider other applications in
science and engineering, let alone the commercial applications that would soon come
to dominate computing. Similarly, the initial prospectus for Cray Research predicted a
market for ten supercomputers; many hundreds have since been sold.

Traditionally, developments at the high end of computing have been motivated by
numerical simulations of complex systems such as weather, climate, mechanical
devices, electronic circuits, manufacturing processes, and chemical reactions.
However, the most significant forces driving the development of faster computers
today are emerging commercial applications that require a computer to be able to
process large amounts of data in sophisticated ways. These applications   include
video conferencing, collaborative work environments,   computer-aided diagnosis in
medicine, parallel databases used for   decision support, and advanced graphics and
virtual reality,   particularly in the entertainment industry. For example, the integration
of parallel computation, high-performance networking, and multimedia technologies
is leading to the development of video   servers, computers designed to serve
hundreds or thousands of simultaneous requests for real-time video. Each video
stream can involve both data transfer rates of many megabytes per second and large
amounts of processing for encoding and decoding. In graphics, three-dimensional data
sets are now approaching 10^9 volume elements (1024 on a side). At 200 operations per
element, a display updated 30 times per second requires a computer capable of
6.4 x 10^12 operations per second.
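
As a quick check of this estimate (the intermediate rounding is ours):

\[
1024^{3} \times 200 \times 30 \;\approx\; 1.07\times10^{9} \times 6000 \;\approx\; 6.4\times10^{12}\ \text{operations per second.}
\]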

Although commercial applications may define the architecture of most future parallel
computers, traditional scientific applications will remain important users of parallel
computing technology. Indeed, as nonlinear effects place limits on the insights offered
by purely theoretical investigations and as experimentation becomes more costly or
impractical, computational studies of complex systems are becoming ever more
important. Computational costs typically increase as the fourth power or more of the
``resolution'' that determines accuracy, so these studies have a seemingly insatiable
demand for more computer power. They are also often characterized by large memory
and input/output requirements. For example, a ten-year simulation of the earth's
climate using a state-of-the-art model may involve 10^16 floating-point operations---
ten days at an execution speed of 10^10 floating-point operations per second (10
gigaflops). This same simulation can easily generate a hundred gigabytes (10^11 bytes)
or more of data. Yet as Table 1.1 shows, scientists can easily imagine refinements to
these models that would increase these computational requirements 10,000 times.
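
The operation count follows directly from the stated execution speed and duration:

\[
10\ \text{days} \times 86{,}400\ \tfrac{\text{s}}{\text{day}} \times 10^{10}\ \tfrac{\text{flop}}{\text{s}} \;\approx\; 8.6\times10^{15} \;\approx\; 10^{16}\ \text{floating-point operations.}
\]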

 
Table 1.1: Various refinements proposed to climate models, and the increased
computational requirements associated with these refinements. Altogether, these
refinements could increase computational requirements by a factor of 10^4 or more.

In summary, the need for faster computers is driven by the demands of both data-
intensive applications in commerce and computation-intensive applications in science
and engineering. Increasingly, the requirements of these fields are merging, as
scientific and engineering applications become more data intensive and commercial
applications perform more sophisticated computations.

1.1.2 Trends in Computer Design


 

The performance of the fastest computers has grown exponentially from   1945 to the
present, averaging a factor of 10 every five years. While the first computers
performed a few tens of floating-point   operations per second, the parallel computers
of the mid-1990s achieve tens of billions of operations per second (Figure 1.1).
Similar trends can be observed in the low-end computers of different eras: the
calculators, personal computers, and workstations. There is little to suggest that this
growth will not continue. However, the computer architectures used to sustain this
growth are changing radically---from sequential to parallel.

 
Figure 1.1: Peak performance of some of the fastest supercomputers, 1945--1995.
The exponential growth flattened off somewhat in the 1980s but is accelerating again
as massively parallel supercomputers become available. Here, ``o'' denotes
uniprocessors, ``+'' denotes modestly parallel vector computers with 4--16
processors, and ``x'' denotes massively parallel computers with hundreds or
thousands of processors. Typically, massively parallel computers achieve a lower
proportion of their peak performance on realistic applications than do vector
computers. 

The performance of a computer depends directly on the time required to perform a
basic operation and the number of these basic operations that can be performed
concurrently. The time to perform a basic   operation is ultimately limited by the
``clock cycle'' of the processor, that is, the time required to perform the most primitive
operation. However, clock cycle times are decreasing slowly and appear to be
approaching physical limits such as the speed of light (Figure 1.2). We cannot depend
on faster processors to provide increased computational performance.

 
Figure 1.2: Trends in computer clock cycle times. Conventional vector
supercomputer cycle times (denoted ``o'') have decreased only by a factor of 3 in
sixteen years, from the CRAY-1 (12.5 nanoseconds) to the C90 (4.0). RISC
microprocessors (denoted ``+'') are fast approaching the same performance. Both
architectures appear to be approaching physical limits. 

To circumvent these limitations, the designer may attempt to utilize internal
concurrency in a chip, for example, by operating simultaneously on all 64 bits of two
numbers that are to be multiplied. However, a fundamental result in Very Large
Scale   Integration (VLSI) complexity theory says that this strategy is expensive. This
result states that for certain transitive computations (in which any output may depend
on any input), the chip area A and the time T required to perform this computation are
related so that AT^2 must exceed some problem-dependent function of problem size.
This result can be explained informally by assuming that a computation must move a
certain amount of information from one side of a square chip to the other. The amount
of information that can be moved in a time unit is limited by the cross section of the
chip, sqrt(A). This gives a transfer rate of sqrt(A), from which the AT^2 relation is obtained.
To decrease the time required to move the information by a certain factor, the cross
section must be increased by the same factor, and hence the total area must be
increased by the square of that factor.
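
A minimal sketch of this informal argument, writing I for the amount of information
that must cross the chip (a symbol not used in the text above): moving I units at a
rate limited by the cross section sqrt(A) within time T requires

\[
\sqrt{A}\,T \;\ge\; I \quad\Longrightarrow\quad A\,T^{2} \;\ge\; I^{2},
\]

so halving the time requires quadrupling the area, as stated.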

This   result means that not only is it difficult to build individual components that
operate faster, it may not even be desirable to do so. It may be cheaper to use more,
slower components. For example, if we have an area n^2 A of silicon to use in a
computer, we can either build n^2 components, each of size A and able to perform an
operation in time T, or build a single component able to perform the same operation
in time T/n. The multicomponent system is potentially n times faster.

Computer designers use a variety of techniques to overcome these limitations on
single computer performance, including pipelining (different stages of several
instructions execute concurrently) and multiple function units (several multipliers,
adders, etc., are controlled by a single instruction stream). Increasingly, designers are
incorporating multiple ``computers,'' each with its own processor, memory, and
associated interconnection logic. This approach is   facilitated by advances in VLSI
technology that continue to decrease the number of components required to
implement a computer. As the cost of a computer is (very approximately) proportional
to the number of components that it contains, increased integration also increases the
number of processors that can be included in a computer for a particular cost. The
result is continued growth in processor counts (Figure 1.3).
 
Figure 1.3: Number of processors in massively parallel computers (``o'') and vector
multiprocessors (``+''). In both cases, a steady increase in processor count is
apparent. A similar trend is starting to occur in workstations, and personal computers
can be expected to follow the same trend. 

1.1.3 Trends in Networking


  Another important trend changing the face of computing is an enormous   increase in
the capabilities of the networks that connect computers. Not long ago, high-speed
networks ran at 1.5 Mbits per second; by the end of the 1990s, bandwidths in excess
of 1000 Mbits per second will be commonplace. Significant improvements in
reliability are also expected. These trends make it feasible to develop applications that
use physically distributed resources as if they were part of the same computer. A
typical application of this sort may utilize processors on multiple remote computers,
access a selection of remote databases, perform rendering on one or more graphics
computers, and provide real-time output and control on a workstation.

We emphasize that computing on networked computers (``distributed computing'') is
not just a subfield of parallel computing. Distributed computing is deeply concerned
with problems such as reliability, security, and heterogeneity that are generally
regarded as tangential in parallel computing. (As Leslie Lamport has observed, ``A
distributed system is one in which the failure of a computer you didn't even know
existed can render your own computer unusable.'') Yet the basic task of developing
programs that can run on many computers at once is a parallel computing problem. In
this respect, the previously distinct worlds of parallel and distributed computing are
converging.

1.2 A Parallel Machine Model


The rapid penetration of computers into commerce, science, and education owed
much to the early standardization on a single machine   model, the von Neumann
computer. A von Neumann computer comprises a central processing unit (CPU)
connected to a storage unit (memory) (Figure 1.4). The CPU executes a stored
program that specifies a sequence of read and write operations on the memory. This
simple model has proved remarkably robust. Its persistence over more than forty years
has allowed the study of such important topics as algorithms and programming
languages to proceed to a large extent independently of developments in computer
architecture. Consequently, programmers can be trained in the abstract art of
``programming'' rather than the craft of ``programming machine X'' and can design
algorithms for an abstract von Neumann machine, confident that these algorithms will
execute on most target computers with reasonable efficiency.

 
Figure 1.4: The von Neumann computer. A central processing unit (CPU) executes a
program that performs a sequence of read and write operations on an attached
memory. 

Our study of parallel programming will be most rewarding if we can identify a
parallel machine model that is as general and useful as the von Neumann sequential
machine model. This machine model must be both simple and realistic: simple to
facilitate understanding and programming, and realistic to ensure that programs
developed for the model execute with reasonable efficiency on real computers.

1.2.1 The Multicomputer


A parallel machine model called the multicomputer fits these   requirements. As
illustrated in Figure 1.5, a   multicomputer comprises a number of von Neumann
computers, or nodes, linked by an interconnection network. Each computer   executes
its own program. This program may access local memory and may send and receive
messages over the network. Messages are used to communicate with other computers
or, equivalently, to read and write remote memories. In the idealized network, the cost
of sending a message between two nodes is independent of both node location and
other network traffic, but does depend on message length.

 
Figure 1.5: The multicomputer, an idealized parallel computer model. Each node
consists of a von Neumann machine: a CPU and memory. A node can communicate
with other nodes by sending and receiving messages over an interconnection
network. 

A defining attribute of the multicomputer model is that accesses to local (same-node)
memory are less expensive than accesses to remote (different-node) memory. That is,
read and write are less costly than send and receive. Hence, it is desirable that
accesses to local data be more frequent than accesses to remote data. This property,
called locality, is a third fundamental requirement   for parallel software, in addition to
concurrency and scalability.   The importance of locality depends on the ratio of
remote to local access costs. This ratio can vary from 10:1 to 1000:1 or greater,
depending on the relative performance of the local computer, the network, and the
mechanisms used to move data to and from the network.

1.2.2 Other Machine Models


 

 
 
Figure 1.6: Classes of parallel computer architecture. From top to bottom: a
distributed-memory MIMD computer with a mesh interconnect, a shared-memory
multiprocessor, and a local area network (in this case, an Ethernet). In each case, P
denotes an independent processor. 

We review important parallel computer architectures (several are illustrated in
Figure 1.6) and discuss briefly how these differ from the idealized multicomputer
model.

The multicomputer is most similar to what is often called the distributed-memory
MIMD (multiple instruction multiple data) computer. MIMD means that each
processor can execute a   separate stream of instructions on its own local data;
distributed memory means that memory is distributed among the processors, rather
than placed in a central location. The principal difference between a multicomputer
and the distributed-memory MIMD computer is that in the latter, the cost of sending a
message between two nodes may not be independent of node location and other
network   traffic. These issues are discussed in Chapter 3.   Examples of this class of
machine include the IBM SP, Intel Paragon,   Thinking Machines CM5, Cray
T3D,   Meiko CS-2, and   nCUBE.

Another important class of parallel computer is the multiprocessor, or shared-memory
MIMD computer. In multiprocessors, all processors share access to a common
memory, typically via a bus or a hierarchy of buses. In the idealized Parallel Random
Access Machine (PRAM) model, often used in theoretical studies of parallel
algorithms, any processor can access any memory element in the same amount of
time. In practice, scaling this architecture usually introduces some form of memory
hierarchy; in particular, the frequency with which the shared memory is accessed may
be reduced by storing   copies of frequently used data items in
a cache associated   with each processor. Access to this cache is much faster than
access   to the shared memory; hence, locality is usually important, and the differences
between multicomputers and multiprocessors are really just questions of degree.
Programs developed for multicomputers can also execute efficiently on
multiprocessors, because shared memory permits an efficient implementation of
message passing. Examples of this class   of machine include the Silicon Graphics
Challenge,   Sequent Symmetry,   and the many multiprocessor workstations.

A more specialized class of parallel computer is the SIMD (single instruction
multiple data) computer. In SIMD machines, all processors execute the same
instruction stream on a different piece of data. This approach can reduce both
hardware and software complexity but is appropriate only for specialized problems
characterized by a high degree of regularity, for example, image processing and
certain numerical simulations. Multicomputer algorithms cannot in general be
executed efficiently on SIMD computers. The MasPar MP is   an example of this
class   of machine.

Two classes of computer system that are sometimes used as parallel   computers are
the local area network (LAN), in which computers in   close physical proximity (e.g.,
the same building) are connected by a   fast network, and the wide area network
(WAN), in which geographically   distributed computers are connected. Although
systems of this sort introduce additional concerns such as reliability and security,
they   can be viewed for many purposes as multicomputers, albeit with high   remote-
access costs. Ethernet and asynchronous transfer mode (ATM)   are commonly used
network technologies.

 
1.3 A Parallel Programming Model
 

The von Neumann machine model assumes a processor able to execute sequences of
instructions. An instruction can specify, in addition to various arithmetic operations,
the address of a datum to be read or written in memory and/or the address of the next
instruction to be executed. While it is possible to program a computer in terms of this
basic model by writing machine language, this method is for most purposes
prohibitively complex, because we must keep track of millions of memory locations
and organize the execution of thousands of machine   instructions. Hence, modular
design techniques are applied, whereby complex programs are constructed from
simple components, and components are structured in terms of higher-level
abstractions such as data structures, iterative loops, and procedures. Abstractions such
as procedures make the exploitation of modularity easier by allowing objects to be
manipulated without concern for their internal structure. So do high-level languages
such as Fortran, Pascal, C, and Ada, which allow designs expressed in terms of these
abstractions to be translated automatically into executable code.

Parallel programming introduces additional sources of complexity: if we were to
program at the lowest level, not only would the number of instructions executed
increase, but we would also need to manage explicitly the execution of thousands of
processors and coordinate millions of interprocessor interactions. Hence, abstraction
and modularity are at least as important as in sequential programming. In fact, we
shall emphasize modularity as a fourth fundamental   requirement for parallel
software, in addition to concurrency, scalability, and locality.

1.3.1 Tasks and Channels


 
 
Figure 1.7: A simple parallel programming model. The figure shows both the
instantaneous state of a computation and a detailed picture of a single task. A
computation consists of a set of tasks (represented by circles) connected by channels
(arrows). A task encapsulates a program and local memory and defines a set of ports
that define its interface to its environment. A channel is a message queue into which a
sender can place messages and from which a receiver can remove messages,
``blocking'' if messages are not available. 

  We consider next the question of which abstractions are appropriate   and useful in a
parallel programming model. Clearly, mechanisms are needed that allow explicit
discussion about concurrency and locality and that facilitate development of scalable
and modular programs. Also needed are abstractions that are simple to work with and
that match the architectural model, the multicomputer. While numerous possible
abstractions could be considered for this purpose, two fit   these requirements
particularly well: the task and channel. These are illustrated in Figure 1.7 and can be
summarized as follows:
 
Figure 1.8: The four basic task actions. In addition to reading and writing local
memory, a task can send a message, receive a message, create new tasks (suspending
until they terminate), and terminate. 

1. A parallel computation consists of one or more tasks. Tasks execute
concurrently. The number of tasks can vary during program execution.
2. A task encapsulates a sequential program and local memory. (In effect, it is a
virtual von Neumann machine.) In addition, a set
of inports and outports define its interface to its   environment.
3. A task can perform four basic actions in addition to reading and writing its
local memory (Figure 1.8): send messages on its outports, receive messages
on its inports, create new tasks, and terminate.
4. A send operation is asynchronous: it completes immediately. A receive
operation is synchronous: it causes execution of the task to block until a
message is available. (A minimal sketch of these semantics appears just after this list.)
5.   Outport/inport pairs can be connected by message queues called channels.
Channels can be created and deleted, and references to channels (ports) can
be included in messages, so connectivity can vary dynamically.
6.   Tasks can be mapped to physical processors in various ways; the mapping
employed does not affect the semantics of a program. In particular, multiple
tasks can be mapped to a single processor. (We can also imagine a single task
being mapped to multiple processors, but that possibility is not considered
here.)
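
The send/receive semantics in points 4 and 5 can be captured in a few lines. The
following is a minimal Python sketch (not from the original text): a thread-safe
queue stands in for a channel, so send returns immediately while receive blocks
until a message is available.

import queue

class Channel:
    # Hypothetical helper: an unbounded message queue connecting one
    # outport to one inport.
    def __init__(self):
        self._q = queue.Queue()

    def send(self, msg):
        # Asynchronous: completes immediately, regardless of the receiver.
        self._q.put(msg)

    def receive(self):
        # Synchronous: blocks the calling task until a message is available.
        return self._q.get()

The later sketches in this chapter use queue.Queue directly in the same role.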

The task abstraction provides a mechanism for talking about locality: data
contained in a task's local memory are ``close''; other data are ``remote.'' The channel
abstraction provides a mechanism for indicating that computation in one task requires
data in another task in order to proceed. (This is termed a data dependency ).
The   following simple example illustrates some   of these features.

Example. Bridge Construction:

Consider the following real-world problem. A bridge is to be assembled from girders
being constructed at a foundry. These two activities are organized by providing trucks
to transport girders from the foundry to the bridge site. This situation is illustrated in
Figure 1.9(a), with the foundry and bridge represented as tasks and the stream of
trucks as a channel. Notice that this approach allows assembly of the bridge and
construction of girders to proceed in parallel without any explicit coordination: the
foundry crew puts girders on trucks as they are produced, and the assembly crew adds
girders to the bridge as and when they arrive.

 
Figure 1.9: Two solutions to the bridge construction problem. Both represent the
foundry and the bridge assembly site as separate tasks,  foundry and  bridge. The first
uses a single channel on which girders generated by  foundry are transported as fast
as they are generated. If  foundry generates girders faster than they are consumed
by  bridge, then girders accumulate at the construction site. The second solution uses
a second channel to pass flow control messages from  bridge to  foundry so as to
avoid overflow. 

A disadvantage of this scheme is that the foundry may produce girders much faster
than the assembly crew can use them. To prevent the bridge site from overflowing
with girders, the assembly crew instead can explicitly request more girders when
stocks run low. This refined approach is illustrated in Figure 1.9(b), with the stream of
requests represented as a second channel. The second channel can also be used to shut
down the flow of girders when the bridge is complete.
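
A minimal Python sketch of this second, flow-controlled solution (Figure 1.9(b)):
the two tasks are threads, the two channels are queues, and the number of girders is
an arbitrary illustrative value, not something taken from the text.

import queue, threading

NUM_GIRDERS = 10          # illustrative problem size

girders  = queue.Queue()  # channel: foundry -> bridge (girders)
requests = queue.Queue()  # channel: bridge -> foundry (flow-control requests)

def foundry():
    # Produce a girder only when the assembly crew has asked for one,
    # so girders never pile up at the bridge site.
    for i in range(NUM_GIRDERS):
        requests.get()                 # block until a request arrives
        girders.put(f"girder-{i}")     # send is asynchronous

def bridge():
    for _ in range(NUM_GIRDERS):
        requests.put("more, please")   # ask for the next girder
        girder = girders.get()         # block until it arrives
        print("installing", girder)

threading.Thread(target=foundry).start()
bridge()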

We now examine some other properties of this task/channel programming model:
performance, mapping independence, modularity, and determinism.

Performance. Sequential programming abstractions such as procedures and data
structures are effective because they can be mapped simply and efficiently to the von
Neumann computer. The task and channel have a similarly direct mapping to the
multicomputer. A task represents a piece of code that can be executed sequentially, on
a single processor. If two tasks that share a channel are mapped to different
processors, the channel connection is implemented as interprocessor communication;
if they are mapped to the same   processor, some more efficient mechanism can be
used.

Mapping Independence. Because tasks interact using the same mechanism (channels)
regardless of task location, the result computed by a program does not depend on
where tasks execute. Hence, algorithms can be designed and implemented without
concern for the number of processors on which they will execute; in fact, algorithms
are frequently designed that create many more tasks than processors. This is
a   straightforward way of achieving scalability : as the number   of processors
increases, the number of tasks per processor is reduced but the algorithm itself need
not be modified. The creation of more tasks than processors can also serve to mask
communication delays, by providing other computation that can be performed while
communication is performed to access remote data.

Modularity. In modular program design, various components of a program are
developed separately, as independent modules, and then combined to obtain a
complete program. Interactions between   modules are restricted to well-defined
interfaces. Hence, module   implementations can be changed without modifying other
components, and the properties of a program can be determined from the
specifications for its modules and the code that plugs these modules together. When
successfully applied, modular design reduces program complexity and facilitates code
reuse.

 
Figure 1.10: The task as building block. (a) The  foundry and  bridge tasks are
building blocks with complementary interfaces. (b) Hence, the two tasks can be
plugged together to form a complete program. (c) Tasks are interchangeable: another
task with a compatible interface can be substituted to obtain a different program. 

The task is a natural building block for modular design. As illustrated in Figure 1.10, a
task encapsulates both data and the code that operates on those data; the ports on
which it sends and receives messages constitute its interface. Hence, the advantages of
modular design summarized in the previous paragraph are directly accessible in the
task/channel model.

Strong similarities exist between the task/channel model and the   popular object-
oriented programming paradigm. Tasks, like objects,   encapsulate data and the code
that operates on those data. Distinguishing features of the task/channel model are its
concurrency, its use of channels rather than method calls to specify interactions, and
its lack of support for inheritance.

Determinism. An algorithm or program is deterministic if execution with a particular
input always yields the same output. It is nondeterministic if multiple executions with
the same input can give different outputs. Although nondeterminism is sometimes
useful   and must be supported, a parallel programming model that makes it easy to
write deterministic programs is highly desirable.   Deterministic programs tend to be
easier to understand. Also, when   checking for correctness, only one execution
sequence of a parallel program needs to be considered, rather than all possible
executions.

The ``arms-length'' interactions supported by the task/channel model make
determinism relatively easy to guarantee. As we shall see in Part II when we consider
programming tools, it suffices to verify that each channel has a single sender and a
single receiver and that a task receiving on a channel blocks until a receive operation
is complete. These conditions can be relaxed when nondeterministic interactions are
required.

In the bridge construction example, determinism means that the same bridge will be
constructed regardless of the rates at which the foundry builds girders and the
assembly crew puts girders together. If the assembly crew runs ahead of the foundry,
it will block, waiting for girders to arrive. Hence, it simply suspends its operations
until more girders are available, rather than attempting to continue construction with
half-completed girders. Similarly, if the foundry produces girders faster than the
assembly crew can use them, these girders simply accumulate until they are needed.
Determinism would be guaranteed even if several bridges were constructed
simultaneously: As long as girders destined for different bridges travel on   distinct
channels, they cannot be confused.

1.3.2 Other Programming Models


 

In subsequent chapters, the task/channel model will often be used to describe
algorithms. However, this model is certainly not the only approach that can be taken
to representing parallel computation. Many other models have been proposed,
differing in their flexibility, task interaction mechanisms, task granularities, and
support for locality, scalability, and modularity. Here, we review several alternatives.

Message passing. Message passing is probably the most widely used
parallel programming model today. Message-passing programs, like task/channel
programs, create multiple tasks, with each task encapsulating local data. Each task is
identified by a unique name, and tasks interact by sending and receiving messages to
and from named tasks. In this respect, message passing is really just a minor variation
on the task/channel model, differing only in the mechanism used for data transfer. For
example, rather than sending a message on ``channel ch,'' we may send a message to
``task 17.'' We study the message-passing model in more detail in Chapter 8, where we
discuss the Message Passing Interface. In that chapter, we explain that the definition
of channels is a useful discipline even when designing message-passing programs,
because it forces us to conceptualize the communication structure of a parallel
program.
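
To make the contrast concrete, here is a minimal sketch (not MPI, and not from the
text) in which a message is addressed to a named task rather than sent on a channel;
the mailbox dictionary and the task ids are purely illustrative.

import queue

# One mailbox (message queue) per named task -- an illustrative stand-in for a
# message-passing runtime.
mailboxes = {tid: queue.Queue() for tid in range(32)}

def send(dest_task, msg):
    mailboxes[dest_task].put(msg)      # addressed by task id, not by channel

def receive(my_task):
    return mailboxes[my_task].get()    # blocks until a message arrives

send(17, "hello")                      # ``send a message to task 17''
print(receive(17))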

The message-passing model does not preclude the dynamic creation of tasks, the
execution of multiple tasks per processor, or the execution of different programs by
different tasks. However, in practice most message-passing systems create a fixed
number of identical tasks at program startup and do not allow tasks to be created or
destroyed during program execution. These systems are said to implement a   single
program multiple data (SPMD) programming model because each   task executes the
same program but operates on different data. As explained in subsequent chapters, the
SPMD model is sufficient for a wide range of parallel programming problems but
does hinder some parallel algorithm developments.

Data Parallelism. Another commonly used parallel programming model, data
parallelism, calls for exploitation of the concurrency that derives from the application
of the same operation to   multiple elements of   a data structure, for example, ``add 2
to all elements of this array,'' or ``increase the salary of all employees with 5 years
service.'' A data-parallel program consists of a sequence of such operations. As each
operation on each data element can be thought of as an independent task, the natural
granularity of a data-parallel computation is small, and the concept of ``locality'' does
not arise naturally. Hence, data-parallel compilers often require the programmer to
provide information about how data are to be distributed over processors, in other
words, how data are to be partitioned into tasks. The compiler can then translate the
data-parallel program into an SPMD formulation, thereby generating communication
code automatically. We discuss the data-parallel model in more detail in
Chapter 7 under the topic of High Performance Fortran. In that chapter, we show that
the algorithm design and analysis techniques developed for the task/channel model
apply directly to data-parallel programs; in particular, they provide the concepts
required to understand the locality and scalability of data-parallel programs.
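
The two operations quoted above can be written in a data-parallel style as whole-array
operations. The sketch below uses NumPy purely as a stand-in notation (the array
contents and the 3 percent raise are illustrative assumptions, not from the text).

import numpy as np

a = np.arange(10)
a = a + 2                        # ``add 2 to all elements of this array''

salary = np.array([40e3, 55e3, 70e3])
years  = np.array([2, 5, 9])
salary = np.where(years >= 5, salary * 1.03, salary)   # raise for 5+ years' service

# Each whole-array operation applies the same operation to every element and
# could, in principle, be executed by one fine-grained task per element.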

Shared Memory. In the shared-memory programming model, tasks share a common
address space, which they read and write asynchronously. Various mechanisms such
as locks and   semaphores may be used to control access to the shared memory.
An   advantage of this model from the programmer's point of view is that the notion of
data ``ownership'' is lacking, and hence there is no   need to specify explicitly the
communication of data from producers to consumers. This model can simplify
program development. However, understanding and managing locality becomes more
difficult, an important consideration (as noted earlier) on most shared-memory
architectures. It can also be more difficult to write deterministic programs.
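
A minimal sketch of the model: several threads read and write a single shared
variable, with a lock controlling access to it (the shared counter and the thread
count are illustrative choices, not from the text).

import threading

total = 0                        # shared data: every thread reads and writes it
lock = threading.Lock()          # mechanism controlling access to the shared memory

def worker(values):
    global total
    for v in values:
        with lock:               # without the lock, concurrent updates could be lost
            total += v

threads = [threading.Thread(target=worker, args=(range(1000),)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(total)                     # deterministic only because updates are serialized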

1.4 Parallel Algorithm Examples


We conclude this chapter by presenting four examples of parallel algorithms. We do
not concern ourselves here with the process by which these algorithms are derived or
with their efficiency; these   issues are discussed in Chapters 2 and 3, respectively.
The goal is simply to introduce parallel algorithms and their description in terms of
tasks and channels.

The first two algorithms described have an SPMD structure, the third creates tasks
dynamically during program execution, and the fourth uses a fixed number of tasks
but has different tasks perform different functions.

1.4.1 Finite Differences


 

 
Figure 1.11: A parallel algorithm for the one-dimensional finite difference problem.
From top to bottom: the one-dimensional vector X , where N=8 ; the task structure,
showing the 8 tasks, each encapsulating a single data value and connected to left and
right neighbors via channels; and the structure of a single task, showing its two
inports and outports. 

We first consider a one-dimensional finite difference problem, in which we have a
vector X^(0) of size N and must compute X^(T), where

   X_i^(t+1) = ( X_{i-1}^(t) + 2 X_i^(t) + X_{i+1}^(t) ) / 4,   0 < i < N-1,  0 <= t < T.

That is, we must repeatedly update each element of X, with no element being updated
in step t+1 until its neighbors have been updated in step t.

A parallel algorithm for this problem creates N tasks, one for each point in X. The i th
task is given the value X_i^(0) and is responsible for computing, in T steps, the
values X_i^(1), X_i^(2), ..., X_i^(T). Hence, at step t, it must obtain the values
X_{i-1}^(t) and X_{i+1}^(t) from tasks i-1 and i+1. We specify this data transfer by
defining channels that link each task with ``left'' and ``right'' neighbors, as shown in
Figure 1.11, and requiring that at step t, each task i other than task 0 and task N-1

1. sends its data X_i^(t) on its left and right outports,
2. receives X_{i-1}^(t) and X_{i+1}^(t) from its left and right inports, and
3. uses these values to compute X_i^(t+1).

Notice that the N tasks can execute independently, with the only constraint on
execution order being the synchronization enforced by the receive operations. This
synchronization ensures that no data value is updated at step t+1 until the data values
in neighboring tasks have been updated at step t . Hence, execution is deterministic.
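
A compact threaded sketch of this algorithm, with one thread per task and a queue per
channel; N, T, the initial values, and the update rule (as reconstructed above) are
illustrative assumptions rather than part of the original text.

import threading, queue

N, T = 8, 4                                  # illustrative sizes
X0 = [float(i) for i in range(N)]            # arbitrary initial vector
chan = {(i, j): queue.Queue()                # one channel per neighbor pair (i -> j)
        for i in range(N) for j in (i - 1, i + 1) if 0 <= j < N}
result = [None] * N

def task(i):
    x = X0[i]
    for t in range(T):
        # Send the current value to both neighbors (boundaries have only one).
        for j in (i - 1, i + 1):
            if 0 <= j < N:
                chan[(i, j)].put(x)
        # Interior tasks receive from both neighbors and update; boundary tasks
        # keep their values fixed but still drain their incoming messages.
        left  = chan[(i - 1, i)].get() if i > 0     else None
        right = chan[(i + 1, i)].get() if i < N - 1 else None
        if 0 < i < N - 1:
            x = (left + 2 * x + right) / 4
    result[i] = x

threads = [threading.Thread(target=task, args=(i,)) for i in range(N)]
for th in threads: th.start()
for th in threads: th.join()
print(result)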

1.4.2 Pairwise Interactions


 

 
Figure 1.12: Task structures for computing pairwise interactions for N=5 . (a) The
unidirectional ring used in the simple, nonsymmetric algorithm. (b) The
unidirectional ring with additional channels used to return accumulated values in the
symmetric algorithm; the path taken by the accumulator used for task 0 is shown as a
solid line. 

Our second example uses a similar channel structure but requires a more complex
communication algorithm. Many problems require the computation of all N(N-1)
pairwise interactions I(X_i, X_j), i != j, between N data X_0, X_1, ..., X_{N-1}.
Interactions may be symmetric, in which case I(X_i, X_j) = I(X_j, X_i) and only
N(N-1)/2 interactions need be computed. For example, in molecular dynamics we may
require the total force vector f_i acting on each atom X_i, defined as follows:

   f_i = sum over all j != i of F(X_i, X_j).

Each atom is represented by its mass and Cartesian coordinates. F(X_i, X_j) denotes the
mutual attraction or repulsion between atoms X_i and X_j; in this
example, F(X_i, X_j) = -F(X_j, X_i), so interactions are symmetric.

A simple parallel algorithm for the general pairwise interactions problem might
create N tasks. Task i is given the datum X_i and is responsible for computing the
interactions I(X_i, X_j) for all j != i. One might think that as each task needs a datum from
every other task, N(N-1) channels would be needed to perform the necessary
communications. However, a more economical structure is possible that uses
only N channels. These channels are used to connect the N tasks in a unidirectional
ring (Figure 1.12(a)). Hence, each task has one inport and one outport. Each task first
initializes both a buffer (with the value of its local datum) and an accumulator that
will maintain the result of the computation. It then repeatedly

1. sends the value contained in its buffer on its outport,
2. receives a datum on its inport into its buffer,
3. computes the interaction between this datum and its local datum, and
4. uses the computed interaction to update its local accumulator.

This send-receive-compute cycle is repeated N-1 times, causing the N data to flow
around the ring. Every task sees every datum and is able to compute all N-
1 interactions involving its datum. The algorithm involves N-1 communications per
task.
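
The nonsymmetric ring algorithm can be simulated sequentially in a few lines; the data
values and the interaction function are placeholders, and the list rotation stands in
for the N simultaneous send/receive operations.

N = 5
data = [float(i + 1) for i in range(N)]      # one illustrative datum per task

def interaction(x, y):                       # placeholder for I(x, y)
    return x * y

buffers = list(data)                         # each buffer starts with the local datum
accum = [0.0] * N                            # one accumulator per task

for step in range(N - 1):
    # Each task sends its buffer to the next task around the ring and receives
    # the previous task's buffer; rotating the list models all N transfers at once.
    buffers = [buffers[(i - 1) % N] for i in range(N)]
    for i in range(N):
        accum[i] += interaction(data[i], buffers[i])

print(accum)                                 # accum[i] = sum of I(data[i], data[j]), j != i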
It turns out that if interactions are symmetric, we can halve both the number of
interactions computed and the number of communications by refining the
communication structure. Assume for simplicity that N is odd. An
additional N communication channels are created, linking each task to the task
offset (N-1)/2 around the ring (Figure 1.12(b)). Each time an interaction I(X_i, X_j) is
computed between a local datum X_i and an incoming datum X_j, this value is
accumulated not only in the accumulator for X_i but also in another accumulator that is
circulated with X_j. After (N-1)/2 steps, the accumulators associated with the circulated
values are returned to their home task using the new channels and combined with the
local accumulators. Hence, each symmetric interaction is computed only once: either
as I(X_i, X_j) on the node that holds X_i or as I(X_j, X_i) on the node that holds X_j.

1.4.3 Search
 

  The next example illustrates the dynamic creation of tasks and channels during
program execution. Algorithm 1.1 explores a search tree looking for nodes that
correspond to ``solutions.'' A parallel algorithm for this problem can be structured as
follows. Initially, a single task is created for the root of the tree. A task evaluates its
node and then, if that node is not a solution, creates a new task for each search call
(subtree). A channel created for each new task is used to return to the new task's
parent any solutions located in its subtree. Hence, new tasks and channels are created
in a wavefront as the search progresses down the search tree (Figure 1.13).
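
A rough sketch of this wavefront of task creation, using one thread per tree node and
a queue per parent/child channel; the tree shape and the solution test are illustrative
placeholders, not part of Algorithm 1.1.

import threading, queue

# Illustrative search tree: each node is (value, [children]); "solutions" here
# are simply values divisible by 3. Both choices are placeholders.
tree = (1, [(2, [(6, []), (7, [])]),
            (3, []),
            (4, [(9, []), (5, [])])])

def search_task(node, out):
    value, children = node
    if value % 3 == 0:                     # placeholder solution test
        out.put(value)                     # report a solution to the parent
    else:
        child_channels, workers = [], []
        for child in children:             # create one new task per subtree
            ch = queue.Queue()              # channel for returning solutions
            t = threading.Thread(target=search_task, args=(child, ch))
            child_channels.append(ch); workers.append(t); t.start()
        for t in workers:
            t.join()                        # wait for offspring to terminate
        for ch in child_channels:           # forward offspring solutions upward
            while not ch.empty():
                out.put(ch.get())

root_out = queue.Queue()
search_task(tree, root_out)
solutions = []
while not root_out.empty():
    solutions.append(root_out.get())
print(solutions)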
 

 
Figure 1.13: Task structure for the search example. Each circle represents a node in
the search tree and hence a call to the  search procedure. A task is created for each
node in the tree as it is explored. At any one time, some tasks are actively engaged in
expanding the tree further (these are shaded in the figure); others have reached
solution nodes and are terminating, or are waiting for their offspring to report back
with solutions. The lines represent the channels used to return solutions. 

1.4.4 Parameter Study


 

In so-called embarrassingly parallel problems, a computation consists of a number of
tasks that can execute more or less independently, without communication. These
problems are usually easy to adapt for parallel execution. An example is a parameter
study, in which the same computation must be performed using a range of   different
input parameters. The parameter values are read from an input file, and the results of
the different computations are written to an output file.
 
Figure 1.14: Task structure for parameter study problem. Workers (W) request
parameters from the input task (I) and send results to the   output task (O). Note the
many-to-one connections: this program is   nondeterministic in that the input and
output tasks receive data from workers in whatever order the data are generated.
Reply channels, represented as dashed lines, are used to communicate parameters
from the input task to workers. 

If the execution time per problem is constant and each processor has the same
computational power, then it suffices to partition available problems into equal-sized
sets and allocate one such set to each processor. In other situations, we may choose to
use the task structure illustrated in Figure 1.14. The input and output tasks are
responsible for reading and writing the input and output files, respectively. Each
worker task (typically one per processor) repeatedly requests parameter values from
the input task, computes using these values, and sends results to the output task.
Because execution times vary, the input and output tasks cannot expect to receive
messages from the various workers in any particular order. Instead, a many-to-one
communication structure is used that allows them to receive messages from the
various workers in arrival order.

The input task responds to a worker request by sending a parameter to that worker.
Hence, a worker that has sent a request to the input task simply waits for the
parameter to arrive on its reply channel.   In some cases, efficiency can be improved
by prefetching ,   that is, requesting the next parameter before it is needed. The
worker can then perform computation while its request is being processed by the input
task.
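
A minimal sketch of the structure in Figure 1.14, with queues standing in for channels;
the parameter values, worker count, and placeholder computation are illustrative
assumptions, not part of the text.

import threading, queue

PARAMS = list(range(10))                    # illustrative parameter values
NUM_RESULTS = len(PARAMS)
NUM_WORKERS = 3

request_q = queue.Queue()                   # many-to-one channel: workers -> input task
output_q = queue.Queue()                    # many-to-one channel: workers -> output task

def input_task():
    served = 0
    while served < NUM_WORKERS:
        reply_channel = request_q.get()     # a request carries the worker's reply channel
        if PARAMS:
            reply_channel.put(PARAMS.pop(0))
        else:
            reply_channel.put(None)         # no parameters left: tell the worker to stop
            served += 1

def worker():
    reply_channel = queue.Queue()           # dedicated reply channel for this worker
    while True:
        request_q.put(reply_channel)        # request the next parameter (prefetching omitted)
        p = reply_channel.get()             # wait for the reply
        if p is None:
            break
        output_q.put((p, p * p))            # placeholder computation: square the parameter

def output_task(expected):
    for _ in range(expected):
        print("result:", output_q.get())    # results arrive in whatever order workers finish

threads = [threading.Thread(target=input_task)]
threads += [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads: t.start()
output_task(NUM_RESULTS)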

Because this program uses many-to-one communication structures, the order in
which computations are performed is not necessarily determined. However, this
nondeterminism affects only the allocation of problems to workers and the ordering of
results in the output file, not the actual results computed.
