FEATURE ARTICLE

OpenMP: An Industry-Standard API for Shared-Memory Programming

LEONARDO DAGUM AND RAMESH MENON
SILICON GRAPHICS INC.

♦ ♦ ♦

OpenMP, the portable alternative to message passing, offers a powerful new way to achieve scalability in software. This article compares OpenMP to existing parallel-programming models.

Application developers have long recognized that scalable hardware and software are necessary for parallel scalability in application performance. Both have existed for some time in their lowest common denominator form. Scalable hardware—as physically distributed memories connected through a scalable interconnection network (such as a multistage interconnect, k-ary n-cube, or fat tree)—has been commercially available since the 1980s. When developers build such systems without any provision for cache coherence, the systems are essentially "zeroth order" scalable architectures. They provide only a scalable interconnection network, and the burden of scalability falls on the software. As a result, scalable software for such systems exists, at some level, only in a message-passing model. Message passing is the native model for these architectures, and developers can only build higher-level models on top of it.

Unfortunately, many in the high-performance computing world implicitly assume that the only way to achieve scalability in parallel software is with a message-passing programming model. This is not necessarily true. A class of multiprocessor architectures is now emerging that offers scalable hardware support for cache coherence. These are generally called scalable shared-memory multiprocessor (SSMP) architectures.1 For SSMP systems, the native programming model is shared memory, and message passing is built on top of the shared-memory model. On such systems, software scalability is straightforward to achieve with a shared-memory programming model.

In a shared-memory system, every processor has direct access to the memory of every other processor, meaning it can directly load or store any shared address. The programmer also can declare certain pieces of memory as private to the processor, which provides a simple yet powerful model for expressing and managing parallelism in an application.

Despite its simplicity and scalability, many parallel applications developers have resisted adopting a shared-memory programming model for one reason: portability. Shared-memory system vendors have created their own proprietary extensions to Fortran or C for parallel-software development. However, the absence of portability has forced many developers to adopt a portable message-passing model such as the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM). This article presents a portable alternative to message passing: OpenMP.


OpenMP was designed to exploit certain characteristics of shared-memory architectures. The ability to directly access memory throughout the system (with minimum latency and no explicit address mapping), combined with fast shared-memory locks, makes shared-memory architectures best suited for supporting OpenMP.

Table 1: Comparing standard parallel-programming models.

                              X3H5   MPI   Pthreads    HPF     OpenMP
Scalable                      no     yes   sometimes   yes     yes
Incremental parallelization   yes    no    no          no      yes
Portable                      yes    yes   yes         yes     yes
Fortran binding               yes    yes   no          yes     yes
High level                    yes    no    no          yes     yes
Supports data parallelism     yes    no    no          yes     yes
Performance oriented          no     yes   no          tries   yes

Why a new standard?
The closest approximation to a standard shared-memory programming model is the now-dormant ANSI X3H5 standards effort.2 X3H5 was never formally adopted as a standard largely because interest waned as distributed-memory message-passing systems (MPPs) came into vogue. However, even though hardware vendors support it to varying degrees, X3H5 has limitations that make it unsuitable for anything other than loop-level parallelism. Consequently, applications adopting this model are often limited in their parallel scalability.

MPI has effectively standardized the message-passing programming model. It is a portable, widely available, and accepted standard for writing message-passing programs. Unfortunately, message passing is generally a difficult way to program. It requires that the program's data structures be explicitly partitioned, so the entire application must be parallelized to work with the partitioned data structures. There is no incremental path to parallelize an application. Furthermore, modern multiprocessor architectures increasingly provide hardware support for cache coherence; therefore, message passing is becoming unnecessary and overly restrictive for these systems.

Pthreads is an accepted standard for shared memory in low-end systems. However, it is not targeted at the technical, HPC space. There is little Fortran support for pthreads, and it is not a scalable approach. Even for C applications, the pthreads model is awkward, because it is lower-level than necessary for most scientific applications and is targeted more at providing task parallelism, not data parallelism. Also, portability to unsupported platforms requires a stub library or equivalent workaround.

Researchers have defined many new languages for parallel computing, but these have not found mainstream acceptance. High-Performance Fortran (HPF) is the most popular multiprocessing derivative of Fortran, but it is mostly geared toward distributed-memory systems.

Independent software developers of scientific applications, as well as government laboratories, have a large volume of Fortran 77 code that needs to be parallelized in a portable fashion. The rapid and widespread acceptance of shared-memory multiprocessor architectures—from the desktop to "glass houses"—has created a pressing demand for a portable way to program these systems. Developers need to parallelize existing code without completely rewriting it, but this is not possible with most existing parallel-language standards. Only OpenMP and X3H5 allow incremental parallelization of existing code, and of the two only OpenMP is scalable (see Table 1). OpenMP is targeted at developers who need to quickly parallelize existing scientific code, but it remains flexible enough to support a much broader application set. OpenMP provides an incremental path for parallel conversion of any existing software. It also provides scalability and performance for a complete rewrite or entirely new development.

What is OpenMP?
At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism. It leaves the base language unspecified, and vendors can implement OpenMP in any Fortran compiler. Naturally, to support pointers and allocatables, Fortran 90 and Fortran 95 require the OpenMP implementation to include additional semantics over Fortran 77.
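As a minimal sketch of these two ingredients working together (this fragment is illustrative and not from the article; the program name is hypothetical), a single directive pair marks a parallel region, and two runtime library routines query the team that executes it:

      program hello_omp
      integer myid, nthreads
      integer OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
c the directive below creates a team of threads; the library
c calls report each member's id and the team size
!$OMP PARALLEL PRIVATE(myid) SHARED(nthreads)
      myid = OMP_GET_THREAD_NUM()
!$OMP MASTER
      nthreads = OMP_GET_NUM_THREADS()
!$OMP END MASTER
      print *, 'hello from thread ', myid
!$OMP END PARALLEL
      print *, 'team size was ', nthreads
      stop
      end

Compiled without OpenMP support, the directives are simply comments and the program runs serially, which is the property that makes incremental parallelization possible.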


OpenMP leverages many of the X3H5 concepts while extending them to support coarse-grain parallelism. Table 2 compares OpenMP with the directive bindings specified by X3H5 and the MIPS Pro Doacross model,3 and it summarizes the language extensions into one of three categories: control structure, data environment, or synchronization. The standard also includes a callable runtime library with accompanying environment variables.

Table 2: Comparing X3H5 directives, OpenMP, and MIPS Pro Doacross functionality.

                          X3H5                       OpenMP                              MIPS Pro
Overview
Orphan scope              None, lexical scope only   Yes, binding rules specified        Yes, through callable runtime
Query functions           None                       Standard                            Yes
Runtime functions         None                       Standard                            Yes
Environment variables     None                       Standard                            Yes
Nested parallelism        Allowed                    Allowed                             Serialized
Throughput mode           Not defined                Yes                                 Yes
Conditional compilation   None                       _OPENMP, !$                         C$
Sentinel                  C$PAR, C$PAR&              !$OMP, !$OMP&                       C$, C$&
Control structure
Parallel region           Parallel                   Parallel                            Doacross
Iterative                 Pdo                        Do                                  Doacross
Noniterative              Psection                   Section                             User coded
Single process            Psingle                    Single, Master                      User coded
Early completion          Pdone                      User coded                          User coded
Sequential ordering       Ordered PDO                Ordered                             None
Data environment
Autoscope                 None                       Default(private), Default(shared)   shared default
Global objects            Instance Parallel          Threadprivate                       Linker: -Xlocal
                          (p + 1 instances)          (p instances)                       (p instances)
Reduction attribute       None                       Reduction                           Reduction
Private initialization    None                       Firstprivate, Copyin                Copyin
Private persistence       None                       Lastprivate                         Lastlocal
Synchronization
Barrier                   Barrier                    Barrier                             mp_barrier
Synchronize               Synchronize                Flush                               synchronize
Critical section          Critical Section           Critical                            mp_setlock, mp_unsetlock
Atomic update             None                       Atomic                              None
Locks                     None                       Full functionality                  mp_setlock, mp_unsetlock

Several vendors have products—including compilers, development tools, and performance-analysis tools—that are OpenMP aware. Typically, these tools understand the semantics of OpenMP constructs and hence aid the process of writing programs. The OpenMP Architecture Review Board includes representatives from Digital, Hewlett-Packard, Intel, IBM, Kuck and Associates, and Silicon Graphics.

All of these companies are actively developing compilers and tools for OpenMP. OpenMP products are available today from Silicon Graphics and other vendors. In addition, a number of independent software vendors plan to use OpenMP in future products. (For information on individual products, see www.openmp.org.)

A simple example
Figure 1 presents a simple example of computing π using OpenMP.4 This example illustrates how to parallelize a simple loop in a shared-memory programming model. The code would look similar with either the Doacross or the X3H5 set of directives (except that X3H5 does not have a reduction attribute, so you would have to code it yourself).

Program execution begins as a single process. This initial process executes serially, and we can set up our problem in a standard sequential manner, reading and writing stdout as necessary. When we first encounter a parallel construct, in this case a parallel do, the runtime forms a team of one or more processes and creates the data environment for each team member. The data environment consists of one private variable, x, one reduction variable, sum, and one shared variable, w. All references to x and sum inside the parallel region address private, nonshared copies.

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
        x = w * (i - 0.5d0)
        sum = sum + f(x)
      enddo
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end

Figure 1. Computing π in parallel using OpenMP.

The reduction attribute takes an operator, such that at the end of the parallel region it reduces the private copies to the master copy using the specified operator. All references to w in the parallel region address the single master copy. The loop index variable, i, is private by default. The compiler takes care of assigning the appropriate iterations to the individual team members, so in parallelizing this loop the user need not even know how many processors it runs on.

There might be additional control and synchronization constructs within the parallel region, but not in this example. The parallel region terminates with the end do, which has an implied barrier. On exit from the parallel region, the initial process resumes execution using its updated data environment. In this case, the only change to the master's data environment is the reduced value of sum.

This model of execution is referred to as the fork/join model. Throughout the course of a program, the initial process can fork and join many times. The fork/join execution model makes it easy to get loop-level parallelism out of a sequential program. Unlike in message passing, where the program must be completely decomposed for parallel execution, the shared-memory model makes it possible to parallelize just at the loop level without decomposing the data structures. Given a working sequential program, it becomes fairly straightforward to parallelize individual loops incrementally and thereby immediately realize the performance advantages of a multiprocessor system.
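To see what this incremental path can look like, here is a brief illustrative sketch (the routine and array names are hypothetical, not from the article): two loops in an existing routine are converted independently, and each parallel do is its own fork/join while the code between them stays serial.

      subroutine update(a, b, n)
      integer n, i
      double precision a(n), b(n)
c first loop converted: the team forks here and joins at the end do
!$OMP PARALLEL DO SHARED(a, n)
      do i = 1, n
        a(i) = a(i) + 1.0d0
      enddo
c statements between the two regions still execute serially
c on the initial process
      b(1) = a(1)
c second loop converted later, as a separate fork/join
!$OMP PARALLEL DO SHARED(a, b, n)
      do i = 2, n
        b(i) = 2.0d0 * a(i)
      enddo
      return
      end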
For comparison with message passing, Figure 2 presents the same example using MPI. Clearly, there is additional complexity just in setting up the problem, because we must begin with a team of parallel processes. Consequently, we need to isolate a root process to read and write stdout. Because there is no globally shared data, we must explicitly broadcast the input parameters (in this case, the number of intervals for the integration) to all the processors. Furthermore, we must explicitly manage the loop bounds. This requires identifying each processor (myid) and knowing how many processors will be used to execute the loop (numprocs).

      program compute_pi
      include 'mpif.h'
      double precision mypi, pi, w, sum, x, f, a
      integer n, myid, numprocs, i, rc
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

      if ( myid .eq. 0 ) then
        print *, 'Enter number of intervals: '
        read *, n
      endif
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
        x = w * (i - 0.5d0)
        sum = sum + f(x)
      enddo
      mypi = w * sum
c collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     $                MPI_COMM_WORLD,ierr)
c node 0 prints the answer
      if (myid .eq. 0) then
        print *, 'computed pi = ', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end

Figure 2. Computing π in parallel using MPI.


When we finally get to the loop, we can only sum into our private value for mypi. To reduce across processors we use the MPI_Reduce routine and sum into pi. The storage for pi is replicated across all processors, even though only the root process needs it. As a general rule, message-passing programs waste more storage than shared-memory programs.5 Finally, we can print the result, again making sure to isolate just one process for this step to avoid printing numprocs messages.

It is also interesting to see how this example looks using pthreads (see Figure 3). Naturally, it's written in C, but we can still compare functionality with the Fortran examples given in Figures 1 and 2.

The pthreads version is more complex than either the OpenMP or the MPI versions:

• First, pthreads is aimed at providing task parallelism, whereas the example is one of data parallelism—parallelizing a loop. The example shows why pthreads has not been widely used for scientific applications.
• Second, pthreads is somewhat lower-level than we need, even in a task- or threads-based model. This becomes clearer as we go through the example.

As with the MPI version, we need to know how many threads will execute the loop and we must determine their IDs so we can manage the loop bounds. We get the thread number as a command-line argument and use it to allocate an array of thread IDs. At this time, we also initialize a lock, reduction_mutex, which we'll need for reducing our partial sums into a global sum for π.

#include <pthread.h>
#include <stdio.h>
pthread_mutex_t reduction_mutex;
pthread_t *tid;
int n, num_threads;
double pi, w;

double f(a)
double a;
{
  return (4.0 / (1.0 + a*a));
}

void *PIworker(void *arg)
{
  int i, myid;
  double sum, mypi, x;
  /* set individual id to start at 0 */
  myid = pthread_self()-tid[0];
  /* integrate function */
  sum = 0.0;
  for (i=myid+1; i<=n; i+=num_threads) {
    x = w*((double)i - 0.5);
    sum += f(x);
  }
  mypi = w*sum;
  /* reduce value */
  pthread_mutex_lock(&reduction_mutex);
  pi += mypi;
  pthread_mutex_unlock(&reduction_mutex);
  return(0);
}

void main(argc,argv)
int argc;
char *argv[];
{
  int i;
  /* check command line */
  if (argc != 3) {
    printf("Usage: %s Num-intervals Num-threads\n", argv[0]);
    exit(0);
  }
  /* get num intervals and num threads from command line */
  n = atoi(argv[1]);
  num_threads = atoi(argv[2]);
  w = 1.0 / (double) n;
  pi = 0.0;
  tid = (pthread_t *) calloc(num_threads, sizeof(pthread_t));
  /* initialize lock */
  if (pthread_mutex_init(&reduction_mutex, NULL))
    fprintf(stderr, "Cannot init lock\n"), exit(1);
  /* create the threads */
  for (i=0; i<num_threads; i++)
    if(pthread_create(&tid[i], NULL, PIworker, NULL))
      fprintf(stderr,"Cannot create thread %d\n",i), exit(1);
  /* join threads */
  for (i=0; i<num_threads; i++)
    pthread_join(tid[i], NULL);
  printf("computed pi = %.16f\n", pi);
}

Figure 3. Computing π in parallel using pthreads.

Our basic approach is to start a worker thread, PIworker, for every processor we want to work on the loop. In PIworker, we first compute a zero-based thread ID and use this to map the loop iterations. The loop then computes the partial sums into mypi. We add these into the global result pi, making sure to protect against a race condition by locking. Finally, we need to explicitly join all our threads before we can print out the result of the integration.

All the data scoping is implicit; that is, global variables are shared and automatic variables are private. There is no simple mechanism in pthreads for making global variables private. Also, implicit scoping is more awkward in Fortran because the language is not as strongly scoped as C.

In terms of performance, all three models are comparable for this simple example. Table 3 presents the elapsed time in seconds for each program when run on a Silicon Graphics Origin2000 server, using 10^9 intervals for each integration. All three models are exhibiting excellent scalability on a per node basis (there are two CPUs per node in the Origin2000), as expected for this embarrassingly parallel algorithm.

Table 3: Time (in seconds) to compute π using 10^9 intervals with three standard parallel-programming models.

CPUs   OpenMP   MPI     Pthreads
1      107.7    121.4   115.4
2       53.9     60.7    62.5
4       27.0     30.3    32.4
6       17.9     20.4    22.0
8       13.5     15.2    16.7

Scalability
Although simple and effective, loop-level parallelism is usually limited in its scalability, because it leaves some constant fraction of sequential work in the program that by Amdahl's law can quickly overtake the gains from parallel execution. It is important, however, to distinguish between the type of parallelism (for example, loop-level versus coarse-grained) and the programming model. The type of parallelism exposed in a program depends on the algorithm and data structures employed and not on the programming model (to the extent that those algorithms and data structures can be reasonably expressed within a given model). Therefore, given a parallel algorithm and a scalable shared-memory architecture, a shared-memory implementation scales as well as a message-passing implementation.
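To make the Amdahl's-law limit concrete (the numbers here are an illustration, not from the article): if a fraction f of the work is parallelized across P processors, the best possible speedup is

    S(P) = 1 / ((1 - f) + f/P)

so with f = 0.95 and P = 8 the speedup is at most about 5.9, and no number of processors can push it beyond 1/(1 - f) = 20.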
OpenMP introduces the powerful concept of orphan directives that simplify the task of implementing coarse-grain parallel algorithms. Orphan directives are directives encountered outside the lexical extent of the parallel region. Coarse-grain parallel algorithms typically consist of only a few parallel regions, with most of the execution taking place within those regions. In implementing a coarse-grained parallel algorithm, it becomes desirable, and often necessary, to be able to specify control or synchronization from anywhere inside the parallel region, not just from the lexically contained portion. OpenMP provides this functionality by specifying binding rules for all directives and allowing them to be encountered dynamically in the call chain originating from the parallel region. In contrast, X3H5 does not allow directives to be orphaned, so all the control and synchronization for the program must be lexically visible in the parallel construct. This limitation restricts the programmer and makes any nontrivial coarse-grained parallel application virtually impossible to write.

A coarse-grain example
To highlight additional features in the standard, Figure 4 presents a slightly more complicated example, computing the energy spectrum for a field. This is essentially a histogramming problem with a slight twist—it also generates the sequence in parallel. We could easily parallelize the histogramming loop and the sequence generation as in the previous example, but in the interest of performance we would like to histogram as we compute in order to preserve locality.

The program goes immediately into a parallel region with a parallel directive, declaring the variables field and ispectrum as shared, and making everything else private with a default clause. The default clause does not affect common blocks, so setup remains a shared data structure.

Within the parallel region, we call initialize_field() to initialize the field and ispectrum arrays. Here we have an example of orphaning the do directive.

      parameter(N = 512, NZ = 16)
      common /setup/ npoints, nzone
      dimension field(N), ispectrum(NZ)
      data npoints, nzone / N, NZ /
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(field, ispectrum)
      call initialize_field(field, ispectrum)
      call compute_field(field, ispectrum)
      call compute_spectrum(field, ispectrum)
!$OMP END PARALLEL
      call display(ispectrum)
      stop
      end

      subroutine initialize_field(field, ispectrum)
      common /setup/ npoints, nzone
      dimension field(npoints), ispectrum(nzone)
!$OMP DO
      do i=1, nzone
        ispectrum(i) = 0.0
      enddo
!$OMP END DO NOWAIT
!$OMP DO
      do i=1, npoints
        field(i) = 0.0
      enddo
!$OMP END DO NOWAIT
!$OMP SINGLE
      field(npoints/4) = 1.0
!$OMP END SINGLE
      return
      end

      subroutine compute_spectrum(field, ispectrum)
      common /setup/ npoints, nzone
      dimension field(npoints), ispectrum(nzone)
!$OMP DO
      do i= 1, npoints
        index = field(i)*nzone + 1
!$OMP ATOMIC
        ispectrum(index) = ispectrum(index) + 1
      enddo
!$OMP END DO NOWAIT
      return
      end

Figure 4. A coarse-grained example.

With the X3H5 directives, we would have to move these loops into the main program so that they could be lexically visible within the parallel directive. Clearly, that restriction makes it difficult to write good modular parallel programs. We use the nowait clause on the end do directives to eliminate the implicit barrier. Finally, we use the single directive when we initialize a single internal field point. The end single directive also can take a nowait clause, but to guarantee correctness we need to synchronize here.

The field gets computed in compute_field. This could be any parallel Laplacian solver, but in the interest of brevity we don't include it here. With the field computed, we are ready to compute the spectrum, so we histogram the field values using the atomic directive to eliminate race conditions in the updates to ispectrum. The end do here has a nowait because the parallel region ends after compute_spectrum() and there is an implied barrier when the threads join.

OpenMP design objective
OpenMP was designed to be a flexible standard, easily implemented across different platforms. As we discussed, the standard comprises four distinct parts:

• control structure,
• the data environment,
• synchronization, and
• the runtime library.

Control structure
OpenMP strives for a minimalist set of control structures. Experience has indicated that only a few control structures are necessary for writing most parallel applications. For example, in the Doacross model, the only control structure is the doacross directive, yet this is arguably the most widely used shared-memory programming model for scientific computing. Many of the control structures provided by X3H5 can be trivially programmed in OpenMP with no performance penalty. OpenMP includes control structures only in those instances where a compiler can provide both functionality and performance over what a user could reasonably program.

Our examples used only three control structures: parallel, do, and single. Clearly, the compiler adds functionality in the parallel and do directives. For single, the compiler adds performance by allowing the first thread reaching the single directive to execute the code. This is nontrivial for a user to program.



Data environment
Associated with each process is a unique data environment. The initial process at program start-up has an initial data environment that exists for the duration of the program. It constructs new data environments only for new processes created during program execution. The objects constituting a data environment might have one of three basic attributes: shared, private, or reduction.

The concept of reduction as an attribute is generalized in OpenMP. It allows the compiler to efficiently implement reduction operations. This is especially important on cache-based systems, where the compiler can eliminate any false sharing. On large-scale SSMP architectures, the compiler also might choose to implement tree-based reductions for even better performance.

OpenMP has a rich data environment. In addition to the reduction attribute, it allows private initialization with firstprivate and copyin, and private persistence with lastprivate. None of these features exist in X3H5, but experience has indicated a real need for them.
tions. This is especially important on cache- tize only certain elements because of a com-
based systems where the compiler can eliminate pound global object. OpenMP allows individ-
any false sharing. On large-scale SSMP archi- ual elements of a compound global object to
tectures, the compiler also might choose to im- appear in a private list.
plement tree-based reductions for even better
performance. Synchronization
OpenMP has a rich data environment. In ad- There are two types of synchronization: im-
dition to the reduction attribute, it allows plicit and explicit. Implicit synchronization
private initialization with firstprivate points exist at the beginning and end of parallel
and copyin, and private persistence with constructs and at the end of control constructs
lastprivate. None of these features exist in (for example, do and single). In the case of
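A brief sketch of the p-copy model (illustrative only; the common block /work/ and the routine name are hypothetical): the threadprivate directive gives each team member its own copy of the block, and copyin seeds those copies from the initial process.

      subroutine use_scratch()
      integer id, OMP_GET_THREAD_NUM
      double precision scratch(100)
      common /work/ scratch
!$OMP THREADPRIVATE(/work/)
      scratch(1) = 1.0d0
c each thread starts with the master's copy of /work/ (copyin)
c and then updates only its own copy
!$OMP PARALLEL PRIVATE(id) COPYIN(/work/)
      id = OMP_GET_THREAD_NUM()
      scratch(1) = scratch(1) + id
!$OMP END PARALLEL
      return
      end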


Synchronization
There are two types of synchronization: implicit and explicit. Implicit synchronization points exist at the beginning and end of parallel constructs and at the end of control constructs (for example, do and single). In the case of do, sections, and single, the implicit synchronization can be removed with the nowait clause.

The user specifies explicit synchronization to manage order or data dependencies. Synchronization is a form of interprocess communication and, as such, can greatly affect program performance. In general, minimizing a program's synchronization requirements (explicit and implicit) achieves the best performance. For this reason, OpenMP provides a rich set of synchronization features so developers can best tune the synchronization in an application.

We saw an example using the Atomic directive. This directive allows the compiler to take advantage of available hardware for implementing atomic updates to a variable. OpenMP also provides a Flush directive for creating more complex synchronization constructs such as point-to-point synchronization. For ultimate performance, point-to-point synchronization can eliminate the implicit barriers in the energy-spectrum example. All the OpenMP synchronization directives can be orphaned. As discussed earlier, this is critically important for implementing coarse-grained parallel algorithms.
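As a rough sketch of what such a point-to-point construct might look like (illustrative only; the routine and variable names are hypothetical), one thread publishes a value and raises a flag while another spins on the flag, with Flush making the updates visible in the right order:

      subroutine handoff()
      integer flag
      double precision val
      flag = 0
!$OMP PARALLEL SECTIONS SHARED(flag, val)
!$OMP SECTION
c producer: write the value, then publish the flag
      val = 3.14d0
!$OMP FLUSH(val)
      flag = 1
!$OMP FLUSH(flag)
!$OMP SECTION
c consumer: wait until the flag is visible, then read the value
      do while (flag .eq. 0)
!$OMP FLUSH(flag)
      enddo
!$OMP FLUSH(val)
      print *, 'received ', val
!$OMP END PARALLEL SECTIONS
      return
      end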
Runtime library and environment variables
In addition to the directive set described, OpenMP provides a callable runtime library and accompanying environment variables. The runtime library includes query and lock functions. The runtime functions allow an application to specify the mode in which it should run. An application developer might wish to maximize the system's throughput performance, rather than time to completion. In such cases, the developer can tell the system to dynamically set the number of processes used to execute parallel regions. This can have a dramatic effect on the system's throughput performance with only a minimal impact on the program's time to completion.

The runtime functions also allow a developer to specify when to enable nested parallelism, which allows the system to act accordingly when it encounters a nested parallel construct. On the other hand, by disabling it, a developer can write a parallel library that will perform in an easily predictable fashion whether encountered dynamically from within or outside a parallel region.

OpenMP also provides a conditional compilation facility both through the C language preprocessor (CPP) and with a Fortran comment sentinel. This allows calls to the runtime library to be protected as compiler directives, so OpenMP code can be compiled on non-OpenMP systems without linking in a stub library or using some other awkward workaround.

OpenMP provides standard environment variables to accompany the runtime library functions where it makes sense and to simplify the start-up scripts for portable applications. This helps application developers who, in addition to creating portable applications, need a portable runtime environment.
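As a sketch of how the sentinel and the runtime library combine (illustrative only; the routine name is hypothetical), the lines guarded by the !$ sentinel are compiled only when OpenMP is enabled, so the same source also builds as a serial program; an environment variable such as OMP_NUM_THREADS can then control the team size at run time without changing the code.

      subroutine report_threads()
      integer nthreads
!$    integer OMP_GET_MAX_THREADS
      nthreads = 1
c the guarded lines below vanish on a non-OpenMP compiler
!$    call OMP_SET_DYNAMIC(.true.)
!$    nthreads = OMP_GET_MAX_THREADS()
      print *, 'will use up to ', nthreads, ' threads'
      return
      end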
OpenMP is supported by a number of hardware and software vendors, and we expect support to grow. OpenMP has been designed to be extensible and evolve with user requirements. The OpenMP Architecture Review Board (OARB) was created to provide long-term support and enhancements of the OpenMP specifications. The OARB charter includes interpreting OpenMP specifications, developing future OpenMP standards, addressing issues of validation of OpenMP implementations, and promoting OpenMP as a de facto standard.

Possible extensions for Fortran include greater support for nested parallelism and support for shaped arrays. Nested parallelism is the ability to create a new team of processes from within an existing team. It can be useful in problems exhibiting both task and data parallelism. For example, a natural application for nested parallelism would be parallelizing a task queue wherein the tasks involve large matrix multiplies. Shaped arrays refers to the ability to explicitly assign the storage for arrays to specific memory nodes. This ability is useful for improving performance on Non-Uniform Memory architectures (NUMAs) by reducing the number of non-local memory references made by a processor.

The OARB is currently developing the specification of C and C++ bindings and is also developing validation suites for testing OpenMP implementations. ♦


References
1. D.E. Lenoski and W.D. Weber, Scalable Shared-Memory Multiprocessing, Morgan Kaufmann, San Francisco, 1995.
2. B. Leasure, ed., Parallel Processing Model for High-Level Programming Languages, proposed draft, American National Standard for Information Processing Systems, Apr. 5, 1994.
3. MIPSpro Fortran77 Programmer's Guide, Silicon Graphics, Mountain View, Calif., 1996; http://techpubs.sgi.com/library/dynaweb_bin/0640/bi/nph-dynaweb.cgi/dynaweb/SGI_Developer/MproF77_PG/.
4. S. Ragsdale, ed., Parallel Programming Primer, Intel Scientific Computers, Santa Clara, Calif., March 1990.
5. J. Brown, T. Elken, and J. Taft, Silicon Graphics Technical Servers in the High Throughput Environment, Silicon Graphics Inc., 1995; http://www.sgi.com/tech/challenge.html.

Leonardo Dagum works for Silicon Graphics in the System Performance group, where he helped define the OpenMP Fortran API. His research interests include parallel algorithms and performance modelling for parallel systems. He is the author of over 30 refereed publications relating to these subjects. He received his MS and PhD in aeronautics and astronautics from Stanford. Contact him at M/S 580, 2011 N. Shoreline Blvd., Mountain View, CA 94043-1389; [email protected].

Ramesh Menon is Silicon Graphics' representative to the OpenMP Architecture Review Board and served as the board's first chairman. He managed the writing of the OpenMP Fortran API. His research interests include parallel-programming models, performance characterization, and computational mechanics. He received an MS in mechanical engineering from Duke University and a PhD in aerospace engineering from Texas A&M. He was awarded a National Science Foundation Fellowship and was a principal contributor to the NSF Grand Challenge Coupled Fields project at the University of Colorado, Boulder. Contact him at [email protected].
