SUMMARY
We consider the suitability of the Java concurrent constructs for writing high-performance
SPMD code for parallel machines. More specifically, we investigate implementing a financial
application in Java on a distributed-memory parallel machine. Although Java was not
expressly targeted to such applications and architectures, we conclude that efficient
implementations are feasible. Finally, we propose a library of Java methods to facilitate SPMD
programming. © 1997 by John Wiley & Sons, Ltd.
1. MOTIVATION
Although Java was not specifically designed as a high-performance parallel-computing
language, it does include concurrent objects (threads), and its widespread acceptance
makes it an attractive candidate for writing portable computationally-intensive parallel
applications. In particular, Java has become a popular choice for numerical financial codes,
an example of which is arbitrage – detecting when the buying and selling of securities
is temporarily profitable. These applications involve sophisticated modeling techniques
such as successive over-relaxation (SOR) and Monte Carlo methods[1]. Other numerical
financial applications include data mining (pattern discovery) and cryptography (secure
transactions).
In this paper, we use an SOR code for evaluating American options (see Figure 1)[1] to
explore the suitability of using Java as a high-performance parallel-computing language.
This work is being conducted in the context of a research effort to implement a Java run-
time system (RTS) for the IBM POWERparallel System SP machine[2], which is designed
to effectively scale to large numbers of processors. The RTS is being written in C with
calls to MPI (message passing interface)[3] routines. Plans are to move to a Java plus MPI
version when one becomes available.
The typical programming idiom for highly parallel machines is called data-parallel or
single-program multiple-data (SPMD), where the data provide the parallel dimension.
Parallelism is conceptually specified as a loop whose iterates operate on elements of a,
perhaps multidimensional, array. Data dependences between parallel-loop iterates lead
to a producer–consumer type of sharing, wherein one iterate writes variables that are
later read by another, or collective communication, wherein all iterates participate. The
communication pattern between iterates is often very regular, for example a bidirectional
flow of variables between consecutive iterates (as in the code in Figure 1).
This paper explores the suitability of the Java concurrency constructs for writing SPMD
programs. In particular, the paper:
1. identifies the differences between the parallelism supported by Java and data paral-
lelism
2. discusses compiler optimizations of Java programs, and the limits imposed on such
optimizations because of the memory-consistency model defined at the Java language
level
3. discusses the key features of data-parallel programming, and how these features can
be implemented using the Java concurrent and synchronization constructs
4. identifies a set of library methods that facilitate data-parallel programming in Java.
As discussed in the next Section, the Java parallel constructs were designed for interactive
and Internet (distributed) programming, and lie somewhere between the above two ex-
tremes. However, with some minor modifications, SPMD programming could be expressed
in Java more naturally.
Java threads communicate through shared memory. For distributed machines, Java’s Remote
Method Invocation provides an RPC-like interface that is appropriate for coarse-grained parallelism.
Java threads synchronize through statements or methods that have a synchronized
attribute. These statements and methods function as monitors: only one thread at a time is
allowed to execute synchronized code that accesses the same variables. Locks are not provided
explicitly in the language, but are employed implicitly in the implementation of monitors
(they are embedded in the bytecode instructions monitorenter and monitorexit[12]).
Conceptually, each object has a lock and a wait queue associated with it; consequently, a
synchronized block of code is associated with the objects it accesses.
When two threads compete for entry to synchronized code by attempting to lock the
same object, the thread that successfully acquires the lock continues execution, while the
other thread is placed on the wait queue of the object. When executing synchronized code,
a thread can explicitly transfer control by using notify, notifyAll and wait methods.
notify moves one randomly chosen thread from the wait queue of the associated object to
the run queue, while notifyAll moves all threads from the wait queue to the run queue.
If the thread owning the monitor executes a wait, it relinquishes the lock and places itself
on the object’s wait queue, allowing another thread to enter the monitor.
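As a concrete illustration of these constructs (the class below is ours, not from the paper), a
one-slot mailbox can be written as a monitor whose synchronized methods use wait and
notifyAll:

    // A minimal monitor: a one-slot mailbox guarded by the object's lock.
    // Class and method names are illustrative.
    public class Mailbox {
        private Object item = null;

        // Only one thread at a time may execute either synchronized method.
        public synchronized void put(Object value) throws InterruptedException {
            while (item != null) {
                wait();              // release the lock and join the object's wait queue
            }
            item = value;
            notifyAll();             // move waiting threads back to the run queue
        }

        public synchronized Object get() throws InterruptedException {
            while (item == null) {
                wait();
            }
            Object value = item;
            item = null;
            notifyAll();
            return value;
        }
    }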
The Java shared-memory model is not the sequentially consistent model[13], which
is commonly used for writing multi-threaded programs. Instead, Java employs a form of
weak consistency[14], to allow for shared-data optimization opportunities. In the sequential
consistency model, any update to a shared variable must be immediately visible to all other threads. Java
allows a weaker memory model wherein an update to a shared variable is only guaranteed
to be visible to other threads when they execute synchronized code. For instance, a value
assigned by a thread to a variable outside a synchronized statement may not be immediately
visible to other threads because the implementation may elect to cache the variable until
the next synchronization point (with some restrictions). The following subsection discusses
the impact of the Java shared variable rules on compiler optimizations.
A shared variable can instead be declared volatile so that it will not be cached by any thread. Any update to a volatile variable
by a thread is immediately visible to all other threads; however, this disables some code
optimizations.1
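As an illustration of these rules (the class below is ours), a reader that polls a non-volatile flag
outside synchronized code may never observe the writer's update:

    // The reader may spin forever on a cached copy of 'done'; declaring the field
    // volatile (or synchronizing both accesses) guarantees the update becomes visible.
    public class FlagExample {
        private boolean done = false;   // shared; neither volatile nor accessed under a lock

        public void reader() {
            while (!done) {
                // busy-wait: the implementation may keep reading a cached value
            }
        }

        public void writer() {
            done = true;                // may remain in the writer's working memory
        }
    }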
Java defines the required ordering and coupling of the interactions between the memories
(the complete set of rules is complex and is not repeated here). For instance, the read/write
actions to be performed by the main memory for a variable must be executed in their order
of arrival. Likewise, the rules call for the main memory to be updated at synchronization
points by invalidating all copies in the working memory of a thread on entry to a monitor
(monitorenter) and by writing to main memory all newly generated values on exit from
a monitor (monitorexit).
Although the Java shared-data model is weakly consistent, it is not weak enough to allow
oblivious caching and out-of-order access of variables between synchronization points.
For example, Java requires that (a) the updates by one thread be totally ordered and (b) the
updates to any given variable/lock be totally ordered. Other languages, such as Ada[17],
only require that updates to shared variables be effective at points where threads (tasks
in Ada) explicitly synchronize, and therefore, an RTS can make better use of memory
hierarchies[18]. A Java implementation does not have enough freedom to fully exploit
optimization opportunities provided by caches and local memories. A stronger memory
consistency model at the language level forces an implementation to adopt at least the
same, if not a stronger, memory consistency model. A strong consistency model at the
implementation level increases the complexity of code optimizations – the compiler/RTS
has to consider interactions between threads at all points where shared variables are updated,
not just at synchronization points[19].
1 Note that the term volatile adopted by Java designates a variable as noncacheable, whereas in parallel pro-
gramming this term has traditionally meant that a variable is cacheable but may be updated by other processors[17].
Although Java provides language constructs for parallelism, there are several factors that
make expressing data parallelism in Java awkward. First, as noted in the previous Section,
there is no mechanism for creating and starting threads in a ThreadGroup simultaneously.
A forAll must be implemented as a serial loop in Java.
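A minimal sketch of such a loop (ours; the class and variable names are illustrative) is:

    // Threads are created and started one at a time, so startup is serialized.
    public class ForAllSketch {
        public static Thread[] forAll(int numThreads) {
            Thread[] workers = new Thread[numThreads];
            for (int i = 0; i < numThreads; i++) {
                final int id = i;                // iterate index owned by this thread
                workers[i] = new Thread() {
                    public void run() {
                        // body of the parallel loop for iterate 'id'
                    }
                };
                workers[i].start();
            }
            return workers;                      // callers typically join these threads later
        }
    }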
A simple high-level construct for creating multiple threads simultaneously is more de-
sirable. Furthermore, if thread creation is expensive and thread startup is not well synchro-
nized across the parallel machine, the serialization will result in idle time that degrades the
speedup of a program.
Secondly, implementing the producer–consumer data sharing using the Java
synchronized statements, wait, notify and notifyAll, is problematic for the fol-
lowing reasons. Because notifys are not queued, it is complicated to enforce a specific
ordering. Consider the producer–consumer code in Figure 3: since there is no guarantee
that the Produce method will be invoked after the Consume method, an early notify
will be lost. A programmer must therefore add mutual exclusion and auxiliary variables to
enforce the correct ordering, resulting in a low-level style of programming. It is not clear
that a compiler can always generate efficient code for these producer–consumer methods.
Another problem is that the implementation of notify selects a random thread on
the associated wait queue of an object; however, producer–consumer semantics requires
that the notify selects a specific thread from the wait queue. Therefore, to implement
producer–consumer synchronization, we use a separate object for each producer–consumer
pair to ensure that only one thread can appear on the wait queue.
The language specification[11] claims that wait and notify are ‘. . . especially ap-
propriate in situations where threads have a producer–consumer relationship’; however,
this mainly applies to bounded buffer applications where it is not necessary to associate
a producer with a specific consumer. Producer–consumer Java methods that use synchronized
statements and methods to implement a linked list are given in [21], for example. A
somewhat simpler Java implementation of producer–consumer methods that adheres to our
semantics is given in Figure 4.
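A minimal sketch along these lines (class and method names are ours) uses one channel object
per producer–consumer pair, plus an auxiliary flag so that an early notify is not lost:

    // One Channel object is shared by exactly one producer and one consumer, so at
    // most one thread can ever appear on its wait queue.
    public class Channel {
        private double value;
        private boolean full = false;   // auxiliary variable: records a notify that arrived early

        public synchronized void produce(double v) {
            value = v;
            full = true;
            notify();                   // at most the single consumer can be waiting here
        }

        public synchronized double consume() throws InterruptedException {
            while (!full) {
                wait();                 // an early notify is not lost: 'full' already records it
            }
            full = false;
            return value;
        }
    }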
Collective communication in Java must also employ mutual exclusion and auxiliary
variables. An example barrier code is given in Figure 5.
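A counting barrier in this style (ours; the code in the figure may differ) can be written with a
synchronized method, a counter and a phase variable:

    // All threads call await(); the last arrival releases the others, and the barrier is reusable.
    public class Barrier {
        private final int parties;      // number of participating threads
        private int waiting = 0;
        private int phase = 0;          // distinguishes successive uses of the barrier

        public Barrier(int parties) {
            this.parties = parties;
        }

        public synchronized void await() throws InterruptedException {
            int arrivalPhase = phase;
            waiting++;
            if (waiting == parties) {
                waiting = 0;
                phase++;
                notifyAll();            // release every thread waiting on this phase
            } else {
                while (phase == arrivalPhase) {
                    wait();
                }
            }
        }
    }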
Note that the RTS can capitalize on the MPI library to implement the Java producer–
consumer data sharing and collective communications. In this case, the message passing
routines are C functions with Java wrappers and can be invoked as native methods in a
data-parallel Java program.
A data-parallel version of the main loop of the SOR code from Figure 1 is given in
Figure 6. The main loop is executed by each thread on its local partition of the valuesOld
and valuesNew arrays. Each thread computes valuesNew using its valuesOld and the
boundary valuesOld elements of its neighbors. The thread then copies valuesNew into
valuesOld for the next outer-loop iterate. The individual producer–consumer sharing is
subsumed by barriers. The code in Figure 6 could be made more data parallel using the
native library methods, e.g. forAll, that we propose in the next Section.
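A sketch of the per-thread main loop described above (ours; the relaxation formula and
partition bounds are placeholders, and Barrier refers to the class sketched earlier):

    // Each worker owns the partition [lo, hi) of the shared arrays; indices 0 and
    // length-1 hold fixed boundary values, so lo >= 1 and hi <= length-1.
    public class SorWorker implements Runnable {
        private final double[] valuesOld, valuesNew;
        private final int lo, hi, iterations;
        private final Barrier barrier;

        public SorWorker(double[] valuesOld, double[] valuesNew, int lo, int hi,
                         int iterations, Barrier barrier) {
            this.valuesOld = valuesOld; this.valuesNew = valuesNew;
            this.lo = lo; this.hi = hi;
            this.iterations = iterations; this.barrier = barrier;
        }

        public void run() {
            try {
                for (int iter = 0; iter < iterations; iter++) {
                    // compute valuesNew on the local partition; the elements at lo-1
                    // and hi are the boundary valuesOld elements owned by the neighbours
                    for (int i = lo; i < hi; i++) {
                        valuesNew[i] = 0.5 * (valuesOld[i - 1] + valuesOld[i + 1]);  // placeholder formula
                    }
                    barrier.await();    // all partitions computed; synchronization also flushes updates
                    System.arraycopy(valuesNew, lo, valuesOld, lo, hi - lo);
                    barrier.await();    // all copies complete before the next sweep
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }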
The library methods for managing threads are given in Table 1. The start_all method
implements a forAll loop, and is parameterized with a keyword indicating how shared
data structures are localized and, in particular, how initial values are assigned. Static
distributions are blocked – that is, each thread is given an equal-sized consecutive chunk of
each array dimension. Dynamic distributions are also blocked, but elements may be migrated
during execution by the RTS to improve load balancing. Our RTS supports fractiling[22]
dynamic scheduling, which allocates iterates in decreasing-size chunks.
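As an illustration of the blocked distribution (the helper below is ours, not one of the proposed
library methods), each thread can compute the bounds of its chunk as follows:

    // Returns {lo, hi} (hi exclusive) for thread 'id' of 'p' threads over n elements,
    // giving each thread a consecutive chunk whose sizes differ by at most one.
    public class BlockDistribution {
        public static int[] bounds(int n, int p, int id) {
            int chunk = n / p;
            int rem = n % p;
            int lo = id * chunk + Math.min(id, rem);
            int hi = lo + chunk + (id < rem ? 1 : 0);
            return new int[] { lo, hi };
        }
    }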
The spmd_setProducerMap and spmd_setConsumerMap methods create thread communication
channels for an arbitrary topology. We also provide library methods to specify several
common communication-channel topologies, such as one-, two- and three-dimensional grids
or a binary tree (spmd_thread_1dgrid, spmd_thread_2dgrid, spmd_thread_3dgrid,
spmd_thread_tree).
The library methods for thread communication are given in Table 2. The spmd_produce
and spmd_consume methods are parameterized by a keyword specifying which thread the
caller is being synchronized with. The parameter can be a specific thread or a logical thread
based on the ThreadGroup topology. There are methods for broadcasting and multicasting
variables to threads in a ThreadGroup. A barrier method is provided to synchronize
phases of a parallel algorithm. The library includes methods for common reduction and
scan operations, which are the cornerstones of many parallel algorithms[23]. A reduction
applies a binary operation to reduce the elements of an array into a single element, while a
scan applies a binary operation to reduce subsequences of the elements of an array.
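To make the distinction concrete, sequential definitions of a sum reduction and an inclusive
scan (ours, not the library methods themselves) are:

    // reduce collapses the whole array to one element; scan produces one element per
    // prefix subsequence a[0..i], here using addition as the binary operation.
    public class ReduceScan {
        public static double reduce(double[] a) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                sum += a[i];
            }
            return sum;
        }

        public static double[] scan(double[] a) {
            double[] prefix = new double[a.length];
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                sum += a[i];
                prefix[i] = sum;
            }
            return prefix;
        }
    }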
[Table 1. Library methods for managing threads: Name, Function]
[Table 2. Library methods for thread communication: Name, Function]
6. CONCLUSION
The concurrent constructs of Java were selected to facilitate the writing of interactive
and Internet applications. In this paper, we explored writing SPMD programs in Java,
and demonstrated that it is possible, if somewhat awkward, with the current language
specification. We identified several features that would make expressing SPMD parallelism
in Java more natural, for example, by including keywords to specify that threads are to be
started simultaneously, and by requiring that notifys be queued. In lieu of the addition of
these features, we propose that a standard library of SPMD routines be adopted. This paper
is an initial step towards this end.
Since the goal of parallel computing is primarily efficiency, there are other changes to the
language that are desirable for SPMD computing. One of the most important of these is the
explicit declaration of shared variables and the relaxing of the Java memory consistency
model, so that more code optimizations can be performed.
REFERENCES
1. J. Dewynne, P. Wilmott and S. Howison, Option Pricing Mathematical Models and Computation,
Oxford Financial Press, 1993.
2. T. Agerwala, J. L. Martin, J. H. Mizra, D. C. Sadler, D. M. Dias and M. Snir, ‘SP2 system
architecture’, IBM Syst. J., 34(2), 152–184 (1995).
3. MPI Forum, ‘Document for a standard message passing interface’, Technical Report CS-93-214,
University of Tennessee, November 1993.
4. P. Brinch Hansen, ‘The programming language Concurrent Pascal’, IEEE Trans. Softw. Eng.,
1(2), 199–206 (1975).
5. Parallel Computing Forum, ‘PCF Parallel FORTRAN extensions’, Special issue, FORTRAN
Forum, 10(3), (1991).
6. IBM, Parallel Fortran Language and Library Reference, March 1988. Pub. No. SC23-0431-0.
7. S. Flynn Hummel and R. Kelly, ‘A rationale for massively parallel programming with sets’,
J. Program. Lang., 1, (1993).
8. I. Foster, R. Olsen and S. Tuecke, ‘Programming in Fortran M, version 1.0’, Technical Report,
Argonne National Laboratory, October 1993.
9. H. F. Jordan, M. S. Benten, G. Alaghband and R. Jakob, ‘The force: A highly portable parallel
programming language’, in E. C. Plachy and Peter M. Kogge (Eds.), Proc. 1989 International
Conf. on Parallel Processing, vol. II, St. Charles, IL, August 1989, pp. II-112–II-117.
10. High Performance Fortran Forum, ‘High Performance Fortran language specification, version
1.0’, Technical Report CRPC-TR92225, Rice University, May 1993.
11. B. Joy, J. Gosling and G. Steele, The Java Language Specification, Addison-Wesley, 1996.
12. T. Lindholm and F. Yellin, The Java Virtual Machine Specification, Addison-Wesley, 1997.
13. Leslie Lamport, ‘How to make a multiprocessor computer that correctly executes multiprocess
programs’, IEEE Trans. Comput., C-28(9), 690–691 (1979).
14. Michel Dubois, Christoph Scheurich and Faye Briggs, ‘Memory access buffering in multipro-
cessors’, in Conf. Proc. 13th Annual International Symp. on Computer Architecture, Tokyo,
June 1986, pp. 434–442.
15. Manish Gupta, Edith Schonberg and Harini Srinivasan, ‘A unified data-flow framework for
optimizing communication’, IEEE Trans. Parallel Distrib. Syst., 7 (7), (1996).
16. S. P. Amarasinghe and M. S. Lam, ‘Communication optimization and code generation for
distributed memory machines’, in Proc. ACM SIGPLAN ’93 Conference on Programming
Language Design and Implementation, Albuquerque, NM, June 1993.
17. Ada 95 Rationale, Intermetrics Inc, 1995.
18. S. Flynn Hummel, R. B. K. Dewar and E. Schonberg, ‘A storage model for Ada on hierarchical-
memory multiprocessors’, in A. Alverez (Ed.), Proc. of the Ada-Europe Int. Conf., Cambridge
University Press, 1989.
19. Harini Srinivasan and Michael Wolfe, ‘Analyzing programs with explicit parallelism’, in Utpal
Banerjee, David Gelernter, Alexandru Nicolau and David A. Padua (Eds.), Languages and
Compilers for Parallel Computing, Springer-Verlag, 1992, pp. 405–419.
20. Robert Strom and Kenneth Zadeck, personal communication, March 1996.
21. D. Lea, Concurrent Programming in Java, Addison-Wesley, 1997.
22. S. Flynn Hummel, I. Banicescu, C. Wang and J. Wein, ‘Load balancing and data locality
via fractiling: An experimental study’, in Boleslaw K. Szymanski and Balaram Sinharoy (Eds.),
Proc. Third Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers,
Kluwer Academic Publishers, Boston, MA, 1995, pp. 85–89.
23. G. Blelloch, ‘Scan primitives and parallel vector models’, Ph.D. Dissertation MIT/LCS/TR-463,
MIT, October 1989.