Object Oriented Parallel Programming
Highlights
• The PObC++ language implements the concept of Object-Oriented Parallel Programming (OOPP).
• OOPP reconciles distributed-memory parallel programming with OO programming principles.
• OOPP separates concerns about inter-object and inter-process communication.
• OOPP makes it possible to encapsulate distributed-memory parallel computations in objects.
• The performance of PObC++ programs is close to that of C++/MPI programs.
1. Introduction
The improved cost-effectiveness of parallel computing platforms for High Performance Computing (HPC), due to the success of off-the-shelf distributed-memory parallel computing platforms, such as Clusters [1] and Grids [2], has motivated the emergence of new classes of applications from computational sciences and engineering. Besides high performance requirements, these applications impose stronger requirements of modularity, abstraction, safety and productivity on the existing parallel programming tools [3]. Unfortunately, parallel programming is still hard to incorporate into the usual large-scale software development platforms designed to deal with such requirements [4]. Also, automatic parallelization is useful only in restricted contexts, such as scientific computing libraries [5]. Skeletal programming [6], a promising alternative for high-level parallel programming, has not achieved the expected acceptance [7]. These
days, libraries of message-passing subroutines that conform to the MPI (Message Passing Interface) standard [8] are
widely adopted by parallel programmers, offering expressiveness, portability and efficiency across a wide range of parallel
computing platforms. However, they still present a low level of abstraction and modularity in dealing with the requirements
of the emerging large scale applications in HPC domains.
In the context of corporative applications, object-oriented programming (OOP) has been consolidated as the main
programming paradigm to promote development productivity and software quality. Object-orientation is the result of
two decades of research in programming tools and techniques motivated by the need to deal with the increasing levels
of software complexity since the software crisis context of the 1960s [9]. Many programming languages have been designed
to support OOP, such as C++, Java, C#, Smalltalk, Ruby, and Objective-C. Despite their success in the software industry, object-oriented languages are not popular in HPC, which is dominated by Fortran and C, largely because of the performance costs attributed to the high level of abstraction and modularity offered by these languages. When parallelism comes onto the scene, the situation is worse, due to the lack of safe ways to incorporate explicit message-passing parallelism into these languages without breaking important principles, such as the functional independence of objects and their encapsulation.
This paper presents PObC++ (Parallel Object C++), a new parallel extension to C++ which implements the ideas behind
OOPP (Object Oriented Parallel Programming), a style of parallel programming in which objects are intrinsically parallel, and thus deployed across a set of nodes of a distributed-memory parallel computer, and communication is distinguished in two layers:
intra-object communication, for common process interaction by message-passing, and inter-object communication, for usual
object coordination by method invocations. In OOPP, objects are called p-objects (parallel objects). The decision to support
C++ comes from the wide acceptance of C++ in HPC. However, OOPP might be supported by other OO languages, such as
Java and C#. The main premise that guides the design of PObC++ is the preservation of basic object-orientation principles
while introducing a style of programming based on message-passing, inheriting the well-known programming practices
using MPI (Message Passing Interface) [8].
Section 2 discusses the current context regarding message-passing parallel programming and object-oriented
programming, as well as their integration. Section 3 presents the main premises and concepts behind OOPP, showing how
it is supported by PObC++. This section ends by presenting the overall architecture of the current prototype of the PObC++
compiler. Section 4 presents three case studies of PObC++ programming, aimed at giving evidence of the expressiveness,
programming productivity, and the performance of OOPP. Finally, Section 5 presents our conclusions, describes ongoing research, and outlines ideas for further research initiatives.
This work attempts to bring together two widely accepted programming techniques in a coherent way:
• Message-Passing (MP), intended for HPC applications, which have stronger performance requirements as the main driving
force, generally found in scientific and engineering domains;
• Object-Orientation (OO), intended for large-scale applications, which have stronger productivity requirements for
development and maintenance, generally found in corporative domains.
The following sections review concepts of the two above programming techniques that are important in the context of
this work, also providing a discussion about the strategies that have been applied for their integration (related works).
MPI is a standard specification for libraries of subroutines for message-passing parallel programming that are portable
across distributed-memory parallel computing platforms [8]. MPI was developed in the mid 1990s by a consortium inte-
grating representatives from academia and industry, interested in a message-passing interface that could be implemented
efficiently in virtually any distributed parallel computer architecture, replacing the myriad of proprietary interfaces devel-
oped at that time by supercomputer vendors for the specific features and needs of their machines. It was observed that such
diversity results in higher costs for users of high-end parallel computers, due to the poor portability of their applications
between architectures from distinct vendors. Also, the lack of standard practices hinders the technical evolution and dissemination of computer architectures and programming techniques for parallel computing. MPI was initially proposed as a kind
of parallel programming ‘‘assembly’’, on top of which specific purpose, higher-level parallel programming interfaces could
be developed, including parallel versions of successful libraries of subroutines for scientific computing and engineering.
However, MPI is now mostly used to develop final applications. The MPI specification is now maintained by the MPI Forum
(http://www.mpi-forum.org).
MPI is now the main representative of message-passing parallel programming. Perhaps it is the only parallel
programming interface, both portable and general purpose, to efficiently exploit the performance of high-end distributed
parallel computing platforms. Since the end of the 1990s, any newly installed cluster or MPP has supported some
implementation of MPI. In fact, most vendors of parallel computers adopt MPI as their main programming interface, offering
highly optimized implementations for their architectures. MPI is also considered to be one of the main reasons for the
increase in popularity of cluster computing, due to the availability of efficient open-source implementations for Linux-based
platforms. MPI has become popular even in shared memory parallel computers, since its wide acceptance among parallel
programmers is seen as a way of reducing the learning curve of parallel programming for these architectures.
Many free implementations of MPI, most of them open-source, have been developed, supporting a wide range of
computer platforms, such as MPICH, OpenMPI, LAM-MPI, and MS-MPI. In addition, unofficial bindings of MPI for languages not covered by the official specification have been implemented, such as Boost.MPI (C++), MPI.NET (C#), and JavaMPI (Java).
Two versions of the MPI specification have been proposed by the MPI forum, officially specified in Fortran and C: MPI-1
and MPI-2. MPI-2 extends MPI-1 with many innovations proposed by the community of MPI users. Hundreds of subroutines
are supported, with various purposes, enumerated below:
• point-to-point communication (MPI-1);
• collective communication (MPI-1);
• communication scopes (MPI-1);
• process topologies (MPI-1);
• data types (MPI-1);
• one-sided communication (MPI-2);
• dynamic process creation (MPI-2);
• parallel input and output (MPI-2).
For the purposes of this paper, it is relevant to provide only an overview of the MPI programming model. A complete description of its subroutines can be obtained from the official specification document, available at the MPI Forum website. There are also many tutorials publicly available on the web. The subsets most relevant to this work are point-to-point/collective communication and communication scopes. In the following sections, we give a brief overview of these subsets of subroutines.
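To make the programming model concrete, the following minimal sketch (our own example, not taken from the MPI specification; error handling omitted) combines a point-to-point exchange with a collective reduction, both within the default communication scope MPI_COMM_WORLD:

   #include <mpi.h>
   #include <cstdio>

   int main(int argc, char** argv) {
       MPI_Init(&argc, &argv);

       int rank, size;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // identification of this process
       MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processes in the scope

       // point-to-point: process 0 sends a value to process 1
       if (rank == 0 && size > 1) {
           int value = 42;
           MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
       } else if (rank == 1) {
           int value;
           MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
       }

       // collective: sum the ranks of all processes at process 0
       int sum = 0;
       MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
       if (rank == 0) std::printf("sum of ranks = %d\n", sum);

       MPI_Finalize();
       return 0;
   }

Every operation is issued against a communicator (here MPI_COMM_WORLD), the abstraction that PObC++ later reuses for intra-object communication.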
Object-orientation is an influential data abstraction mechanism whose basis was introduced in the mid 1960s, with the
Simula’67 programming language [10,11]. Following Simula’67, the most prominent object-oriented language was Smalltalk
[12], developed at Xerox PARC in the 1970s. The designers of Smalltalk adopted the pervasive use of objects as a computation
basis for the language, being the first to coin the term object-oriented programming (OOP). During the 1990s, OOP became
the mainstream in programming, mostly influenced by the rise in popularity of graphical user interfaces (GUIs), where OOP techniques were extensively applied. However, the interest in OOP rapidly surpassed its use in GUIs, as software
engineers and programmers recognized the power of OOP principles in dealing with the increasing complexity and scale of
software. Today, the most used OOP languages are C++, Java, and C#.
Modern object-oriented languages are powerful programming artifacts. Often, their rich syntax, complex semantics, and
comprehensive set of libraries hide the essential principles of object-orientation. In this section, we present the essential
characteristics of object-oriented imperative languages, in their pure sense, focusing on the ones that are relevant for the
purposes of this paper.
2.2.1. Objects
In an imperative programming setting, a pure object is a runtime software entity consisting of the following parts: an internal state, stored in a set of attributes (variables); a set of methods (procedures) that query and update this state; and an interface, formed by the signatures of its methods, through which other objects may interact with it.
2.2.2. Encapsulation
The most primitive principle behind object-oriented programming (OOP) is encapsulation, also called information hiding,
which states that an object which knows the interface of another object does not need to make assumptions about its internal
details to use its functionality. It only needs to concentrate on the interface of the objects it depends on. In fact, an OOP
language statically prevents an object from accessing the internal state of another object, by exposing only its interface.
Encapsulation prevents programmers from concentrating on irrelevant details about the internal structure of a particular
implementation of an object. In fact, the implementation details and attributes of an object may be completely modified
without affecting the parts of the software that depend on the object, provided its interface, as well as its behavior, is
preserved. In this sense, encapsulation is an important property of OOP in dealing with software complexity and scale.
More importantly, encapsulation brings to programmers the possibility of working at higher levels of safety and security, by
allowing only essential and valid accesses to be performed on critical subsets of the program state.
2.2.3. Classes
A class is defined as a set of similar objects, presenting a set of similar attributes and methods. Classes may also be
introduced as the programming-time counterparts of objects, often called prototypes or templates, specifying the attributes
and methods that objects instantiated from them must carry at run time.
Let A be a class with a set of attributes α and a set of methods µ. A programmer may derive a new class from A, called A′, with a set of attributes α′ and a set of methods µ′, such that α ⊆ α′ and µ ⊆ µ′. This is called inheritance [14]. A is a superclass (generalization) of A′, whereas A′ is a subclass (specialization) of A. By the substitution principle, an object of class A′ can be used in a context where an object of class A is required. Thus, in a good design, all the valid internal states and state transformations of A are also valid in A′. Such a safety requirement cannot be enforced by the usual OOP languages.
Inheritance of classes can be single or multiple. In single inheritance, a derived class has exactly one superclass, whereas
in multiple inheritance a class may be derived from a set of superclasses. Modern OOP languages, such as Java and C#, have abolished multiple inheritance, which is still supported by C++, adopting instead the single-inheritance mechanism originally supported by Smalltalk. To deal with the use cases of multiple inheritance, Java introduced the notion of interface. An interface declares a set
of methods that must be supported by objects that implement it. Interfaces define a notion of type for objects and classes.
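As a minimal C++ illustration of these mechanisms (the class names are ours, not taken from the paper), single inheritance and an abstract class playing the role of an interface may be sketched as follows:

   #include <iostream>

   // an "interface": an abstract class with only pure virtual methods
   struct Printable {
       virtual void print() const = 0;
       virtual ~Printable() = default;
   };

   class Account : public Printable {          // superclass (generalization)
   public:
       explicit Account(double b) : balance(b) {}
       void print() const override { std::cout << "balance: " << balance << '\n'; }
   protected:
       double balance;
   };

   class SavingsAccount : public Account {     // subclass (specialization)
   public:
       SavingsAccount(double b, double r) : Account(b), rate(r) {}
       void accrue() { balance += balance * rate; }
   private:
       double rate;
   };

   void show(const Printable& p) { p.print(); }  // accepts any implementation

By the substitution principle discussed above, a SavingsAccount object may be passed to show wherever a Printable (or an Account) is expected.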
2.2.4. Abstraction
Classes and inheritance bring four important abstraction mechanisms to OOP [14]:
• Classification/instantiation constitutes the essence of the use of classes. As already defined, classes group objects with
similar structure (methods and attributes). Objects represent instances of classes.
• Aggregation/decomposition comes from the ability to have objects as attributes of other objects. Thus, a concept represented by an object may be described by its constituent parts, also defined as objects, forming a recursive hierarchy of objects that represents the structure behind the concept.
• Generalization/specialization comes from inheritance, making it possible to recognize commonalities between different
classes of objects by creating superclasses from them. Such an ability makes possible a kind of polymorphism that is typical
in modern OO languages, where an object reference, or variable, that is typed with a class may refer to an object of any
of its subclasses.
• Grouping/individualization is supported by collection classes, which allow objects with common interests to be grouped together according to the application needs. With polymorphism, such collections may hold objects of distinct classes that are related by inheritance.
2.2.5. Modularity
Modularity is a way of managing complexity in software, by promoting the division of large scale and complex systems
into collections of simple and manageable parts. There are some accepted criteria for classifying the level of modularity achieved by a programming method: decomposability, composability, understandability, continuity, and protection [15].
OOP promotes the organization of the software in classes from which the objects that perform the application will be
instantiated at run time. In fact, classes will be the building blocks of OOP software. In a good design, classes capture simple
and well-defined concepts in the application domain, orchestrating them to perform the application in the form of objects
(decomposability). Classes promote the reuse of software parts, since the concept captured by a class of objects may be
present in several applications (composability). Indeed, abstraction mechanisms make it possible to reuse only those class parts that are common between objects in distinct applications. Encapsulation and a high degree of functional independence
promote independence between classes, making it possible to understand the meaning of a class without examining the code
of other classes it depends on (understandability). Also, they avoid the propagation of modifications in the requirements of
a given class implementation to other classes (continuity). Finally, exception mechanisms make it possible to restrict the scope of the effect of an error condition at runtime (protection).
2.4. Contributions
From the above context, the authors argue that attempts to reconcile distributed-memory parallel programming
and object-oriented languages break important object-orientation principles and/or do not reach the level of flexibility,
generality and high performance of parallel programming using the MPI standard. Concerning these problems, this paper
includes the following contributions:
• An alternative perspective of object-oriented programming where objects are parallel by default, called OOPP (Object
Oriented Parallel Programming);
• The design of a language based on OOPP, called PObC++ (Parallel Object C++), demonstrating the viability of OOPP as a
practical model;
• A prototype of PObC++, which may be used to validate OOPP usage and performance;
• A comparison between the performance of a PObC++ program and the performance of its C++/MPI (non-OO) counterpart, which provides evidence that object-orientation does not add significant overheads;
• Discussions about programming techniques behind OOPP, using a set of selected case studies.
We found that the main reason for the difficulties in reconciling object-orientation and distributed-memory parallel
programming lies in the usual practice of mixing concerns and processes in the same dimension of software decomposition
[32]. In distributed-memory parallel programming, concerns that must be implemented by a team of processes (parallel
concerns) are common. Consequently, an individual object, which addresses some application concern, must be distributed, which means that it must be located at a set of nodes of the parallel computing platform. In the usual practice, an object
is always located in the address space of a single node, and teams of objects are necessary to address parallel concerns.
On the left-hand side of Fig. 3 (‘‘BY PROCESSES’’), the common practice of parallel programming in OOP languages is
illustrated, where individual objects execute in a single address space and parallel concerns are implemented by teams of
objects that communicate through either message-passing or remote method invocations. In the latter approach, there is
no clear distinction between messages for parallel interaction and object coordination. Moreover, these kinds of client–
server relations are not appropriate for communication among parallel interacting peers. In the former approach, parallel
interaction is a clandestine form of object coordination, by using some low-level communication library, such as MPI or
Sockets, possibly breaking the encapsulation of objects and reducing their functional independence.
On the right-hand side of Fig. 3 (‘‘BY CONCERNS’’), the practice that we argue to be the best suited one for distributed-
memory parallel programming with OOP languages is illustrated. It is the base of the Object-Oriented Parallel Programming
(OOPP), the approach we are proposing. Objects that cooperate to implement a parallel concern now constitute a parallel
object, here referred to as p-object. Each individual object is a unit of the p-object. Application concerns are now encapsulated
in a p-object, where parallel interactions are no longer clandestine. In fact, parallel interaction and object coordination are
distinguished at different hierarchical levels, leading to the concepts of intra-object and inter-object communication. Intra-
object communication may be performed using message-passing, which is better suited for parallel interaction between
peer units, whereas inter-object communication may use local method invocations.
From the above considerations, we argue that a fully concern-centric decomposition approach improves the functional
independence of objects, now parallel objects, by eliminating the additional coupling of objects and classes which result
from a process-centric decomposition approach. We propose a language for OOPP, called PObC++, a parallel extension to
C++, implemented on top of MPI for enabling process creation, communication, and synchronization. C++ is adopted because
it is widely accepted and disseminated among parallel programmers due to its high performance, mainly in computational
sciences and engineering. However, the parallel concepts introduced in C++ may be easily introduced to Java or C#, the two
mainstream programming languages in general application domains.
PObC++ supports a parallel programming style inspired by the MPI standard, which has been widely accepted among HPC
programmers since the mid 1990s. It may be distinguished from other object-oriented parallel programming alternatives
in the following aspects:
• objects keep atomicity of concerns, since each unit of a p-object can address the role of a process with respect to a concern
that is implemented in parallel;
• objects send messages to other objects only by usual method invocations, avoiding clandestine communication through
low-level message passing as in existing approaches;
• fully explicit parallelism is supported by means of an explicit notion of process and intra-object message-passing
communication, providing full control over typical parallel programming responsibilities (load balancing, data locality,
and so on).
The following subsections introduce the main concepts and abstractions behind OOPP, using simple examples in PObC++.
PObC++ attempts to reconcile full explicit message-passing parallel programming with object-orientation, by introducing
the smallest possible set of new concepts and abstractions. For this reason, pragmatic decisions have been made for
supporting MPI without breaking the principles behind OOPP. We think that such an approach may lead to a better learning
curve for new users of PObC++. Further works will study how to improve OOPP, making it more attractive to parallel
programmers.
A Parallel Object (p-object) is defined by a set of units, each one located at a processing node of a distributed-memory
parallel computer. A p-object is an object in the pure sense of object-oriented programming, addressing some application
concern and communicating with other objects through method invocations. Distinct p-objects of an application may be
located at distinct subsets of processing nodes of the parallel computer, overlapped or disjoint.
The state of a p-object (global state) is defined by a set whose elements are the states of each one of its units (local states).
Local states are defined just as the states of single objects (Section 2.2).
A p-object may accept parallel methods and singleton methods. Singleton methods are accepted by individual units of
the p-object. In turn, a parallel method is accepted by a subset of the units of the p-object. Let A and B be p-objects, such that units of A, the caller units, perform an invocation to a parallel method m of B, accepted by the callee units. Each caller unit of A must be located at the same processing node as the corresponding callee unit of B. In a parallel method invocation, message-passing operations
may be used for synchronization and communication between the callee units of the p-object.
Fig. 4 illustrates parallel method invocations (solid arrows) and singleton method invocations (dashed arrows). The calls
to um1, from A to B, and um2, um3, and um4, from C to D, illustrate singleton method invocations. The p-object A performs
calls to the parallel methods pm1 and pm2, respectively accepted by the p-objects B and C. Notice that both B and C are
located at a subset of the processing nodes where A is located, in such a way that the pairs of units involved in a method
call (a2 /b1 , a2 /c2 , a3 /b2 , and a4 /c3 ) are placed in the same processing node. Therefore, method invocations are always
local, involving units of distinct p-objects placed inside the same processing node. This is inter-object communication, which
makes possible coordination between the objects for achieving application concerns. In turn, inter-process communication
is always encapsulated inside a p-object, by message-passing among its units. This is intra-object communication, which aims
to implement inter-process communication patterns of parallel programs.
A Parallel Class (p-class) represents a prototype for a set of p-objects with common sets of units, methods and attributes,
distinguished only by their execution state. Fig. 5 illustrates the structure of a parallel class in PObC++, introducing the unit abstraction. Also, it introduces the possible syntactical scopes in a p-class declaration: class scope and unit scope. Fig. 6 exemplifies the PObC++ syntax for p-classes.

   class Hello1 {
      /* parallel class method signature */
      public: void sayHello();

      unit a {
         /* parallel class method implementation */
         void sayHello() {
            cout << "Hello! I am unit a";
         }
      }

      unit b {
         /* parallel class method implementation */
         void sayHello() {
            cout << "Hello! I am unit b";
         }
      }

      parallel unit c {
         /* parallel class method implementation */
         void sayHello() [Communicator comm] {
            int rank = comm.getRank();
            cout << "Hello! I am the "
                 << rank << "-th unit c";
         }
      }

      parallel unit d {
         /* parallel class method implementation */
         void sayHello() {
            int rank = comm.getRank();
            cout << "Hello! I am the "
                 << rank << "-th unit d";
         }

         public:
         /* parallel unit method implem. */
         parallel void
         sayBye() [Communicator my_comm] {
            int rank = my_comm.getRank();
            cout << "Bye! I am the "
                 << rank << "-th unit d";
         }
      }
   }

   (a)

   class Hello2 {
      private: int i;            // class attribute

      /* parallel method with default
         implementation outside class */
      public:
         void sayHello();

      unit a {
         public: double n1;      // unit attribute
      }

      unit b {
         private: double n2;     // unit attribute
         public:  double n1;     // unit attribute
      }

      unit c {
         public:
            double n1;           // unit attribute

         /* singleton unit method */
         int getMy_i() {
            return i++;
         }
      }
   }

   /* parallel method default implementation */
   void Hello2::sayHello() {
      cout << "I am some unit of p-class Hello";
   }

   /* parallel method implementations */
   void Hello2::b::sayHello() {
      cout << "Hello! I am unit b";
   }

   void Hello2::c::sayHello() {
      cout << "Hello! I am unit c";
   }

   (b)

Fig. 6. PObC++ syntax for p-classes: (a) the p-class Hello1; (b) the p-class Hello2.
Units of a p-class may be singleton units or parallel units. In the instantiation of a p-class, only one instance of a singleton
unit will be launched in a processing node, whereas an arbitrary number of instances of a parallel unit may be launched,
each one in a distinct processing node. A reader who is familiar with parallel programming will find that parallel units capture
the essence of SPMD programming.
An attribute may be declared in the unit scope or in the class scope. In the former case, it is called a unit attribute, whose
instance must exist only in the address space of the unit where it is declared. In the latter case, it is called a class attribute,
which must have an independent instance in the address space of each unit of the p-object. In fact, a class attribute is a
syntactic sugar for a unit attribute declared in each unit of the p-object with the same type and name.
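As a minimal sketch of this equivalence (the p-class name is illustrative, not taken from the paper), the first declaration below is shorthand for the second:

   class Counter {              // class attribute: one independent copy per unit
      private: int i;
      unit a { }
      unit b { }
   }

   class Counter {              // equivalent form with explicit unit attributes
      unit a { private: int i; }
      unit b { private: int i; }
   }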
Methods may also be declared in the class scope or in the unit scope. In the former case, they are parallel class methods,
which are accepted by all the units of a p-object. In the latter case, they are either singleton unit methods, which are accepted
by individual units, or parallel unit methods, which are accepted by parallel units. Parallel unit methods are declared in the
scope of parallel units using the parallel modifier.
The reader may notice that parallel class methods and parallel unit methods correspond to the parallel methods of p-objects
discussed in Section 3.1, where communication and synchronization among the units of a p-object take place. In turn, their singleton methods relate to the singleton unit methods of PObC++ classes. To enforce these restrictions in programming, only
parallel methods have access to communicators in their scope. A communicator is a special kind of object that provides an
interface for communication and synchronization among units of a p-object in the execution of parallel methods.
An implementation of a parallel class method, possibly distinct, must be provided in the scope of each unit of the
p-class. Alternatively, a default implementation may be provided in the class scope, which may be overridden by specific
implementations provided in the scope of one or more units. Default implementations of parallel class methods have
access only to class attributes, whereas methods declared in the unit scope, singleton or parallel, may access class and unit
attributes.
In Fig. 6(a), the p-class Hello1 declares four units: a, b, c, and d. The first two are singleton units, whereas the last two are
parallel ones. The parallel keyword announces a parallel unit. Hello1 has a parallel class method, named sayHello, without a
default implementation. Therefore, it is implemented by each unit. There is also an example of parallel unit method, named
sayBye, declared by unit d. The code of sayBye makes reference to a communicator object received as a special argument
using the brackets notation, named my_comm. In the code of sayHello, the declaration of the communicator argument is
implicit. In such a case, it may be referenced using the keyword comm. The implicit declaration of communicator objects is a kind of syntactic sugar, since most parallel methods will use a single communicator object. In a call to a parallel method, the communicator object may also be passed implicitly if there is only one communicator object, implicitly defined, in the scope.
Section 3.3 will provide more details about the semantics and use of communicators.
The p-class Hello2, in Fig. 6(b), illustrates attributes and methods declared in class and unit scopes. For instance, a copy
of the class attribute i exists in each unit a, b, and c. They are independent variables, and can be accessed and updated
locally, by class and unit methods. Notice that a double-precision floating-point unit attribute n1 is declared in each unit
scope. n2 is another unit attribute, but accessible only in the scope of unit b. The parallel class method sayHello now has
a default implementation, defined outside the class as recommended by C++ programming conventions (C++ syntax also
allows definitions inside the class). The default implementation of sayHello is overridden by specialized implementations
in units b and c, also defined outside the class declaration. Indeed, only the unit a will execute the default implementation
of sayHello. Finally, the method getMy_i is a singleton unit method in the scope of c. Thus, it has access to the class attribute
i and to the unit attribute n1.
3.2.1. Inheritance
Inheritance between p-classes in PObC++ is similar to inheritance between usual C++ classes. The unique peculiarity is
the requirement of unit preservation, which states that the units of a subclass are all those inherited from the superclass.
This is illustrated in Fig. 7.
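Since Fig. 7 is not reproduced here, the sketch below illustrates the idea with hypothetical p-classes; the exact syntax for refining the inherited units in a subclass is an assumption, but the point is that the subclass declares no unit besides the units a and b inherited from its superclass:

   class Greeter {                          // superclass with units a and b
      public: void sayHello();
      unit a { void sayHello() { cout << "Hello from a"; } }
      unit b { void sayHello() { cout << "Hello from b"; } }
   }

   class PoliteGreeter : public Greeter {   // subclass: the same units a and b
      public: void sayGoodbye();            // new behavior added to both units
      unit a { void sayGoodbye() { cout << "Goodbye from a"; } }
      unit b { void sayGoodbye() { cout << "Goodbye from b"; } }
   }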
Adding or removing units of a superclass may violate the type substitution principle of type systems that support subtyping, which states that an object of a class may be used in any context where an object of one of its superclasses is
required. For instance, let A be a p-class with n distinct units and let A′ be another p-class inherited from A by introducing an
additional unit, distinct from the other ones. Thus, A′ has n + 1 distinct units. Now suppose that a p-class B, having n distinct
units, declares a class attribute v of type A. In execution, a p-object of B instantiates a p-object of A for the attribute v , by
calling the constructor of each unit of A in each one of the n units of B. Now, suppose that one decides to replace the p-object
of A by a p-object of A′ for the variable v . According to the type substitution principle, this would be a safe operation, since A′
is a subtype of A, a consequence of inheritance. But, this is not possible in the described scenario, since the p-object of A′ has
n + 1 units, one more unit than the old p-object of A. Therefore, it is not possible to determine a distinct processing node
inside the p-object of B to place the additional unit.
The number of units actually instantiated for a parallel unit, as well as information about the topological organization of such units,
may be fetched by invoking the methods of one or more communicators that may be provided by the caller, as illustrated in
the example introduced in the next section. This is also valid for parallel class methods.
In OOPP, the orthogonalization between concerns, encapsulated in p-objects, and processes, results in a clean separation
between two types of messages:
• inter-object communication: messages exchanged between parallel objects, implementing the orchestration among the
set of application concerns, concretely carried out by p-objects, in order to implement the overall application concern. In
general, such messages are carried out by means of method invocations, defining a client–server relationship between
p-objects;
• intra-object communication: messages exchanged among the units of p-objects, usually by means of message-passing,
defining peer-to-peer relationships among the units of a p-object. Such messages define the interactions among application processes, required by most parallel algorithms.
In the usual parallel approaches of OOP languages, there is no clear distinction between these kinds of messages. As a
consequence, one of the following approaches is adopted:
• parallel synchronization and communication are implemented by means of method invocations between objects, which
is inappropriate for parallel programming, since method invocations lead to client–server relationships between pairs
of processes, or pairs of subsets of processes, whereas most of the parallel algorithms assume peer-to-peer relationships
among them; or
• low-level message passing between objects, defining a communication backdoor for clandestine interaction between
objects, resulting in low modularity and high coupling among the objects that implement a parallel concern.
The well-known M × N coupling problem [35] leads to convincing arguments about the inappropriateness of the first
approach. Let M and N be two sets of processes residing in disjoint sets of processing nodes, probably with different cardinalities, that want to communicate some data structure whose distribution differs between the two sets of processes. If
each process is implemented as an object, following the usual ‘‘BY PROCESSES’’ perspective of Fig. 3, the logic of the
communication interaction needed to exchange data between the objects of each set tends to be scattered across all the
objects, with some objects playing the role of the client side and others playing the role of the server side. Moreover, many
methods may be necessary to control the data transfers back and forth between the two sides of the M × N coupling of
processes. Using OOPP, such coupling could be simply implemented by a single p-object with M + N units that encapsulate
all the coupling logic using peer-to-peer message-passing communication operations.
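A skeleton of such a p-object might look as follows (a sketch only: the class, unit and method names are ours, and the redistribution logic is summarized in comments):

   class MxNCoupler {
      public: void redistribute();     // parallel class method: all units take part

      parallel unit producer {         // the M processes owning the source distribution
         /* redistribute(): pack the locally owned piece of the data structure and
            send it, via the communicator of the parallel method, to the consumer
            units that need it. */
      }

      parallel unit consumer {         // the N processes owning the target distribution
         /* redistribute(): receive the pieces sent by the producer units and
            assemble them according to the local target distribution. */
      }
   }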
With respect to the second approach, objects tend to lose their functional independence. Therefore, they cannot
be analyzed in isolation, breaking important modularity principles behind object-orientation that were introduced in
Section 2.2.
   class MatrixMultiplier
   {
      public: void distribute();
              int* collect();

      unit manager {
         private:
            int *a, *b, *c, n;
         public:
            void set_ab(int n_, int* a_, int* b_)
            { n = n_; a = a_; b = b_; }
      }

      parallel unit cell {
         private: int a, b, c = 0;
                  int i, j, n;

         void calculate_ranks_neighbors
              (CartesianCommunicator,
               int, int, int*, int*,
               int*, int*);

         public: parallel int* compute();
      }
   }

   class Main
   {
      public:
         int main();
      private:
         Communicator comm_data;
         CartesianCommunicator create_comm_compute();

      unit root
      {
         int main() [Communicator world_comm]
         {
            MatrixMultiplier::manager *mm
               = new MatrixMultiplier::manager();

            comm_data = world_comm.clone();
            create_comm_compute();

            mm->distribute() [comm_data];
            c = mm->collect() [comm_data];
         }
      }

      parallel unit peer
      {
         private:
            CartesianCommunicator comm_compute;

         int main() [Communicator world_comm]
         {
            MatrixMultiplier::cell *mm
               = new MatrixMultiplier::cell();

            comm_data = world_comm.clone();
            comm_compute = create_comm_compute();

            mm->distribute() [comm_data];
            mm->compute() [comm_compute];
            mm->collect() [comm_data];
         }
      }
   }

   (a)

   void MatrixMultiplier::manager::distribute()
      [Communicator comm]
   {
      int fool_a, fool_b;
      comm.scatter(a - 1, &fool_a, rankof(manager));
      comm.scatter(b - 1, &fool_b, rankof(manager));
   }

   int* MatrixMultiplier::manager::collect()
      [Communicator comm]
   {
      comm.gather(-1, c, rankof(manager));
      return c + 1;
   }

   void MatrixMultiplier::cell::distribute()
      [Communicator comm]
   {
      comm.scatter(&a, rankof(manager));
      comm.scatter(&b, rankof(manager));
   }

   int* MatrixMultiplier::cell::collect()
      [Communicator comm]
   {
      comm.gather(&c, 1, rankof(manager));
      return &c;
   }

   void MatrixMultiplier::cell::compute()
      [CartesianCommunicator comm]
   {
      int west_1_rank, east_1_rank,
          north_1_rank, south_1_rank;
      int west_n_rank, east_n_rank,
          north_n_rank, south_n_rank;

      int i = comm.coordinates[0];
      int j = comm.coordinates[1];

      calculate_ranks_neighbors(comm, i, j,
         &west_n_rank, &east_n_rank,
         &north_n_rank, &south_n_rank);

      // initial alignment
      comm.isend<int>(west_n_rank, 0, &a, 1);
      comm.isend<int>(north_n_rank, 0, &b, 1);
      comm.recv<int>(east_n_rank, 0, &a, 1);
      comm.recv<int>(south_n_rank, 0, &b, 1);

      calculate_ranks_neighbors(comm, 1, 1,
         &west_1_rank, &east_1_rank,
         &north_1_rank, &south_1_rank);

      // start systolic calculation
      c += a * b;
      for (int k = 0; k < n - 1; k++)
      {
         comm.isend<int>(east_1_rank, 0, &a, 1);
         comm.isend<int>(south_1_rank, 0, &b, 1);
         comm.recv<int>(west_1_rank, 0, &a, 1);
         comm.recv<int>(north_1_rank, 0, &b, 1);
         c += a * b;
      }
   }

   (b)

Fig. 8. The MatrixMultiplier and Main p-classes: (a) p-class declarations; (b) parallel method implementations.
   CartesianCommunicator
   Main::root::create_comm_compute()
      [Communicator my_comm]
   {
      Group group_all = my_comm.group();
      Group group_peers
         = group_all.exclude(ranksof(root));
      my_comm.create(group_peers);

      /* the returned communicator is a
         null communicator, since root
         is not in group_peers */
      return null;
   }

   (a)

   CartesianCommunicator
   Main::peer::create_comm_compute()
   {
      Group group_all = comm.group();
      Group group_peers
         = group_all.exclude(ranksof(root));
      Communicator comm_peers
         = comm.create(group_peers);

      int size = comm_peers.size();
      int dim_size = sqrt(size);
      int[2] dims = { dim_size, dim_size };
      bool[2] periods = { true, true };

      return
         new CartesianCommunicator
            (comm_peers, 2, dims,
             periods, false);
   }

   (b)

Fig. 9. create_comm_compute.
The design of communicators in PObC++ follows state-of-the-art object-oriented versions of the MPI interface, such as MPI.NET [21] and Boost.MPI [20], both proposed by the same research group at Indiana University.
Let A be a p-object and let B be a p-object that holds a reference to A: either A has been instantiated by B, or a reference to A has been passed to B in a call to one of its parallel methods. Thus, B may perform invocations to parallel methods of A. On each parallel
invocation, B must provide a communicator, which may be either instantiated by B using the PObC++ API or received from
another p-object. Communicators can be reused in distinct parallel invocations, possibly applied to distinct parallel methods
of distinct p-objects.
The creation of communicators is illustrated in Fig. 8, where two communicators are created: comm_data, for invocations
of parallel class methods distribute and collect; and comm_compute, for invocations of the parallel unit method compute.
The communicator comm_data groups all the processes (units of the Main class). For this reason, it is created by
cloning the global communicator (world_comm), which is automatically passed to the main function of each process. The
communicator comm_compute is a cartesian communicator, created in an invocation to the private parallel class method
create_comm_compute, whose code is presented in Fig. 9. It involves only the group of cell units. Using the MPI pattern,
a communicator is always created from other communicators in a collective operation involving all the members of the
original communicator. For this reason, create_comm_compute must also be executed by the root unit, returning null since
root is not included in the group of the created communicator.3
Notice that a communicator is not explicitly passed to create_comm_compute. In this case, the communicator of the
enclosing parallel method (world_comm) is implicitly passed to create_comm_compute, but renamed to my_comm in its
scope.
The implicit declaration and passing of communicators are only syntactic sugar for simplifying the use of communicators
in simple parallel programs. They are the reason for the use of brackets for delimiting declarations of communicators. In fact,
there are two situations where it is useful to explicitly declare a communicator identifier:
The rankof and ranksof operators. In Figs. 8 and 9, the operators rankof and ranksof are used to determine the rank of given
units in the communicator of a parallel class method. The rankof operator must be applied to the identifier of a singleton unit
of the p-object, returning an integer that is the rank of the unit in the communicator of the parallel class method, whereas
ranksof must be applied to the identifier of a parallel unit, returning an array of integers. They are useful and allowed only
in the code of parallel class methods of p-objects that have distinct units. In the first call to rankof or ranksof, communication
takes place to determine the ranks of the units of the p-object in the current communicator. For this reason, in the first call to a
parallel class method all the involved units must call rankof/ranksof collectively. In subsequent calls, such synchronization is not necessary, since the units remember the calculated ranks.
3 It is important to emphasize that this is a requirement of MPI for making it possible to create communicators involving a subgroup of the group of processes of an existing communicator. This requirement must be followed by any parallel programming language or scientific computing library implemented on top of MPI.
   void MatrixMultiplier::cell::calculate_ranks_neighbors
      (CartesianCommunicator& comm, int shift_x, int shift_y,
       int* west_rank, int* east_rank, int* north_rank, int* south_rank)
   {
      int dim_size_x = comm.Dimensions[0];
      int dim_size_y = comm.Dimensions[1];
      int i = comm.coordinates[0];
      int j = comm.coordinates[1];
As in MPI, communicators carry topological information about the units. In the default case, units are organized linearly,
with ranks defined by consecutive integers from 0 to size − 1, where size is the number of processes. Alternatively, units
may be organized according to a cartesian topology in N dimensions, having Pi units in each dimension, for i from 0 to N − 1.
The most general topology is the graph topology, where each unit is associated to a set of adjacent units. In general, the unit
topologies of p-objects follow the communication structure of the underlying algorithm their parallel methods implement.
Such information can be used by the runtime system to better map units onto the processing nodes of the parallel computing platform, trying to balance the computational load and minimize communication overheads. In the example
of Fig. 8, the communicator comm_compute has a cartesian topology with two dimensions and wraparound links, whereas
comm_data has the default linear topology.
The communication operations supported by a communicator are those supported by communicator objects in
Boost.MPI and MPI.NET, both point-to-point and collective subsets. Communicators also support operations to identify
units and their relative location in the underlying communicator topology. These operations are dependent on the topology
described by the communicator:
• Linear topology: The operation size returns the number of processes in the communicator group; the operation rank
returns the process identification, which is an integer ranging from 0 to size − 1;
• Graph topology: The operations rank and size have the same meaning as for linear topologies. In addition, the operation
neighbors returns an array of integers containing the ranks of the adjacent units of the current unit. Similarly, if the
current unit wants to know the ranks of the adjacent units of another unit, using its rank, then it may use operation
neighborsOf. The operation num_edges returns the number of edges of the graph of units. Finally, the operation edges returns an adjacency matrix representing the graph.
• Cartesian topology: The operations rank, size, neighbors and num_edges are all supported by these kinds of
communicators, since a cartesian topology is a special case of graph topology. In addition, the operation dimensions
returns an array of integers containing the length of each dimension, whereas the operation coordinates returns
the coordinate of the unit in each dimension. The number of dimensions is the length of these arrays. The periodic
operation returns an array of boolean values that says if a dimension is periodic or not. Finally, there are the operations
getCartesianRank, which returns the rank of a unit in a given set of coordinates, and getCartesianCoordinates, which
returns the coordinates of a unit with a given rank. In the example of Fig. 8, the parallel method compute calls the unit method calculate_ranks_neighbors twice to calculate the ranks of the four neighbors of the current cell in the two-dimensional mesh, at distance shift_x in the x direction (west-east) and shift_y in the y direction (north-south). The code of calculate_ranks_neighbors is shown in Fig. 10. The i and j coordinates of the current cell are determined by calling coordinates. After calculating the coordinates of the neighbor cells, their ranks are obtained by calling getCartesianRank, as sketched below.
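For instance, the body of such a neighbor calculation might be sketched as follows (this is a sketch, not the remainder of Fig. 10; the argument convention of getCartesianRank and the mapping of coordinates to directions are assumptions):

   // wraparound neighbors of the unit at coordinates (i, j), at distances
   // shift_x (west-east) and shift_y (north-south)
   int dim_x = comm.dimensions[0];
   int dim_y = comm.dimensions[1];
   int i = comm.coordinates[0];
   int j = comm.coordinates[1];

   int west[2]  = { i, (j - shift_x + dim_y) % dim_y };
   int east[2]  = { i, (j + shift_x) % dim_y };
   int north[2] = { (i - shift_y + dim_x) % dim_x, j };
   int south[2] = { (i + shift_y) % dim_x, j };

   *west_rank  = comm.getCartesianRank(west);
   *east_rank  = comm.getCartesianRank(east);
   *north_rank = comm.getCartesianRank(north);
   *south_rank = comm.getCartesianRank(south);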
In OOPP, objects cannot be transmitted through communicators, but only values of non-object data types, primitive and
structured ones. This is not intended to simplify the implementation effort, since Boost.MPI already gives support for object
transmission. In fact, we consider communication of objects a source of performance overheads which are difficult to predict,
due to marshaling/unmarshaling (serialization) requirements. Performance prediction is an important requirement of HPC
applications. Indeed, this restriction makes it possible to optimize communication primitives.
The above restriction is not too restrictive for programming, since object transmission may be easily implemented by
packing the state of the object in a buffer, sending it through a message to the destination, and loading the packed state
into the target object. However, we are considering introducing two linguistic abstractions to deal with use cases of object transmission through message-passing, so-called migration and (remote) cloning. Migration is a variant of cloning where the
reference to the object in the original address space is lost after the copying procedure.
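For example, the manual transfer of an object's state between two units of a p-object might be sketched as follows (the attribute names, the ranks and the templated send/receive calls follow the style of Fig. 8, but are assumptions):

   // sender unit: pack the relevant state into a buffer of primitive values
   double buffer[3];
   buffer[0] = n1;
   buffer[1] = n2;
   buffer[2] = (double) i;
   comm.isend<double>(dest_rank, 0, buffer, 3);

   // receiver unit: unpack the buffer into the attributes of the target object
   double buffer[3];
   comm.recv<double>(src_rank, 0, buffer, 3);
   n1 = buffer[0];
   n2 = buffer[1];
   i  = (int) buffer[2];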
In summary, a p-class may declare two kinds of methods:
• Unit methods, defined in the unit scope, with no access to the p-object communicator.
• Parallel methods, declared either in the class scope (parallel class method) or in the unit scope (parallel unit method).
As depicted in Fig. 4, a parallel method invocation is composed of a set of local method invocations between pairs of units of
the caller and the callee p-objects that are located in the same processing nodes. Thus, there is no remote method invocation.
In most parallel programs, since parallel methods are the only synchronization and communication points among units that reside in distinct processing nodes, it is expected that the calls to a parallel method performed by the units of a p-object complete together, avoiding excessive synchronization overheads and deadlocks caused by synchronous point-to-point operations and barrier synchronizations between the units involved in a parallel method invocation. In such cases, the number of calls to the same parallel method made by each unit would be the same.4
However, such a restriction cannot be enforced statically, giving programmers the responsibility for the coherent use of parallel methods, as in MPI programming. Fortunately, we view the flexibility of letting synchronization between parallel methods be explicitly controlled by the programmer as an interesting opportunity to investigate non-trivial parallel synchronization techniques.
Inter-object communication by method invocations makes it unnecessary to introduce the concept of inter-
communicators, supported by MPI, in PObC++. According to the MPI standard, inter-communicators make message exchange between groups of processes in disjoint communicators (intra-communicators) possible.
3.4. Instantiation
A p-object is instantiated by another p-object by the collective instantiation of each one of its units in distinct processing nodes, using the usual C++ new operator applied to the unit identification, which has the form ⟨class_name⟩::⟨unit_name⟩.
This is illustrated in the code of the p-class Main of Fig. 8, where the units MatrixMultiplier::manager and MatrixMultiplier::cell
are instantiated.
No communication occurs between the units during instantiation. As described before, communication only occurs in
parallel method invocations and the caller is responsible for creating an appropriate communicator and passing it to the
parallel method it wants to invoke.
The instantiation of a p-object is an indivisible operation. However, since there is no communication or synchronization
in the instantiation procedure, units may be instantiated at different points of execution. The programmer is responsible for ensuring that all units of a p-object are properly instantiated before invoking any of their parallel methods. It is a programming error to forget to instantiate one or more units. Notice that there is no restriction on the number of instances of a parallel unit that may be instantiated in distinct processing nodes. The identification of each unit is defined by the communicator passed to each
parallel method invocation and may change across distinct invocations.
3.5. Implementation
PObC++ is an open source project hosted at http://pobcpp.googlecode.com, released under the BSD license. It is composed
of a compiler and a standard library.
3.5.1. Compiler
In order to rapidly produce a fast and reliable compiler prototype, a source-to-source compiler written in C++ was
designed by modifying Elsa, an Elkhound-based C/C++ parser. Elkhound [36] is a parser generator that uses the GLR parsing
algorithm, an extension of the well-known LR algorithm that handles nondeterministic and ambiguous grammars. In
4 It is not correct to say that a violation of this programming recommendation always leads to programming errors. In certain situations, a unit of a p-object may have no work to perform in a call to a parallel method. In such a case, if the parallel method is not invoked for this unit, no error occurs. However, this is an exceptional situation, probably due to a bad parallel algorithm design, which may lead to programming errors since the programmer must infer statically which calls to the parallel method do not involve a given unit.
particular, Elkhound’s algorithm performs disambiguation during the type checking process in order to generate a valid
AST (Abstract Syntax Tree). The phases of the PObC++ compiler are depicted in Fig. 11.
The modifications performed during the third phase transform each unit of a p-class into a valid C++ class, augmented
with meta-information about the units defined within the p-class. These adjustments generate only valid C++ code and do not interfere with the rest of the program. In fact, only the code containing p-class declarations needs to be compiled by the
PObC++ compiler.
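Purely as an illustration of this transformation (the actual output of the PObC++ compiler is not shown here, so everything below is an assumption), a unit such as a of the p-class Hello1 might be lowered to an ordinary C++ class carrying meta-information about its p-class:

   // Hypothetical lowering sketch: the unit 'a' of p-class Hello1 becomes a
   // plain C++ class; class attributes are replicated into each unit class and
   // meta-information about the enclosing p-class is attached statically.
   #include <iostream>

   namespace pobcpp_generated {
     class Hello1_a {
     public:
       void sayHello() { std::cout << "Hello! I am unit a"; }

       // meta-information recorded by the compiler (illustrative only)
       static const char* pclass_name()     { return "Hello1"; }
       static int         number_of_units() { return 4; }
     };
   }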
C++ code can be used without modification in PObC++ programs. Thus, virtually any C/C++ library can be integrated with
PObC++ programs. Since the programmer uses the PObC++ compiler to generate C++, any C++ compiler may be used to
generate the native code. Such features are essential to promote the straightforward integration of PObC++ programs with
scientific libraries and legacy code written in C and C++. This is the motivation for the case study that will be presented in
Section 4.2.
4. Case studies
The case studies presented in this section intend to give an overview of programming techniques, expressiveness, and
the potential performance of programs written in the OOPP style.
The first case study presents a parallel numerical integrator that demonstrates skeletal-based parallel programming in
PObC++. Also, it presents a performance evaluation where a PObC++ program is compared to its version written purely
in C++/Boost.MPI, aiming at providing evidence that the OOPP abstractions supported by PObC++, which make the use of high-level programming techniques such as skeletons possible, do not add significant overhead to the performance of parallel programs, using C++/MPI programming as a baseline.
The second case study illustrates the integration of PObC++ with existing scientific computing libraries, an important
requirement for the acceptance of this language among programmers in scientific and engineering domains. For this, we have created an OOPP interface for a subset of PETSc, a widely used library of subroutines for the solution of sparse algebraic systems, available for Fortran, C, and C++. According to its developers, PETSc follows an object-based design, despite using a procedural interface, since it must support Fortran and C. For this, it treats matrices, vectors and solvers as objects, providing an interface for their instantiation and handling. Therefore, OOPP provides a complete object-oriented alternative to the PETSc
interface and implementation, providing all the benefits of OOP to PETSc programmers.
The third case study illustrates abstraction and encapsulation of parallelism interaction in p-objects. An abstract p-class
named Sorter has two subclasses named BucketSort and OddEvenSort. They implement distinct parallel algorithms for
sorting an array of integers, which use different patterns of process interaction. The user of a Sorter p-object does not need to know how the interaction between sorting units takes place. Using the IS kernel of the NAS Parallel Benchmarks (NPB) [40] for implementing the bucketsort algorithm, a performance evaluation compares the performance of a PObC++ program with its version written in C/MPI, thus accounting not only for the overheads of the OOPP abstractions, but also for the overheads of object-orientation and of the Boost.MPI indirections.
We have implemented a parallel numerical integrator using two p-classes: Farm and Integrator. The former is an
abstract p-class implementing the farm skeleton, an abstraction of a well-known parallelism strategy,
including its pattern of inter-process communication. The Integrator class extends Farm to implement parallel
numerical integration using farm-based parallelism. We have reused the numerical integration algorithm implemented
by the NINTLIB library, based on Romberg's method [37]. NINTLIB is not parallel; therefore, it was necessary to
parallelize it.
Skeletal programming. The term algorithmic skeleton was first coined by Murray Cole two decades ago to describe reusable
patterns of parallel computations whose implementation may be tuned to specific parallel execution platforms [38].
Skeletons have been widely investigated by the academic community, being considered a promising approach for high-
level parallel programming [6]. Skeletal programming is an important parallel programming technique of OOPP.
5 CENAPAD-UFC is composed of 48 Bull 500 blades, each with two Intel Xeon X5650 processors and 24 GB of DDR3 1333 MHz RAM. Each
processor has 6 Westmere EP cores. The processors communicate through an Infiniband QDR (36-port) interconnect. More information about CENAPAD
is available at http://www.cenapad.ufc.br and https://www.lncc.br/sinapad/.
(a) the Farm p-class:

    template <typename Job, typename Result>
    class Farm
    {
    public:
        void synchronize_jobs();
        void synchronize_results();

        unit manager
        {
        private:
            Job*    all_jobs;
            Result* all_results;
        public:
            void add_jobs(Job* job);
            Result get_next_result();
            Result* get_all_results();
            virtual void* pack_jobs(Job* jobs);
            virtual Result unpack_result(void* result);
        };

        parallel unit worker
        {
        private:
            Job*    local_jobs;
            Result* local_results;
        public:
            parallel void perform_jobs();
            virtual Result work(Job job);
            virtual Job unpack_jobs(void* jobs);
            virtual void* pack_result(Result* result);
        };
    };

(b) the Integrator p-class:

    class Integrator :
        public Farm<IntegratorJob, double>
    {
        unit manager
        {
        private:
            int inf, sup;
            int dim_num, partition_size;
        public:
            Manager(int inf, int sup,
                    int dim_num, int psize) :
                inf(inf), sup(sup),
                dim_num(dim_num),
                partition_size(psize) { }
        public:
            void generate_subproblems();
            double combine_results();
        };

        parallel unit worker
        {
        private:
            int number_of_partitions;
            int next_unsolved_subproblem;
            int tolerance;                     // initialized by the constructor below
            double (*function)(double*);
        public:
            Worker(double (*f)(double*),
                   int tol, int nop) :
                function(f),
                number_of_partitions(nop),
                next_unsolved_subproblem(0),
                tolerance(tol) { }
        };
    };

Fig. 12. The Farm class (a) and the Integrator class (b).
    class IntegratorMain
    {
    public:
        int main();
        (...)
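The listing above is truncated. As a rough sketch only, under the assumption that each process instantiates and drives its own unit of the Integrator p-object (the constructor arguments and variable names below are illustrative), the main program might proceed as follows:

    // Hypothetical driver, written against the methods declared in Fig. 12.
    // On the manager unit's process:
    Integrator::manager m(0, 1, dim_num, dim_partition_size);
    m.generate_subproblems();          // split [0,1]^n into num_jobs integration jobs
    m.synchronize_jobs();              // inherited from Farm: distribute jobs to the workers
    m.synchronize_results();           // inherited from Farm: collect the partial results
    double integral = m.combine_results();

    // On each worker unit's process:
    Integrator::worker w(f, tolerance, number_of_partitions);   // f is the integrand
    w.synchronize_jobs();              // receive this worker's share of the jobs
    w.perform_jobs();                  // run NINTLIB's Romberg routine on each job
    w.synchronize_results();           // return the partial results to the manager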
Table 1
Average execution times (in seconds) and confidence intervals (α = 5%) comparing PObC++ to C++/MPI for different workloads (n) and numbers of
processing nodes (P) in parallel multidimensional integration. Each group reports the lower/upper overhead estimations (δmin/δmax); Diff indicates
whether the difference is statistically significant (y) or not (n).

         n = 6                                       n = 7                                        n = 8
P    PObC++      C++/MPI     δmin/δmax     Diff   PObC++       C++/MPI      δmin/δmax     Diff   PObC++      C++/MPI     δmin/δmax     Diff
1    5.93±.001   5.83±.007   +1.5%/+1.8%    y     109.1±.018   107.8±.089   +1.0%/+1.2%    y     1971±.354   1942±2.09   +1.4%/+1.7%    y
2    3.00±.002   2.95±.003   +1.3%/+1.6%    y     55.0±.010    54.5±.031    +1.0%/+1.1%    y     997±.311    983±.848    +1.4%/+1.6%    y
4    1.50±.001   1.49±.001   +1.1%/+1.4%    y     27.7±.013    27.3±.019    +0.9%/+1.1%    y     500±.144    493±.641    +1.1%/+1.4%    y
8    0.76±.001   0.75±.001   +0.6%/+1.1%    y     13.8±.040    13.9±.113    −1.2%/+1.1%    n     252±.228    247±.337    +1.2%/+1.6%    y
16   0.38±.001   0.39±.005   −5.1%/−2.2%    y     6.9±.011     7.0±.033     −1.9%/−0.7%    y     125±.058    124±.156    −0.5%/−0.1%    y
32   0.20±.003   0.20±.003   −2.9%/+2.6%    n     3.6±.046     3.5±.054     −3.2%/+2.4%    n     64.9±.911   63.4±.764   −0.3%/+4.9%    n
Seq  5.43                                         104.1                                          1921
Table 2
Speedup of parallel multidimensional integration with PObC++ and C++/MPI.

         n = 6                 n = 7                 n = 8
P    PObC++   C++/MPI     PObC++   C++/MPI     PObC++   C++/MPI
1     0.9       0.9         0.9      0.9         0.9      0.9
2     1.8       1.8         1.8      1.9         1.9      1.9
4     3.6       3.6         3.7      3.8         3.9      3.8
8     7.1       7.1         7.5      7.4         7.8      7.7
16   14.2      13.9        15.0     14.8        15.7     15.4
32   27.1      27.1        28.9     29.7        30.3     30.2

n = number of dimensions.
The average execution times and confidence intervals presented in Table 1 have been calculated from a sample of 40
executions. For the purposes of the analysis, it is supposed that the distribution of the observed execution times is normal.
For improved reliability, outliers have been discarded using the Box Plot method with k = 4.0 (the factor that defines the upper and lower
fences). Without the outliers, each sample still has more than 30 observations, which is recommended for ensuring statistical
confidence according to the literature [41]. The clusters have been dedicated to the experiment, resulting in relatively low
standard deviations and, as a consequence, tight confidence intervals, contributing to the reliability of the analysis. All the
measures and relevant statistical summaries for this experiment can be obtained at http://pobcpp.googlecode.com.
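Concretely, assuming the usual definition of the fences in terms of the quartiles $Q_1$, $Q_3$ and the interquartile range $IQR = Q_3 - Q_1$, an observed execution time $t$ is discarded as an outlier when

$$t < Q_1 - k \cdot IQR \quad \text{or} \quad t > Q_3 + k \cdot IQR, \qquad k = 4.0.$$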
In the experiment, the function

$$f(x_1, x_2, x_3, \ldots, x_n) = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$$
is integrated over the interval [0, 1] in each dimension, in parallel. For that, the parameter dim_partition_size defines the number of
equal-size partitions of the interval in each dimension, yielding num_jobs = dim_partition_size^n integration subproblems. It is
also necessary to define the parameter number_of_partitions, required by the procedure NINTLIB.romberg_nd, as a multiple
of num_jobs, in order to preserve the amount of computation performed by the sequential version. Empirically, we set
dim_partition_size = 2 and number_of_partitions = 8, varying n between 6 and 8.
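The following minimal sketch (a hypothetical helper, not taken from the paper's listings) makes the job count explicit:

    #include <cmath>

    // Number of integration subproblems when each dimension of the unit
    // hypercube is split into dim_partition_size equal parts.
    long num_jobs(int dim_partition_size, int n) {
        return static_cast<long>(std::pow(dim_partition_size, n));   // dim_partition_size^n
    }

    // With dim_partition_size = 2, the experiment yields 64, 128 and 256
    // subproblems for n = 6, 7 and 8, respectively.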
In Table 1, the lower (δmin) and upper (δmax) estimations of the overhead of PObC++ in relation to C++/MPI are presented for each pair
(P, n). They are calculated as follows. Let [x0, x1] and [y0, y1] be the confidence intervals of the observations of a
pair (P, n) for PObC++ and C++/MPI, respectively. Then, δmin = (x0 − y1)/ȳ and δmax = (x1 − y0)/ȳ, where ȳ stands for the
average execution time of C++/MPI, since the overhead is a relative measure of the additional execution time when PObC++ is
used instead of C++/MPI. Thus, a negative value for an overhead estimation means that PObC++ outperformed C++/MPI
for the given workload (n) and number of processing nodes (P). If the lower and upper overhead estimations have different
signs, it is not possible to state, with statistical confidence at 5%, that either PObC++ or C++/MPI outperformed
the other. On the other hand, if both estimations have the same sign, either PObC++ (negative) or C++/MPI (positive) is
considered better.
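For instance, for P = 1 and n = 6 in Table 1, the intervals are $[x_0, x_1] = [5.929, 5.931]$ and $[y_0, y_1] = [5.823, 5.837]$, with $\bar{y} = 5.83$, so that

$$\delta_{min} = \frac{5.929 - 5.837}{5.83} \approx +1.5\%, \qquad \delta_{max} = \frac{5.931 - 5.823}{5.83} \approx +1.8\%,$$

matching the entries reported in the table.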
The overhead estimations in Table 1 show that C++/MPI outperforms PObC++ in almost every configuration with 8 or fewer processing nodes. For
16 and 32 nodes, either PObC++ outperforms C++/MPI or neither one is better than the other. Also, it is important to note
that only 2 out of the 18 positive overhead estimations are greater than 2.0% (+2.4% and +4.9%). This is strong evidence
that PObC++ is a good alternative to C++/MPI, since the performance degradation is insignificant compared
to the gains in modularity and abstraction offered by object-orientation. The performance
equivalence of PObC++ and C++/MPI is not a surprise, since the PObC++ version is translated to a C++/MPI program that is
almost equivalent to a C++/MPI version built from scratch, except for the call indirections from the communicator interface to
the MPI subroutines. The essential difference is the better object-oriented properties achieved by PObC++, with potential
gains in modularity, abstraction, and usability. The same techniques used by an MPI programmer may be applied almost
directly in PObC++.
    class ParallelVec
    {
    public:
        PetscErrorCode Create();
    private:
        Vec vec;

        parallel unit pvec { };
    };

    PetscErrorCode
    ParallelVec::pvec::Create()
    {
        return
            VecCreate(comm->get_mpi_comm(),
                      &vec);
    }

    class ParallelKSP
    {
    public:
        PetscErrorCode Create();
        PetscErrorCode SetOperators(ParallelMat::pmat& Amat,
                                    ParallelMat::pmat& Pmat,
                                    MatStructure flag);
        PetscErrorCode Solve(ParallelVec::pvec& b,
                             ParallelVec::pvec& x);
        /* Other methods: tolerance control, preconditioning, etc. */
        ...
    private:
        KSP ksp;

        parallel unit ksp { };
    };
Table 2 presents a measure of the scalability of the implementation of parallel multidimensional integration used in the
experiments. Notice that the speedup is almost linear and increases as the workload (n) increases.
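Here, assuming the usual definition, the speedup for P processing nodes and workload n is taken relative to the sequential execution times reported in the last row of Table 1:

$$S(P, n) = \frac{T_{seq}(n)}{T_{par}(P, n)}.$$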
PETSc is a library for scientific computing designed as a set of data structures and subroutines to solve linear and non-
linear algebraic systems in parallel, using MPI. It is widely used by engineering and scientific programmers in the numerical
solution of partial differential equations (PDEs) that describe phenomena of interest. PETSc follows an object-based
design on top of non-OOP languages, such as Fortran and C. This section presents an OOPP interface for PETSc, restricted to
three important modules, demonstrating the integration of PObC++ with scientific computing libraries.
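As a rough usage sketch only, written per process (unit) against the p-class interfaces reconstructed above, and assuming that ParallelMat mirrors ParallelVec and that the assembly of the matrix and right-hand side is done elsewhere, a solve might look as follows:

    ParallelVec::pvec x, b;
    x.Create();                                    // wraps VecCreate on the unit's communicator
    b.Create();

    ParallelMat::pmat A;                           // interface assumed analogous to ParallelVec
    A.Create();
    /* ... assemble A and b from the discretized PDE ... */

    ParallelKSP::ksp solver;
    solver.Create();                               // assumed to wrap KSPCreate
    solver.SetOperators(A, A, SAME_NONZERO_PATTERN);
    solver.Solve(b, x);                            // collective solve across the units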
This section presents an abstract p-class called Sorter, which is specialized by two p-classes that implement well-known
sorting algorithms: BucketSort and OddEvenSort implement bucketsort and odd–even sort, respectively.
    class Sorter
    {
    public:
        virtual void sort() = 0;

        parallel unit worker
        {
        public:
            void set(int* items, int size);
        protected:
            int* items;
            int  size;
        };
    };

    class BucketSort : Sorter
    {
    public:
        void sort();

        parallel unit worker
        {
        private:
            void local_sort(int* items,
                            int size);
            void fill_buckets(int* key_array_1,
                              int* key_array_2,
                              int* bucket_ptr,
                              int* bucket_size);
        };
    };
They illustrate abstraction and encapsulation in OOPP with respect to communication and synchronization
among the units of a p-object, since the communication sequences among the units of the two sorters follow distinct
parallel interaction patterns.
    /* Initialize data structures */
    ...
    /* Copy the elements of key_array to key_buffer_send, partially sorted
       according to their buckets. The value bucket_ptr[i] is the start index
       of the i-th bucket in key_buffer_send. The value bucket_size[i] is the
       number of items in the i-th bucket. */
    fill_buckets(key_array, key_buffer_send, bucket_ptr, bucket_size);

    /* Determine the global size of each bucket */
    comm.allreduce(bucket_size, Operation<int>.Add, bucket_size_totals);

    /* Determine how many local items will be sent to each process */
    comm.alltoall(send_count, recv_count);

    /* Send items to each process */
    comm.alltoallflattened(key_buffer_send, send_counts, key_buffer_recv, outValues);

    /* Sort the buckets */
    local_sort(key_buffer_recv, size);
    }
The listing above shows the implementation of the sort method of BucketSort using the second strategy. The source code is based on the IS (Integer Sorting) kernel of the
NAS Parallel Benchmarks [40], which implements bucketsort for evaluating the performance of clusters and MPPs (Massively
Parallel Processors).
There are two important points to note about this case study:
• An object-oriented parallel programmer must take care with the implicit assumptions of abstract p-classes concerning
the distribution of input and output data structures when deriving concrete p-classes from them, such as the assumption
of Sorter that the items are distributed across the workers;
• The two alternative solutions use collective communication methods of the communicator, but the user of the sorter does
not need to be aware of the communication operations performed by the worker units; this is completely transparent
from the perspective of the user of the sorter. Indeed, it is safe to change the parallelism strategy by choosing a distinct
subclass of the Sorter p-class that implements another parallelization strategy, as sketched below.
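A minimal sketch of this transparency, assuming that each worker unit exposes the p-class's sort() method in the way the per-unit implementations in the previous figures suggest, and with the set-up code purely illustrative:

    // Hypothetical client code running on each worker unit's process.
    Sorter::worker* sorter = new BucketSort::worker();   // or: new OddEvenSort::worker()
    sorter->set(local_items, local_size);   // hand this unit its slice of the data
    sorter->sort();                         // the communication pattern of the chosen
                                            // subclass remains hidden from the caller
    delete sorter;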
Table 3
Average execution times (in seconds) and confidence intervals (α = 5%) comparing PObC++ with C/MPI for different workloads (classes) and numbers of
processing nodes (P) using the kernel IS (Integer Sort) of NPB (NAS Parallel Benchmarks). Each group reports the lower/upper overhead estimations
(δmin/δmax); Diff indicates whether the difference is statistically significant (y) or not (n).

         Class B                                        Class C                                         Class D
P    PObC++       C/MPI        δmin/δmax      Diff   PObC++       C/MPI        δmin/δmax      Diff   PObC++       C/MPI        δmin/δmax      Diff
1    6.61±0.004   5.92±0.025   +11.0%/+12.0%   y     26.83±0.012  24.21±0.091  +10.4%/+11.3%   y     –            –            –               –
2    3.33±0.002   3.04±0.010   +9.1%/+9.9%     y     13.49±0.006  12.35±0.046  +8.7%/+9.6%     y     –            –            –               –
4    1.70±0.004   1.56±0.006   +8.4%/+9.7%     y     6.91±0.002   6.34±0.024   +8.6%/+9.4%     y     133.0±0.029  111.7±0.334  +18.7%/+19.4%   y
8    0.88±0.001   0.80±0.003   +8.7%/+9.8%     y     3.60±0.000   3.31±0.010   +8.3%/+8.9%     n     69.05±0.016  58.90±0.184  +16.9%/+17.6%   y
16   0.47±0.001   0.43±0.001   +8.5%/+9.7%     y     1.90±0.002   1.75±0.006   +7.7%/+8.5%     y     37.70±0.008  33.29±0.108  +12.9%/+13.6%   y
32   0.27±0.001   0.25±0.001   +7.3%/+8.0%     y     1.11±0.000   1.03±0.003   +6.9%/+7.6%     y     23.21±0.007  14.05±0.055  +8.9%/+9.5%     y
64   0.19±0.000   0.18±0.001   +5.1%/+6.0%     y     0.75±0.002   0.70±0.002   +5.3%/+6.4%     y     15.01±0.004  14.05±0.040  +6.6%/+7.2%     y
Due to the growing interest of the software industry in HPC techniques, mainly parallel processing, as well as the
growing importance of scientific and engineering applications in modern society, the increasing complexity of
applications in HPC domains has attracted the attention of a significant number of researchers working on programming models,
languages and techniques. They face the challenging problem of reconciling well-known techniques for dealing with
software complexity and large scale in corporative applications with the high performance requirements demanded by
applications in science and engineering.
Object-oriented programming is considered one of the main responses of programming language designers for dealing
with high complexity and scale of software. Since the 1990s, such programming style has become widespread among
programmers. Despite their success among programmers in business and corporative application domains, object-oriented
languages do not have the same acceptance among programmers in HPC domains, mainly among scientists and engineers.
This is usually explained by the performance overhead caused by some features present in these languages for supporting
higher levels of abstraction, modularity and safety, and by the additional complexity introduced by parallel programming
support.
The results presented in this paper, including the design, implementation and performance evaluation of the first PObC++
prototype, are very promising. The examples are evidence that the proposed approach can coherently reconcile the
programming styles commonly adopted by parallel programmers and by object-oriented programmers, making it possible
for a programmer well versed in both MPI-based parallel programming and OOP to take rapid advantage of OOPP
features. Moreover, the performance results show tolerable overheads, given the gains in modularity and
abstraction when compared to direct MPI programming.
Acknowledgments
This work has been sponsored by CNPq, grant numbers 475826/2006-0 and 480307/2009-1.
References
[1] M. Baker, R. Buyya, D. Hyde, Cluster computing: a high performance contender, IEEE Computer 42 (7) (1999) 79–83.
[2] I. Foster, The Grid: Blueprint for a New Computing Infrastructure, first ed., Morgan Kaufmann, 1998.
[3] D.E. Post, L.G. Votta, Computational science demands a new paradigm, Physics Today 58 (1) (2005) 35–41.
[4] D.E. Bernholdt, J. Nieplocha, P. Sadayappan, Raising level of programming abstraction in scalable programming models, in: IEEE International
Conference on High Performance Computer Architecture (HPCA), Workshop on Productivity and Performance in High-End Computing (P-PHEC),
Madrid, Spain, IEEE Computer Society, 2004, pp. 76–84.
[5] J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, A. White, Sourcebook of Parallel Computing, Morgan Kaufmann Publishers, 2003
(Chapters 20–21).
[6] H. Kuchen, M. Cole, Algorithmic skeletons (special issue), Parallel Computing 32 (2006) 447–626.
[7] M. Cole, Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming, Parallel Computing 30 (3) (2004) 389–406.
[8] J. Dongarra, S.W. Otto, M. Snir, D. Walker, A message-passing standard for MPP and workstations, Communications of the ACM 39 (7) (1996) 84–90.
[9] E. Dijkstra, The humble programmer, Communications of the ACM 15 (10) (1972) 859–866.
[10] O.J. Dahl, SIMULA 67 Common Base Language, Norwegian Computing Center, 1968.
[11] O.J. Dahl, The birth of object orientation: the simula languages, in: Software Pioneers: Contributions to Software Engineering, Programming, and
Operating Systems Series, Springer, 2002, pp. 79–90.
[12] A. Goldberg, D. Robson, Smalltalk-80: the Language and its Implementation, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1983.
[13] H. Milli, A. Elkharraz, H. Mcheick, Understanding separation of concerns, in: Workshop on Early Aspects — Aspect Oriented Software Development,
AOSD’04, 2004, pp. 411–428.
[14] A. Taivalsaari, On the notion of inheritance, ACM Computing Surveys 28 (1996) 438–479. URL: http://doi.acm.org/10.1145/243439.243441.
[15] B. Meyer, Object-Oriented Software Construction, Prentice Hall, Upper Saddle River, NJ, USA, 1988.
[16] M. Baker, B. Carpenter, G. Fox, S.H. Ko, X. Li, mpiJava: a Java interface to MPI, in: Proceedings of the First UK Workshop on Java for High Performance
Network Computing, 1998.
[17] M. Baker, B. Carpenter, MPJ: a proposed Java message passing API and environment for high performance computing, in: IPDPS’00: Proceedings of the
15 IPDPS 2000 Workshops on Parallel and Distributed Processing, Springer-Verlag, London, UK, 2000, pp. 552–559.
[18] S. Mintchev, Writing programs in JavaMPI, Tech. Rep. MAN-CSPE-02, School of Computer Science, University of Westminster, London, UK, Oct. 1997.
[19] B.-Y. Zhang, G.-W. Yang, W.-M. Zheng, Jcluster: an efficient Java parallel environment on a large-scale heterogeneous cluster, Concurrency and
Computation: Practice and Experience 18 (12) (2005) 1541–1557. http://dx.doi.org/10.1002/cpe.986.
[20] D. Gregor, M. Troyer, Boost.MPI website, May 2010. URL: http://www.boost.org/doc/html/mpi.html.
[21] D. Gregor, A. Lumsdaine, Design and implementation of a high-performance MPI for C# and the Common Language Infrastructure, in: PPoPP’08:
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, New York, NY, USA, 2008, pp. 133–142.
http://doi.acm.org/10.1145/1345206.1345228.
[22] L.V. Kale, S. Krishnan, Charm++: a portable concurrent object oriented system based on C++, Tech. rep., Champaign, IL, USA, 1993.
[23] M. Philippsen, M. Zenger, JavaParty — transparent remote objects in java, Concurrency and Computation: Practice and Experience 9 (11) (1997)
1225–1242.
[24] T. Nguyen, P. Kuonen, ParoC++: a requirement-driven parallel object-oriented programming language, in: International Workshop on High-Level
Programming Models and Supportive Environments, IEEE Computer Society, Los Alamitos, CA, USA, 2003, p. 25.
http://doi.ieeecomputersociety.org/10.1109/HIPS.2003.1196492.
[25] T. Nguyen, P. Kuonen, Programming the grid with pop-C++, Future Generation Computer Systems 23 (1) (2007) 23–30.
http://dx.doi.org/10.1016/j.future.2006.04.012.
[26] Y. Aridor, M. Factor, A. Teperman, T. Eilam, A. Schuster, A high performance cluster jvm presenting pure single system image, in: JAVA’00: Proceedings
of the ACM 2000 conference on Java Grande, ACM, New York, NY, USA, 2000, pp. 168–177. http://doi.acm.org/10.1145/337449.337543.
[27] V. Sarkar, X10: an object-oriented approach to non-uniform cluster computing, in: OOPSLA’05: Companion to the 20th Annual ACM SIGPLAN Conference
on Object-Oriented Programming, Systems, Languages, and Applications, ACM, New York, NY, USA, 2005, p. 393.
http://doi.acm.org/10.1145/1094855.1125356.
[28] B.L. Chamberlain, D. Callahan, H.P. Zima, Parallel programmability and the chapel language, International Journal of High Performance Computing
Applications 21 (3) (2007) 291–312. http://dx.doi.org/10.1177/1094342007078442.
[29] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessn, S. Ryu, G. Steele Jr., S. Tobin Hochstad, The Fortress Language Specification Version 1.0, Mar.
2008.
[30] K.A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P.N. Hilfinger, S.L. Graham, D. Gay, P. Colella, A. Aiken, Titanium: a high-
performance Java dialect, in: Java for High-performance Network Computing, Concurrency: Practice and Experience 10 (11–13) (1998) 825–836
(special issue).
[31] E. Lusk, K. Yelick, Languages for high-productivity computing — the DARPA HPCS language support, Parallel Processing Letters 1 (2007) 89–102.
[32] F.H. Carvalho Jr., R. Lins, R.C. Correa, G.A. Araújo, Towards an architecture for component-oriented parallel programming, Concurrency and
Computation: Practice and Experience 19 (5) (2007) 697–719.
[33] A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, second ed., Addison-Wesley, 2003.
[34] S.L. Johnsson, T. Harris, K.K. Mathur, Matrix multiplication on the Connection Machine, in: Proceedings of the 1989 ACM/IEEE Conference on
Supercomputing, Supercomputing’89, ACM, New York, NY, USA, 1989, pp. 326–332. URL: http://doi.acm.org/10.1145/76263.76298.
[35] F. Bertran, R. Bramley, A. Sussman, D.E. Bernholdt, J.A. Kohl, J.W. Larson, K.B. Damevski, Data redistribution and remote method invocation in parallel
component architectures, in: 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2005.
[36] S.G. McPeak, Elkhound: a fast, practical GLR parser generator, Tech. Rep., Berkeley, CA, USA, 2003.
[37] J. Burkardt, NINTLIB — Multi-dimensional quadrature, web page. http://people.sc.fsu.edu/~burkardt/f_src/nintlib/nintlib.html.
[38] M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation, Pitman, 1989.
[39] OpenMP Architecture Review Board, OpenMP: Simple, Portable, Scalable SMP Programming, 1997. URL: www.openmp.org.
[40] D.H. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo, M. Yarrow, The NAS Parallel Benchmarks 2.0, Tech. Rep. NAS-95-020, NASA Ames
Research Center, Dec. 1995, http://www.nas.nasa.org/NAS/NPB.
[41] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley-
Interscience, New York, NY, 1991, ISBN: 0471503361.
[42] K.E. Batcher, Sorting networks and their applications, in: Proceedings of the AFIPS Spring Joint Computer Conference, vol. 32, 1968, pp. 307–314.