A Solution To The Problem of Parallel Programming
Edward Givelberg
November 2018
The problem of parallel programming is the most important open problem of computer engineering. We show that object-oriented languages, such as C++, can be interpreted as parallel programming languages, and standard sequential programs can be parallelized automatically. Parallel C++ code is typically more than ten times shorter than the equivalent C++ code with MPI. The large reduction in the number of lines of code in parallel C++ is primarily due to the fact that communications instructions, including packing and unpacking of messages, are automatically generated in the implementation of object operations. We believe that implementation and standardization of parallel object-oriented languages will drastically reduce the cost of parallel programming. This work provides the foundation for building a new computer architecture, the multiprocessor computer, including an object-oriented operating system and a more energy-efficient, easily programmable parallel hardware architecture. The key software component of this architecture is a compiler for object-oriented languages. We describe a novel compiler architecture with a dedicated back end for the interconnect fabric, making the network a part of a multiprocessor computer, rather than a collection of pipes between processor nodes. Such a compiler exposes the network hardware features to the application, analyzes its network utilization, optimizes the application as a whole, and generates the code for the interconnect fabric and for the processors. Since the information technology sector's electric power consumption is very high, and rising rapidly, implementation and widespread adoption of the multiprocessor computer architecture will significantly reduce the world's energy consumption.

1 Introduction

Parallel programming is considered to be very difficult. Over the past several decades countless parallel programming languages, specialized systems, libraries and tools have been created. Some of these, such as Google's MapReduce, provided adequate answers for specific problem domains, however no general solution has emerged. In this paper we show that, surprisingly, a simple and general solution to this problem exists, based on parallel interpretation of established and widely used object-oriented languages, such as C++, Java and Python. We propose to incorporate the parallel interpretation into the standards of object-oriented programming languages. Programmers competent in these languages will be able to write parallel code with ease, and to translate it to an executable object simply by specifying an appropriate compiler flag.

Practically all computing devices today have multiple processors. By 2004 the speed of individual processors reached a peak, and today faster computation is possible only by deploying multiple processors in parallel on the given task. This has proved to be very difficult, even when the task is in principle trivially parallelizable. Parallel programming is slow, tedious and requires additional expertise; debugging is extremely difficult. In short, the cost of parallel programming is very high. It is now the main obstacle to the solution of computational problems. We are still seeing improvements in hardware, and although the speed of individual processors remains bounded, networking bandwidth is expected to increase [2], and the cost of hardware components continues to decline. It is possible to build computing systems with a very large number of cheap components, but the hardware cannot be adequately utilized without parallel programming. Almost all software is sequential. We do not know how to build easily programmable parallel computers. This paper lays out a plan for doing this.
After 2004 CPU manufacturers started producing multi-core CPUs in an attempt to boost performance with parallel computing, but their primary focus moved to improving the energy efficiency of computation. The energy cost can be significantly reduced when multiple cores and shared memory are placed on the same integrated circuit die. The processing cores and the RAM are typically interconnected using a bus, which becomes a bottleneck if the number of cores is increased. In practice, this architecture is limited to about a dozen cores. This problem is resolved by building a network on the chip (NoC), which makes CPUs with thousands of cores possible. Such systems have been shown to be very energy efficient, but programming them is difficult, and they are still in the early stages of research and development. Presently, the information technology sector consumes approximately 7% of global electricity [7], and this number is growing rapidly. A solution to the problem of parallel programming is likely to lead to substantial improvement in the energy efficiency of computing systems, and to significantly affect the world's energy consumption.

It is probably impossible to survey all of the ideas that have been proposed to tackle the problem of parallel programming, but the two most important models are shared memory and message passing. The shared memory model is based on the naive idea that the computation should be constructed using common data structures, which all computing processes use simultaneously. Today, most of the parallel computing software uses threads, which are sequential processes that share memory. The problems with threads are eloquently described in [15]. Threads are a terrible programming model. Despite this, and the fact that shared memory is not scalable, programming frameworks with artificially generated global address space have been proposed. The message passing programming paradigm, on the other hand, acknowledges that processes generally reside on distant CPUs and need to exchange messages. Most of the scientific computing software uses the Message Passing Interface (MPI), which over the years has become a bloated collection that includes both low-level messaging routines of various kinds and library functions encoding complex algorithms on distributed data structures. Large-scale computations attempting to maximize the performance of a cluster of multicore CPUs need to combine shared memory and message passing, incurring significant additional programming cost.

Shared memory and message passing share a common fundamental flaw: both approaches use processes, and co-ordinating multiple concurrent processes in a computation is a nearly impossible programming task. We draw an analogy to the problem with the goto statement, which was discredited in the 1960s and 1970s. Unrestrained use of goto leads to an intractable number of possible execution sequences. Similarly, a collection of parallel processes generates an intractable number of possible time-ordered sequences of events. The concept of a computational process is not a suitable abstraction for parallel programming.

The solution lies in object-level parallelism. We perceive the world primarily as a world of objects, a world where a multitude of objects interact simultaneously. From the programmer's point of view interactions between objects are meaningful and memorable, unlike interactions of processes exchanging messages. Object-oriented languages have been around for decades and today they are the most widely used programming languages in practice, yet, remarkably, all existing object-oriented programming languages are sequential. The inherent natural parallelism of the object-oriented programming paradigm has not been understood. We do not know of any proposal to interpret an existing object-oriented programming language as a parallel programming language.

In section 2.1 we define an abstract framework for parallel object-oriented computation. Throughout the rest of this paper we use C++, but our results apply to a variety of object-oriented languages. In section 3 we show that C++ can be interpreted as a parallel programming language within this framework without any change to the language syntax. Parallel interpretation of C++ differs from the standard sequential C++, but in section 3.3 we describe a new technique for automatic parallelization of sequential code. Standard sequential C++ code can be ported to run on parallel hardware either automatically or with a relatively small programming effort.

In section 4 we show that parallel C++ is a powerful and intuitive language where object-oriented features are more naturally expressed than in standard sequential C++. In our experience programs written in parallel C++ are at least ten times shorter than the equivalent programs written in C++ with MPI. The large reduction in the number of lines of code in parallel C++ is primarily due to the fact that communications instructions, including packing and unpacking of messages, are automatically generated in the implementation of object operations.

For decades the central processing unit (CPU) was the most important component of the computer. Continuous improvements in application performance came mainly as a result of engineering increasingly faster CPUs. As a result, the complexity of applications is typically measured by estimating the number of multiplications they perform. In a multiprocessor system the CPU is no longer central, and the cost of moving data is not negligible with respect to the cost of multiplication, yet the prevailing view continues to be that all of the important work done by an application is carried out inside the CPUs, and the network is merely a collection of pipes between them. This state of technology is in stark contrast to observations in neuroscience which suggest that the network is much more important than the individual processors. Yet, the mechanism of computation in the brain is not understood.
model has been limited, and it does not provide a solution to the problem of parallel programming. Again, the reason is that processes exchanging messages are not a suitable model for parallel programming (see the Introduction).

The difficulty in identifying the relevant abstract concepts and understanding their significance can be appreciated by the fact that 45 years of research have failed to clarify the relationship between the actor model and object-oriented computing. The object-oriented model is a higher level of abstraction. The network, the computational processes and the messages are not included in the model. They are "implementation details". Yet, an object can do everything that an actor can, so actors are useful only because they can implement objects. In section 3 we show that in object-oriented languages remote objects can be constructed naturally, making the actor-model libraries obsolete. The object-oriented model supersedes the actor model.

3 Language interpretation

We show that C++ code, without any changes to the language syntax, can be executed in parallel using the abstract framework described in section 2.1.

3.1 Remote objects

Host * host = new Host("machine1");
Object * object = new(host) Object(parameters);
result = object->ExecuteMethod(some, parameters);

Figure 1: Construction of an object on a remote host. First, a virtual host host is constructed on a physical device machine1. The C++ "placement new" operator constructs an object on host and returns a remote pointer object, which is used to execute a method on the remote object.

The code in Figure 1 is compliant with the standard C++ syntax. It creates a virtual host, constructs an object on it and executes a remote method on that object. We interpret all pointers as generalized pointers. The virtual host object is provided by the operating system, and is associated with a physical device. We use the "placement new" operator of C++ to construct the object on a given virtual host. Using a generalized pointer, a method is executed on a remote object. By-value parameters are serialized and sent over the network to the remote host. Once the remote execution completes, the result is sent back. The treatment of by-reference parameters is more complicated: the simplest solution is to serialize the parameter, send it to the remote host and, upon completion of the method execution, to serialize it and send it back. When the parameter is a complex object, and the changes made by method execution are relatively small, there may be a more efficient method to update the original parameter object.

3.2 Causal asynchronous execution

Sequential interpretation of an object-oriented language precludes parallel computation with remote objects: whenever an object executes a method on another (remote) object, it is obliged to wait for the completion of this operation before executing the next statement. While it is possible for several objects to simultaneously execute methods on a given object, this will never happen if the application is started as a single object.

Imagining the object as an intelligent, living, breathing thing, it could proceed with its computation immediately after initiating remote method execution, and stop to wait for its completion only when its results are needed. We call this causal asynchronous execution. Causal asynchronous execution enables parallel computation and provides a natural way to co-ordinate objects, as shown in Figure 2.

bool completed = remote_object->ExecuteMethod();
// do something while the method is being executed
SomeComputation();
// wait for (remote) method completion
if (completed)
{
    // method execution has completed
    AnotherComputation();
}

Figure 2: Causal asynchronous execution. Checking the return value of a remote method execution suspends the execution until the remote method has completed. The execution of SomeComputation overlaps with communications in the previous statement.

In this example the purpose of the if statement is to suspend the execution of AnotherComputation until the value of the variable completed is set by the remote method.

Despite its simplicity, parallel C++, i.e. C++ with causal asynchronous interpretation, has great expressive power and is sufficiently rich to implement the most complex parallel computations. The programmer constructs a parallel computation by co-ordinating high-level actions on objects, while the underlying network communications are generated by the compiler. Computation and communication overlap naturally, as in the example in Figure 2, and large, complex objects can be sent over the network as parameters of remote object methods. This is ideally suited to utilizing high network bandwidth and avoiding the latency penalty incurred by small messages. In section 3.4 we introduce additional mechanisms for fine-grained control of parallelism.

3.3 Automatic code parallelization

The construction of virtual hosts described in section 3.1 allows the programmer to control the placement of remote objects, but this can be done automatically by the operating system. Virtual hosts can be created implicitly by the compiler, and the operating system
Figure 5: Example: MapReduce. The workers array is assigned in parallel, with each worker being constructed on its virtual host. The compute methods are also executed in parallel. We rely on the compiler to enforce causality in the execution of the reduction loop. It starts executing only after result[0] becomes available, and it executes sequentially, according to the definition of causal asynchronous execution.

The master process allocates workers on remote hosts, initiates a method execution on each worker and sums up the result. If the data[i] object is not located on host[i], it will be copied there over the network. This code is shorter and easier to write than the code that uses Google's library. Moreover, as we show in section 5, the parallel C++ compiler may be able to generate more efficient code by optimizing network operations.
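The code of Figure 5 itself is not reproduced in this extract. A minimal sketch in the same parallel C++ style, assuming a hypothetical Worker class with a compute method and the host, data and result arrays mentioned in the caption and in the text, could look like this:

Worker ** worker = new Worker * [N];
double * result = new double[N];
// the workers array is assigned in parallel, each worker constructed on its virtual host
for (int i = 0; i < N; i ++)
    worker[i] = new(host[i]) Worker();
// remote, asynchronous, in parallel;
// data[i] is copied to host[i] over the network if it is not already there
for (int i = 0; i < N; i ++)
    result[i] = worker[i]->compute(data[i]);
// the reduction loop starts only after result[0] becomes available,
// and then executes sequentially (causal asynchronous execution)
double sum = 0;
for (int i = 0; i < N; i ++)
    sum += result[i];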
4.3 Breadth-First Search on a large graph

Distributed BFS on a large graph is a standard benchmark problem [1]. We implemented a straightforward algorithm in C++/MPI using over 2000 lines of code, and in parallel C++ with less than 200 lines of code. The graph data is divided into N objects, each containing an array of vertices with a list of edges for each vertex. We create N virtual hosts, one for each available processor, and allocate a graph object on each host. The main object initiates the BFS by invoking the BuildTree method on each graph object (see Figure 6). The computation proceeds with several (typically

void Graph::BuildTree(VertexId root_id)
{
    int root_owner = VertexOwner(root_id);
    if (this->id() == root_owner)
        frontier.push_back(v[root_id]);
    EdgeList * E = new EdgeList[N];
    bool finished = false;
    while (!finished)
    {
        SortFrontierEdges(E);
        {
            // remote, asynchronous, in parallel
            for (int i = 0; i < N; i ++)
                graph[i]->SetParents(E[i]);
        }
        // finish BFS when all frontiers are empty
        finished = true;
        for (int i = 0; i < N; i ++)
            finished &= graph[i]->isEmptyFrontier();
    }
}

Figure 6: Building the BFS tree. The frontier is initialized with the root vertex by its owner. Iterations continue until all frontiers are empty. In each iteration the local frontier edges are sorted into N EdgeList lists, one for each graph object. Communication of all the lists begins thereafter. After the execution of all SetParents methods has finished, all graph objects are asked if their frontier is empty.
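To complement Figure 6, the following sketch shows the setup step described in the text, in which the main object creates the virtual hosts, allocates a graph object on each one and launches BuildTree on all of them. The Graph constructor arguments and the machine_name and root_id variables are illustrative assumptions, not the actual benchmark code.

Host ** hosts = new Host * [N];
Graph ** graph = new Graph * [N];
for (int i = 0; i < N; i ++)
{
    // one virtual host for each available processor
    hosts[i] = new Host(machine_name[i]);
    // graph partition i is constructed on host i
    graph[i] = new(hosts[i]) Graph(i, N);
}
// remote, asynchronous, in parallel: every partition builds its part of the BFS tree
for (int i = 0; i < N; i ++)
    graph[i]->BuildTree(root_id);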
class Array
{
public:
    Array(Domain * ArrayDomain, Domain * PageDomain);
    ~Array();
    void allocate(int number_of_devices, Device * d);
    void FFT1(int number_of_cpus, Host ** cpus);
private:
    Domain * ArrayDomain;
    Domain * PageDomain;
    ArrayPage **** page;    // 128^3 pointers
};

void Array::allocate(int number_of_devices, Device * d)
{
    for (int j1 = 0; j1 < N1; j1 ++)
        for (int j2 = 0; j2 < N2; j2 ++)
            for (int j3 = 0; j3 < N3; j3 ++)
            {
                int k = (j1 + j2 + j3) % number_of_devices;
                page[j1][j2][j3] = new(d[k]) ArrayPage(n1, n2, n3);
            }
}

Figure 9: The Array class. Domain is a helper class describing 3D subdomains of an array. ArrayPage is a small 3D array, which implements local array operations, such as the transpose12 and transpose13 methods. These operations are needed in the Fourier transform computation. Global array operations are implemented using the local methods of ArrayPage. Array pages are allocated in circulant order. The allocate method constructs N1 × N2 × N3 array pages of size n1 × n2 × n3 on a list of virtual devices. The dimensions are obtained from ArrayDomain and PageDomain, and in our case are all equal to 128.

class SlabFFT1
{
public:
    SlabFFT1(Array * array, int N20, int N21);
    void ComputeTransform();
private:
    int N20, N21;    // slab indices
    ArrayPage * page_line, * next_page_line;
    void ReadPageLine(ArrayPage * line, int i2, int i3);
    void WritePageLine(ArrayPage * line, int i2, int i3);
};

void SlabFFT1::ComputeTransform()
{
    ReadPageLine(page_line, N20, 0);
    for (int i2 = N20; i2 < N21; i2 ++)
        for (int i3 = 0; i3 < N3; i3 ++)
        {
            int L2 = i2;
            int L3 = i3 + 1;
            if (L3 == N3)
                { L3 = 0; L2 ++; }
            if (L2 != N21)
                ReadPageLine(next_page_line, L2, L3);
            FFTW1(page_line);
            // next_page_line has been read
            WritePageLine(page_line, i2, i3);
            page_line = next_page_line;
        }
}

Figure 11: Fourier transform of a slab. page_line and next_page_line are two local RAM buffers, 4 GB each. The iterations are sequential, and next_page_line is read while page_line is being transformed using the FFTW1 function, which computes 128² 1D FFTs using the FFTW library.
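The surrounding prose of this example and the definition of Array::FFT1 are not included in this extract. As a rough sketch only, assuming the slab decomposition splits the i2 dimension of the page array evenly among the CPU hosts, the method might be organized along these lines:

void Array::FFT1(int number_of_cpus, Host ** cpus)
{
    int slab_size = N2 / number_of_cpus;    // assumed even split of the i2 dimension
    SlabFFT1 ** slab = new SlabFFT1 * [number_of_cpus];
    // one SlabFFT1 object is constructed on each CPU host
    for (int i = 0; i < number_of_cpus; i ++)
        slab[i] = new(cpus[i]) SlabFFT1(this, i * slab_size, (i + 1) * slab_size);
    // remote, asynchronous, in parallel: every host transforms its own slab
    for (int i = 0; i < number_of_cpus; i ++)
        slab[i]->ComputeTransform();
}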
of the computation, which is analyzed and mapped by the compiler onto the interconnect fabric. The network is not merely a collection of passive data pipes between processing nodes, but is a key component, which, together with the processors, makes up a computer. Based on this architecture it is now possible to design an operating system, develop new hardware and build a multiprocessor computer (see section 6).

5.2 The prototype

We built a prototype compiler, called PCPP, and a runtime system (see Figure 14) for parallel C++. The compiler translates parallel C++ into C++ code, which is compiled and linked against the runtime library to obtain an executable.

Figure 14: Prototype compiler PCPP. PCPP translates parallel C++ into C++ code, which is compiled and linked against the runtime library to obtain an MPI executable.

5.2.1 The runtime library

The runtime library implements virtual hosts as agents that execute IR instructions. All messages between agents are serialized IR instructions, and for that purpose the runtime library contains a simple serialization layer. An agent is implemented as an MPI process with multiple threads: a dispatcher thread and a pool of worker threads. The dispatcher thread receives an incoming message, unserializes it into an IR instruction and assigns it to a worker thread for execution. Each worker thread maintains a job queue of IR instructions; the pool of worker threads is not limited, and can grow dynamically. Every worker thread is either processing its job queue, is suspended and waiting to be resumed, or is idle and available to work. An execution of an IR instruction typically involves execution of the application's code and may result in new IR instructions being sent over the network. We used one dedicated worker thread in every agent to serialize and send IR instructions to their destination agents.
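A minimal sketch of this agent structure, reduced to a single worker thread and with a hypothetical Transport interface standing in for the MPI-based transport library described below (shutdown, error handling and the dynamic worker pool are omitted):

#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

class Instruction                         // base class for serializable IR instructions
{
public:
    virtual ~Instruction() {}
    virtual void execute() = 0;           // runs application code, may emit new instructions
};

class Transport                           // stands in for the MPI-based transport library
{
public:
    virtual ~Transport() {}
    virtual std::unique_ptr<Instruction> receive() = 0;  // blocks, unserializes one message
};

class Agent
{
public:
    explicit Agent(Transport & t)
        : transport(t), dispatcher(&Agent::Dispatch, this), worker(&Agent::Work, this) {}
private:
    void Dispatch()                       // dispatcher thread: receive and enqueue instructions
    {
        for (;;)
        {
            std::unique_ptr<Instruction> instruction = transport.receive();
            std::lock_guard<std::mutex> lock(m);
            jobs.push(std::move(instruction));
            cv.notify_one();
        }
    }
    void Work()                           // worker thread: execute queued instructions
    {
        for (;;)
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return !jobs.empty(); });
            std::unique_ptr<Instruction> instruction = std::move(jobs.front());
            jobs.pop();
            lock.unlock();
            instruction->execute();
        }
    }
    Transport & transport;
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::unique_ptr<Instruction>> jobs;
    std::thread dispatcher, worker;
};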
We used a small number of basic MPI commands to implement a transport library for agents' communications, and to launch agents on remote hosts as MPI processes. All of the MPI functionality used in the prototype is encapsulated in the transport library and can be easily replaced.

5.2.2 The PCPP compiler

PCPP is a source-to-source translation tool which works with a subset of the C++ grammar. It is built using the Clang library tools. (Clang is the front end of the LLVM compiler infrastructure software [14].)

PCPP transforms the main of the input program into a stand-alone class, the application's main class. It generates a new main program which initializes the runtime system, constructs a virtual host and constructs the application's main object on it. Next, the new main reverses these actions, destroying the application's main object on the virtual host, destroying the virtual host and shutting down the runtime system.
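Schematically, the generated main might look as follows; the Runtime, Host and ApplicationMain identifiers are illustrative, not the names actually emitted by PCPP:

// generated by PCPP from the application's original main
int main(int argc, char ** argv)
{
    Runtime::Initialize(argc, argv);               // initialize the runtime system
    Host * host = new Host();                      // construct a virtual host
    ApplicationMain * app =
        new(host) ApplicationMain(argc, argv);     // construct the application's main object on it
    delete app;                                    // destroy the application's main object
    delete host;                                   // destroy the virtual host
    Runtime::Shutdown();                           // shut down the runtime system
    return 0;
}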
PCPP translates all pointers to remote pointer objects. For every class of the application PCPP generates IR instructions for its object operations (constructors, destructors and methods). Additionally, PCPP replaces calls to object operations with code that serializes the parameters and sends them with the corresponding instruction to the destination agent. For example, when a constructor is invoked, one of the serialized parameters is a remote pointer containing the address of the result variable, which is a remote pointer variable that should be assigned with the result of the constructor. The PCPP-generated IR instruction is a serializable class, derived from the base instruction class defined in the runtime library. When this instruction is received by the destination agent, it is unserialized and its execute method is invoked. This method constructs a local object using the unserialized parameters and generates an IR instruction to copy the object pointer to the result variable on the source agent.

For causality enforcement we implemented a simple guard object, based on the condition_variable of the C++11 standard library. PCPP generates a guard object for every output variable of a remote operation. A wait method on the guard object suspends the executing thread until a release method is called on the same guard object by another thread. A remote pointer to this guard object is sent to the destination agent. When the destination agent completes the operation it sends an IR instruction to the source agent to release the guard. The wait call is inserted in the application code just before the value of the output variable is used.
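A guard with this behavior can be written in a few lines of standard C++11. The sketch below illustrates the idea; it is not the actual PCPP runtime class:

#include <condition_variable>
#include <mutex>

class Guard
{
public:
    Guard() : released(false) {}
    void wait()                  // suspends the calling thread until release() is called
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return released; });
    }
    void release()               // invoked when the remote operation has completed
    {
        {
            std::lock_guard<std::mutex> lock(m);
            released = true;
        }
        cv.notify_all();
    }
private:
    std::mutex m;
    std::condition_variable cv;
    bool released;
};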
We used PCPP to design the runtime system architecture and to run experiments with parallel C++. Substantial work needs to be done to produce a fully functional compiler. A full-fledged formal IR language has to be developed, and multiple compiler back ends implemented for various hardware architectures. Because the programming model makes no distinction between local and remote pointers, compiler optimization of the code is needed to make local computations efficient. It is likely that this problem can also be addressed with an appropriate hardware design. Causality enforcement requires implementation of control flow graph analysis and loop optimization, such as those implemented in parallelizing compilers. A conservative implementation would inhibit parallel computation and would alert the user whenever dependence analysis fails, allowing the user to change the code accordingly.

6 Conclusion: the road to a multiprocessor computer

communicating asynchronously, and designed with hardware support for running virtual hosts. The NoC has very high bandwidth, but congestion control is still a problem. It has been shown that application awareness significantly improves congestion control in a NoC, because better throttling decisions can be implemented [18]. In present systems such application awareness is very difficult to implement, but the object-oriented framework provides the concept of an application, and the operating system can be tasked with tracking the application's objects and throttling the application when appropriate.

Building a multiprocessor computer is an enormous task. It is not possible to discuss all of the important aspects of this project in a few pages, but for the first time in this paper we have described the key ideas that make it possible.