A Solution To The Problem of Parallel Programming: November 2018

Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/329196369

A solution to the problem of parallel programming

Preprint · November 2018

CITATIONS READS

0 1,887

1 author:

Edward Givelberg
The Divine Research Academy
12 PUBLICATIONS   156 CITATIONS   

SEE PROFILE

All content following this page was uploaded by Edward Givelberg on 26 November 2018.

The user has requested enhancement of the downloaded file.


A solution to the problem of
parallel programming
Edward Givelberg
[email protected]

November 21, 2018

T
he problem of parallel programming is the 1 Introduction
most important open problem of computer
engineering. We show that object-oriented Parallel programming is considered to be very diffi-
languages, such as C++, can be interpreted as cult. Over the past several decades countless parallel
parallel programming languages, and standard se- programming languages, specialized systems, libraries
quential programs can be parallelized automati- and tools have been created. Some of these, such as
cally. Parallel C++ code is typically more than ten Google’s MapReduce, provided adequate answers for
times shorter than the equivalent C++ code with specific problem domains, however no general solution
MPI. The large reduction in the number of lines of has emerged. In this paper we show that, surprisingly, a
code in parallel C++ is primarily due to the fact simple and general solution to this problem exists based
that communications instructions, including pack- on parallel interpretation of established and widely
ing and unpacking of messages, are automatically used object-oriented languages, such as C++, Java
generated in the implementation of object opera- and Python. We propose to incrorporate the parallel
tions. We believe that implementation and stan- interpretation into the standards of object-oriented
dardization of parallel object-oriented languages programming languages. Programmers competent in
will drastically reduce the cost of parallel program- these languages will be able to write parallel code with
ming. This work provides the foundation for build- ease, and to translate it to an executable object simply
ing a new computer architecture, the multiproces- by specifying an appropriate compiler flag.
sor computer, including an object-oriented oper- Practically all computing devices today have multiple
ating system and more energy-efficient, and eas- processors. By 2004 the speed of individual processors
ily programmable, parallel hardware architecture. reached a peak and today faster computation is possi-
The key software component of this architecture is ble only by deploying multiple processors in parallel
a compiler for object-oriented languages. We de- on the given task. This has proved to be very difficult,
scribe a novel compiler architecture with a dedi- even when the task is in principle trivially parallelizable.
cated back end for the interconnect fabric, making Parallel programming is slow, tedious and requires ad-
the network a part of a multiprocessor computer, ditional expertise; debugging is extremely difficult. In
rather than a collection of pipes between proces- short, the cost of parallel programming is very high. It
sor nodes. Such a compiler exposes the network is now the main obstacle to the solution of computa-
hardware features to the application, analyzes its tional problems. We are still seeing improvements in
network utilization, optimizes the application as a hardware, and although the speed of individual pro-
whole, and generates the code for the interconnect cessors remains bounded, networking bandwidth is
fabric and for the processors. Since the informa- expected to increase [2], and the cost of hardware
tion technology sector’s electric power consump- components continues to decline. It is possible to build
tion is very high, and rising rapidly, implementa- computing systems with very large number of cheap
tion and widespread adoption of multiprocessor components, but the hardware cannot be adequately
computer architecture will significantly reduce the utilized without parallel programming. Almost all of
world’s energy consumption. software is sequential. We do not know how to build
easily programmable parallel computers. This paper
A solution to the problem of
parallel programming

lays out a plan for doing this. number of possible time-ordered sequences of events.
After 2004 CPU manufacturers started producing The concept of a computational process is not a suitable
multi-core CPUs in an attempt to boost performance abstraction for parallel programming.
with parallel computing, but their primary focus moved The solution lies in object-level parallelism. We per-
to improving the energy efficiency of computation. The ceive the world primarily as a world of objects, a world
energy cost can be significantly reduced when multiple where a multitude of objects interact simultaneously.
cores and shared memory are placed on the same inte- From the programmer’s point of view interactions be-
grated circuit die. The processing cores and the RAM tween objects are meaningful and memorable, unlike
are typically interconnected using a bus, which be- interactions of processes exchanging messages. Object-
comes a bottleneck if the number of cores is increased. oriented languages have been around for decades and
In practice, this architecture is limited to about a dozen today they are the most widely used programming lan-
cores. This problem is resolved by building a network guages in practice, yet, remarkably, all existing object-
on the chip (NoC), which makes CPUs with thousands oriented programming languages are sequential. The
of cores possible. Such systems have been shown to be inherent natural parallelism of the object-oriented pro-
very energy efficient, but programming them is diffi- gramming paradigm has not been understood. We
cult, and they are still in the early stages of research do not know of any proposal to interpret an existing
and development. Presently, the information technol- object-oriented programming language as a parallel
ogy sector consumes approximately 7% of the global programming language.
electricity [7], and this number is growing rapidly. A In section 2.1 we define an abstract framework for
solution to the problem of parallel programming is parallel object-oriented computation. Throughout the
likely to lead to substantial improvement in the energy rest of this paper we use C++, but our results apply
efficiency of computing systems, and to significantly to a variety of object-oriented languages. In section
affect the world’s energy consumption. 3 we show that C++ can be interpreted as a parallel
It is probably impossible to survey all of the ideas programming language within this framework without
that have been proposed to tackle the problem of paral- any change to the language syntax. Parallel interpreta-
lel programming, but the two most important models tion of C++ differs from the standard sequential C++,
are shared memory and message passing. The shared but in section 3.3 we describe a new technique for au-
memory model is based on the naive idea that the com- tomatic parallelization of sequential code. Standard
putation should be constructed using common data sequential C++ code can be ported to run on paral-
structures, which all computing processes use simulta- lel hardware either automatically or with a relatively
neously. Today, most of the parallel computing software small programming effort.
uses threads, which are sequential processes that share In section 4 we show that parallel C++ is a power-
memory. The problems with threads are eloquently ful and intuitive language where object-oriented fea-
described in [15]. Threads is a terrible programming tures are more naturally expressed than in standard
model. Despite this, and the fact that shared memory is sequential C++. In our experience programs written
not scalable, programming frameworks with artificially in parallel C++ are at least ten times shorter than
generated global address space have been proposed. the equivalent programs written in C++ with MPI.
The message passing programming paradigm, on the The large reduction in the number of lines of code in
other hand, acknoweledges that processes generally parallel C++ is primarily due to the fact that communi-
reside on distant CPUs and need to exchange messages. cations instructions, including packing and unpacking
Most of the scientific computing software uses the Mes- of messages, are automatically generated in the imple-
sage Passing Interface (MPI), which over the years has mentation of object operations.
become a bloated collection that includes both low- For decades the central processing unit (CPU) was
level messaging routines of various kinds and library the most important component of the computer. Con-
functions encoding complex algorithms on distributed tinuous improvements in application performance
data structures. Large-scale computations attempting came mainly as a result of engineering increasingly
to maximize the performance of a cluster of multicore faster CPUs. As a result, the complexity of applica-
CPUs need to combine shared memory and message tions is typically measured by estimating the number
passing, incurring significant additional programming of multiplications they perform. In a multiprocessor
cost. system the CPU is no longer central, and the cost of
Shared memory and message passing share a com- moving data is not negligible with respect to the cost
mon fundamental flaw: both approaches use processes, of multiplication, yet the prevailing view continues to
and co-ordinating multiple concurrent processes in a be that all of the important work done by an applica-
computation is a nearly impossible programming task. tion is carried out inside the CPUs, and the network is
We draw an analogy to the problem with the goto merely a collection of pipes between them. This state
statement, which was discredited in the 1960s and of technology is in a stark contrast to observations in
1970s. Unrestrained use of goto leads to an intractable neuroscience which suggest that the network is much
number of possible execution sequences. Similarly, a more important than the individual processors. Yet, the
collection of parallel processes generates an intractable mechanism of computation in the brain is not under-

Page 2 of 12
A solution to the problem of
parallel programming

stood, and there is presently no computing technology 2 An abstract framework for


which adequately utilizes networking. Computations
are carried out on clusters of processing nodes and it
object-oriented computing
is not known how to build a multiprocessor computer.
The object-oriented framework provides the founda- “Objects are like people. They’re living,
tion for multiprocessor computer architecture, and the breathing things that have knowledge inside
network naturally plays a central role. In this architec- them about how to do things and have mem-
ture network communications consist of instructions ory inside them so they can remember things.”
that implement object operations. This is very different (Steve Jobs)
from the amorphous stream of bits exchanged between
processes in a cluster because it makes possible the con-
2.1 The Model
struction of a parallel C++ compiler with a dedicated
back end for the interconnect fabric. The ability to gen- An object is an abstract autonomous parallel computing
erate network hardware instructions at compile time machine. An application is a collection of objects that
exposes the network hardware features to applications perform a computation by executing methods on each
and enables the compiler to optimize network commu- other. An object is represented by an agent, which is a
nications, and to generate faster, more efficient code. collection of processes that receive incoming method
In section 5.1 we outline the software architecture for execution requests, execute them and send the results
the parallel C++ compiler and runtime system. In sec- back to the client objects. An agent can process multi-
tion 5.2 we describe a simple prototype compiler for ple method execution requests simultaneously. It can
parallel C++ which we used in computational experi- also represent several objects simultaneously, effec-
ments. It was implemented using Clang/LLVM tools, tively implementing a virtual host where these objects
with the runtime system based on MPI. live.
Building a multiprocessor computer is an enormous An object is accessed via a pointer (sometimes re-
task. For the first time, in this paper, we describe the ferred to as a remote pointer, or a generalized pointer),
key ideas that make it possible. Today, scientific com- which contains the address of the virtual host repre-
puting applications are typically implemented with senting the object, as well as the address of the object
MPI and are run on large clusters using a batch job within the virtual host.
submission system. The user writes a job control script An application is started as a single object by the
and obtains an allocation of compute nodes for the operating system, which first creates a virtual host and
application. This development process resembles pro- then constructs the application object on it. The objects
gramming on IBM System/360 in the 1960s. Further- of the application may request the operating system to
more, it is ironic that despite the development of time create new virtual hosts and construct new objects on
sharing in computer systems during the past 50 years, them.
applications now get exclusive use of large portions
of a cluster. Such hardware utilization is very ineffi-
cient. Operating systems have been developed to run 2.2 Related concepts
multiple processes on a single processor, but there is
no operating system to run a single application on The abstract framework of section 2.1 is the founda-
multiple processors, except for the very limited case tion for all of the results of this paper. It is instructive
of threads on a multi-core CPU. Running multiple ap- to examine the shortcomings of some of the previous
plications, which share all of the computing system’s approaches. The C++ standard defines an object as
resources is an even more difficult problem. There is a region of storage [13]. Standards of other object
an obvious need to build an operating system for a mul- oriented languages avoid defining an object directly.
tiprocessor computer in order to improve utilization of In Python an object is described as an “abstraction for
the hardware resources, design computing hardware, data” [3]. The wikipedia entry for object [5] describes
which is more energy efficient, and to reduce the cost it as follows: “an object can be a variable, a data struc-
of development of (parallel) applications. ture, a function, or a method, and as such, is a value
in memory referenced by an identifier”.
We learned about the existence of the actor model
only after we completed the research of this paper.
The actor model [12] is an abstract model for dis-
tributed computing. Actors perform computations and
exchange messages, and can be used for distributed
computing with objects. A number of programming
languages employ the actor model, and many libraries
and frameworks have been implemented to permit
actor-style programming in languages that don’t have
actors built-in [4]. Yet, the usefulness of the actor

Page 3 of 12
A solution to the problem of
parallel programming

model has been limited, and it does not provide a so- 3.2 Causal asynchronous execution
lution to the problem of parallel programming. Again,
the reason is that processes exchanging messages is Sequential interpretation of an object-oriented lan-
not a suitable model for parallel programming (see guage precludes parallel computation with remote
Introduction). objects: whenever an object executes a method on
another (remote) object, it is obliged to wait for the
The difficulty in identifying the relevant abstract
completion of this operation before executing the next
concepts and understanding their significance can be
statement. While it is possible for several objects to
appreciated by the fact that 45 years of research have
simultaneously execute methods on a given object, this
failed to clarify the relationship between the actor
will never happen if the application is started as a single
model and object-oriented computing. The object-
object.
oriented model is a higher level of abstraction. The net-
Imagining the object as an intelligent, living, breath-
work, the computational processes and the messages
ing thing, it could proceed with its computation imme-
are not included in the model. They are “implementa-
diately after initiating remote method execution, and
tion details”. Yet, an object can do everything that an
stop to wait for its completion only when its results
actor can, so actors are useful only because they can
are needed. We call this causal asynchronous execu-
implement objects. In section 3 we show that in object-
tion. Causal asynchronous execution enables parallel
oriented languages remote objects can be constructed
computation and provides a natural way to co-ordinate
naturally, making the actor-model libraries obsolete.
objects, as shown in Figure 2.
The object-oriented model supersedes the actor model.
bool completed = remote_object->ExecuteMethod();
// do something while the method is being executed
3 Language interpretation SomeComputation();
// wait for (remote) method completion
We show that C++ code, without any changes to the if (completed)
{
language syntax, can be executed in parallel using the // method execution has completed
abstract framework described in section 2.1. AnotherComputation();
}

3.1 Remote objects Figure 2: Causal asynchronous execution. Checking the return value of a
remote method execution suspends the execution until the remote
method has completed. The execution of SomeComputation overlaps
with communications in the previous statement.
Host * host = new Host("machine1");
Object * object = new(host) Object(parameters);
result = object->ExecuteMethod(some, parameters); In this example the purpose of the if statement is to
suspend the execution of AnotherComputation until
Figure 1: Construction of an object on a remote hosts. First, a virtual host the value of the variable completed is set by the remote
host is constructed on a physical device machine1. The C++ method.
“placement new” operator constructs an object on host and returns
a remote pointer object, which is used to execute a method on the Despite its simplicity, parallel C++, i.e. C++ with
remote object. causal asynchronous interpretation, has great expres-
sive power and is sufficiently rich to implement the
The code in Figure 1 is compliant with the standard most complex parallel computations. The program-
C++ syntax. It creates a virtual host, constructs an mer constructs a parallel computation by co-ordinating
object on it and executes a remote method on that ob- high-level actions on objects, while the underlying net-
ject. We interpret all pointers as generalized pointers. work communications are generated by the compiler.
The virtual host object is provided by the operating Computation and communication overlap naturally, as
system, and is associated with a physical device. We in the example in Figure 2, and large, complex objects
use the “placement new” operator of C++ to construct can be sent over the network as parameters of remote
the object on a given virtual host. Using a general- object methods. This is ideally suited to utilizing high
ized pointer a method is executed on a remote object. network bandwidth and avoiding the latency penalty
By-value parameters are serialized and sent over the incurred by small messages. In section 3.4. we intro-
network to the remote host. Once the remote execu- duce additional mechanisms for fine-grained control
tion completes, the result is sent back. The treatment of parallelism.
of by-reference parameters is more complicated: the
simplest solution is to serialize the parameter, send it 3.3 Automatic code parallelization
to the remote host and, upon completion of the method
execution, to serialize it and to send it back. When The construction of virtual hosts described in section
the parameter is a complex object, and the changes 3.1 allows the programmer to control the placement
made by method execution are relatively small, there of remote objects, but this can be done automatically
may be a more efficient method to update the original by the operating system. Virtual hosts can be created
parameter object. implicitly by the compiler, and the operating system

Page 4 of 12
A solution to the problem of
parallel programming

void function() tion of the following statement starts (see example in


{ Figure 3). Similarly, we require that when the bar-
SomeComputation();
{ rier statement is the first statement in the iteration,
special_object->compute(); the preceding iteration must finish before the barrier
for (int i = 0; i < N; i ++) statement is executed. These definitions allow the pro-
object[i]->computation(); grammer to describe parallelism in detail, as shown by
}
AnotherComputation(); the examples in Figure 4.
} Clearly, the price for the introduction of mechanisms
for detailed control of parallelism is the increased dis-
Figure 3: Nested compound statement. The N + 1 statements in the nested
compound statement are executed in parallel after
tance in the semantic interpretation between parallel
SomeComputation has completed. AnotherComputation is C++ and standard sequential C++.
executed after the execution of all of these N + 1 statements has
completed.

can assign virtual hosts to physical processors, and


construct the application’s objects on these hosts at run 4 Parallel C++ examples
time.
The remote placement of objects can be applied to We present examples illustrating the expressive power
all programs running on the system. This transfor- of parallel C++. Our parallel C++ research started
mation alone does not parallelize program execution, with the problem of finding an efficient way to carry
but it is likely to improve the overall utilization of a out data-intensive computations, i.e. computations
multiprocessor system. where disks are used like RAM. With the changing
We can expect that any sufficiently complex sequen- ratio between the cost of arithmetic operations and
tial program will contain code sections that parallelize the cost of data movement (see [2]), new applications
automatically. In order to parallelize a given program become possible, but the challenge is to deploy the
remote object placement must be combined with causal hardware in parallel. The computation of a large 3D
asynchronous code execution. An object can avoid Fourier transform (see section 4.4) is a natural test
causality violation at runtime by simply never using problem where a significant part of the computation is
results of remote operations before they become avail- the movement of data. Our code was initially written in
able. Such design does not prevent potential deadlocks. C++ with MPI, but we found that the natural solution
Furthermore, remote pointers can be abused, so ulti- to the problem was to arrange the computation using
mately the programmer is responsible for the causality large objects that can be deployed on remote devices.
correctness of the code. There is, however, a wide With the introduction of parallel C++ the time for
range of use cases where causality can be enforced an implementation of such a solution is reduced from
by the compiler using control flow graph analysis (see many months to a few weeks.
examples in section 4).
Automatic parallelization is a difficult and active area
of research. Much of this work has focused on paral-
lelizing loop execution. Task parallelization typically 4.1 Array objects
requires the programmer to use special language con-
structs to mark the sequential code, which the compiler
The syntax of array operations applies naturally to
can then analyze for parallelization. The object-level
remote pointers. The array in the following example is
parallelization we introduced here requires no new
allocated on a remote host, and the array operations
syntax, and, arguably, well-structured serial object-
require sending the values of x and a[24] over the
oriented code can be parallelized either automatically,
network.
or with minimal programming effort.
double * a = new (remote_host) double([1024];
a[2] = 22.22 + x;
3.4 Detailed control of parallelism double z = a[24] + 3.1;

We now extend causal asynchronous execution to en-


able a more fine-grained control of parallelism. In
this model compound statements and iteration state-
ments are also executed asynchronously, and subject
to causality. In addition, we define the nested com- 4.2 MapReduce
pound statement to be a barrier statement. This means
that prior to its execution the preceding statements in A basic example of MapReduce functionality can be
the parent compound statement must finish execution, implemented with only a few lines of parallel C++
and its own execution must finish before the execu- code, as shown in Figure 5.

Page 5 of 12
A solution to the problem of
parallel programming

for (int i = 0; i < N; i ++) for (int i = 0; i < N; i ++)


{ {{
objectA[i]->computation(); objectA[i]->computation();
objectB[i]->computation(); objectB[i]->computation();
} }}

(a) parallel (b) sequential iterations


for (int i = 0; i < N; i ++) for (int i = 0; i < N; i ++)
{ {
objectA[i]->computation(); {
{ objectA[i]->computation();
objectB[i]->computation(); }
} objectB[i]->computation();
} }

(c) parallel iterations (d) sequential

Figure 4: Iteration statements examples.


(4a): potentially all 2N statements are executed in parallel.
(4b): all iterations are sequential, but each iteration has two potentially parallel statements.
(4c): potentially N iterations are executed in parallel, but each iteration has 2 statements that are executed sequentially.
(4d): all 2N computations are executed sequentially.

int NumberOfWorkers = 44444444;


Worker * workers[NumberOfWorkers];
for (int i = 0; i < NumberOfWorkers; i ++)
workers[i] = new (host[i]) Worker();
for (int i = 0; i < NumberOfWorkers; i ++)
result[i] = workers[i]->compute(data[i]);
double total = 0.0;
for (int i = 0; i < NumberOfWorkers; i ++)
total += result[i];

Figure 5: Example: MapReduce. The workers array is assigned in parallel, void Graph::BuildTree(VertexId root_id)
with each worker being constructed on its virtual host. The compute
methods are also executed in parallel. We rely on the compiler to {
enforce causality in the execution of the reduction loop. It starts int root_owner = VertexOwner(root_id);
executing only after result[0] becomes available, and it executes if (this->id() == root_owner)
sequentially, according to the definition of causal asynchronous frontier.push_back(v[root_id]);
execution. EdgeList * E = new EdgeList[N];
bool finished = false;
while (!finished)
The master process allocates workers on remote {
hosts, initiates a method execution on each worker SortFrontierEdges(E);
and sums up the result. If the data[i] object is not {
// remote, asynchronous, in parallel
located on host[i], it will be copied there over the net- for (int i = 0; i < N; i ++)
work. This code is shorter and easier to write than the graph[i]->SetParents(E[i]);
code that uses Google’s library. Moreover, as we show }
in section 5, the parallel C++ compiler may be able // finish BFS when all frontiers are empty
finished = true;
to generate more efficient code by optimizing network for (int i = 0; i < N; i ++)
operations. finished &= graph[i]->isEmptyFrontier();
}
}
4.3 Breadth-First Search on a large
graph Figure 6: Building the BFS tree. The frontier is initialized with the root vertex
by its owner. Iterations continue until all frontiers are empty. In
each iteration the local frontier edges are sorted into N EdgeList
Distributed BFS on a large graph is a standard bench- lists, one for each graph object. Communication of all the lists begins
mark problem [1]. We implemented a straightforward thereafter. After the execution of all SetParents methods has
finished all graph objects are asked if their frontier is empty.
algorithm in C++/MPI using over 2000 lines of code,
and in parallel C++ with less than 200 lines of code.
The graph data is divided into N objects, each con-
taining an array of vertices with a list of edges for each
vertex. We create N virtual hosts, one for each avail-
able processor, and allocate a graph object on each
host. The main object initiates the BFS by invoking the
BuildTree method on each graph object (see Figure
6). The computation proceeds with several (typically

Page 6 of 12
A solution to the problem of
parallel programming

Figure 7: The 8-node cluster used in the 3D Fourier transform computation.


Every node of the cluster is connected in parallel to 24 one-terabyte
hard drives, with 100 MB/sec read/write throughput for each drive.
The CPUs, with 12 cores and 48GB of RAM each, are interconnected
with a 10 Gb/sec Ethernet.

less than 15) synchronized iterations. Each graph ob-


ject keeps track of the local frontier, which is a set of
vertices on the current boundary that have not been Figure 8: Fourier transform computation. Lines of array pages are loaded into
visited yet. The graph object that owns the root ver- RAM buffers, transformed and written back. For the first and the
second dimension additional transpose operations are needed.
tex initializes its local frontier with the root vertex. In
each iteration the local frontier edges are sorted into
N lists, one for each graph object. The vertex on the
other end of each frontier edge becomes the child in We used the other 4 nodes to run 16 processes of the
the tree, unless it was visited before. The new fron- Fourier transform, 4 processes per node. An Array
tier set consists of the new children. To set the parent object (see Figure 9) is logically a pointer: it is a small
links and to update the frontier set every graph object object which is copied to all processes working on the
executes a method on every other graph object, send- array. Array pages are allocated on 96 hard drives,
ing it the corresponding list of edges. This is done in using virtual hosts for storing persistent objects, which
the SetParents method, whose parameter is a large are implemented in the Device class. We ran two
object of type EdgeList, which is serialized and sent Device agents on each CPU core, each Device using a
over the network. The calls to SetParents execute in single hard drive. In order to maximize the utilization
parallel after the completion of SortFrontierEdges. of CPUs, network bandwidth and the total available
BFS iterations stop when all local frontiers are empty. disk throughput, array pages were allocated in circu-
We used N 2 messages to set the values of the finished lant order (see Figure 9).
variables. This could be more conveniently achieved
The Fourier transform is computed by loading, and
using an allreduce library function, like those imple-
when necessary transposing, lines of 128 pages into
mented in MPI. In parallel C++ such functions could
4GB RAM buffers, performing 1282 one-dimensional
be, for example, implemented in the standard library
transforms in each buffer, and writing the contents
using specialized containers for collective operations.
back to hard drives (see Figure 8). We illustrate the
Notice that the while iterations are executed sequen-
computation of the Fourier transform in the first dimen-
tially because they causally depend on the value of
sion. The processes computing the Fourier transform
finished.
in the first dimension are implemented in the SlabFFT1
class. Each of the 16 SlabFFT1 objects was assigned
4.4 3D Fourier transform an array slab of 128 × 8 × 128 pages to transform it
line by line in 8 × 128 = 1024 iterations (see Figure
We computed the Fourier transform of a 64 TB array 10). The SlabFFT1 objects are independent of each
of 163843 complex double precision numbers on an other (see Figure 11), but they compete for service
8-node cluster shown in Figure 7. The total computa- from the 96 hard drives, and they share the network
tion time was approximately one day, and it could be bandwidth. The SlabFFT1 process overlaps reading a
significantly improved with code optimization. More page line with FFTW1 function, which computes 1282
importantly, the hardware system could be redesigned 1D FFTs using the FFTW library [10]. The 16 SlabFFT1
to achieve a better balance between the components. processes use the ReadPageLine method in parallel,
Using more powerful hardware components a similar and each ReadPageLine call reads 128 pages in paral-
computation can be carried out inexpensively with a 2 lel from the hard drives storing the array, and copies
PB array on a suitably configured small cluster. We im- them over the network into the RAM buffer of the
plemented the Fourier transform using approximately SlabFFT1 object. Figure 12 shows two implementa-
15,000 lines of C++ code with MPI. The equivalent tions of ReadPageLine, demonstrating a very easy way
parallel C++ code is about 500 lines. to shift computation among processors. We used the
We used 4 of the cluster nodes to store the input first implementation in our computation in order to
array, dividing it into 1283 pages of 1283 numbers each. offload some of the work from the SlabFFT1 processes.

Page 7 of 12
A solution to the problem of
parallel programming

class SlabFFT1
{
public:
SlabFFT1(Array * array, int N20, int N21);
void ComputeTransform();
class Array private:
{ int N20, N21; // slab indices
public: Page * page_line, * next_page_line;
Array(Domain * ArrayDomain, Domain * PageDomain); void ReadPageLine(
~Array(); ArrayPage * line, int i2, int i3
void allocate(int number_of_devices, Device * d); );
void FFT1(int number_of_cpus, Host ** cpus); void WritePageLine(
private: ArrayPage * line, int i2, int i3
Domain * ArrayDomain; );
Domain * PageDomain; };
ArrayPage * *** page; // 128^3 pointers
}; void SlabFFT1::ComputeTransform()
{
void ReadPageLine(page_line, N20, 0);
Array::allocate(int number_of_devices, Device * d) for (int i2 = N20; i2 < N21; i2 ++)
{ for (int i3 = 0; i3 < N3; i3 ++)
for (int j1 = 0; j1 < N1; j1 ++) {
for (int j2 = 0; j2 < N2; j2 ++) {
for (int j3 = 0; j3 < N3; j3 ++) int L2 = i2;
{ int L3 = i3 + 1;
int k = (j1 + j2 + j3) % number_of_devices; if (L3 == N3)
page[j1][j2][j3] = { L3 = 0; L2 ++; }
new(d[k]) ArrayPage(n1, n2, n3); if (L2 != N21)
} ReadPageLine(next_page_line, L2, L3);
} FFTW1(page_line);
} // next_page_line has been read
WritePageLine(page_line, i2, i3);
Figure 9: The Array class. Domain is a helper class describing 3D page_line = next_page_line;
subdomains of an array. ArrayPage is a small 3D array, which
implements local array operations, such as transpose12 and
}
transpose13 methods. These operations are needed in the Fourier }
transform computation. Global array operations are implemented
using the local methods of ArrayPage. Array pages are allocated in
circulant order. The allocate method constructs N1 × N2 × N3 Figure 11: Fourier transform of a slab.
array pages of size n1 × n2 × n3 on a list of virtual devices. The page_line and next_page_line are 2 local RAM buffers, 4 GB
dimensions are obtained from ArrayDomain and PageDomain, and each. The iterations are sequential, and next_page_line is read
in our case are all equal 128. while page_line is being transformed using the FFTW1 function,
which computes 1282 1D FFTs using the FFTW library.

The Fourier transform main (see Figure 13) creates


96 virtual devices for array storage, one on each hard
drive of the 4 storage nodes. The array object is created
and array pages are allocated. Next, 16 virtual hosts
are created on the 4 computing nodes and the Fourier
transform computation is performed using these virtual
hosts.
void Array::FFT1(int number_of_cpus, Host ** cpu)
{
int slab_width = N2 / number_of_cpus;
SlabFFT1 ** slab_fft = 5 Compilation and Runtime
new SlabFFT1 * [number_of_cpus];
for (int i = 0; i < number_of_cpus; i ++)
slab_fft[i] =
5.1 Compiler architecture
new(cpu[i]) SlabFFT1(this, i * slab_width,
(i + 1) * slab_width); The object-oriented framework of section 2.1 implicitly
for (int i = 0; i < n; i ++) restricts network communications to implementation of
slab_fft[i]->ComputeTransform(); object operations. Object operations can be described
}
by an intermediate representation (IR) language. We
devised a rudimentary IR for our compiler prototype
Figure 10: Fourier transform of an array. SlabFFT1 objects are constructed
(in parallel) on remote processors, each SlabFFT1 is assigned a (see section 5.2), where most of the IR instructions
slab of the array. The 16 SlabFFT1 objects compute the transforms were generated by the compiler. Here are three exam-
in parallel.
ples of IR instructions: remote copy a block of memory,
initiate a method execution on an object, notify an
agent that a remote execution has completed. IR code
can be translated by the compiler into instructions for
the network hardware and for the CPUs. It can also be

Page 8 of 12
A solution to the problem of
parallel programming

used for compile-time analysis of network utilization,


as well as optimization of the system’s performance as a
whole. It is therefore natural to implement a dedicated
compiler backend for the interconnect fabric.
void SlabFFT1::ReadPageLine(ArrayPage * page_line, The interconnect hardware instruction set is not re-
int i2, int i3)
{
stricted to sending and receiving messages. In the
for (int i1 = 0; i1 < N1; i1 ++) Mellanox InfiniBand, for example, processing is done
{ in network interface cards and network switches. The
page[i1][i2][i3]->transpose13(); following two examples illustrate the potential advan-
page_line[i1] = *page[i1][i2][i3];
}
tages of compiler-generated networking instructions
} for this network.
Applications must use large messages to avoid the
void SlabFFT1::ReadPageLine(ArrayPage * page_line,
int i2, int i3)
latency penalty and to utilize the network bandwidth.
{ As a result, a lot of code (and some processing power)
for (int i1 = 0; i1 < N1; i1 ++) is devoted to packing and unpacking messages. The
{ User-mode Memory Registration (UMR) feature of Mel-
page_line[i1] = *page[i1][i2][i3];
page_line[i1]->transpose13(); lanox InfiniBand can support MPI derived datatype
} communication, which may reduce some of this over-
} head [16], but it requires the programmer to duplicate
datatype definitions, in order to inform the MPI library
Figure 12: Two possible implementations of ReadPageLine. In our about the datatypes used in the program. In paral-
computation we used the first implementation, where the transpose
is performed “close to the data” by the agent storing the page. In lel C++ this information is available to the compiler,
the second implementation the transpose would be performed by which can generate the UMR instructions.
SlabFFT1 after it reads the page.
Another example of in-network processing is the
Scalable Hierarchical Aggregation and Reduction Pro-
tocol (SHARP) of Mellanox InfiniBand [11] that of-
floads the computation of collective operations, such
as barrier and broadcast, to the switch network, elimi-
nating the need to send data multiple times between
endpoints. The SHARP hardware capabilities are cur-
rently accessed by the user only indirectly via a com-
munications library, like MPI. However, the compiler is
int main() potentially capable of generating detailed and efficient
{ routing and aggregation instructions for very complex
int number_of_disks = 96;
Device ** hdd = new Device *[number_of_disks]; code. Perhaps the simplest example is the following
{ variant of the broadcast statement, where a large num-
for (int i = 0; i < number_of_disks; i ++) ber of objects a[i] are located on some subset of the
hdd[i] = new Device("hard drive i");
system’s processors:
}
int n = 128; // a[i] are remote objects
Domain page_domain(n, n, n); for (int i = 0; i < N; i ++)
int N = n * n; a[i] = b;
Domain array_domain(N, N, N);
Array * a = new Array(array_domain, page_domain);
a->allocate(number_of_disks, hdd);
The last example suggests that the development of an
int number_of_cpus = 16; optimizing compiler targeting network hardware may
Host ** cpu = new Host *[number_of_cpus]; lead to improved network hardware design. In that re-
{ spect, an especially important example of a compilation
for (int i = 0; i < number_of_cpus; i ++)
cpu[i] = new Host("address of cpu i"); target is a many-core processor with a network-on-chip
} (NoC), such as the Tile processor [8]. Such processors
a->FFT1(number_of_cpus, cpu); can now be designed to optimally execute IR code.
}
Presently, computations on distributed systems use
processes that exchange messages, which are arbitrary
Figure 13: Fourier transform main constructs in parallel 96 virtual devices
for array storage, one on each hard drive of the 4 storage nodes. collections of bits. The programming framework pro-
The array object is created and array pages are allocated. The 16 vides the communication libraries with almost no mean-
virtual hosts are created on the 4 computing nodes after the page
allocation has completed, and the Fourier transform computation ingful information about the messages. The derived
starts after the construction of virtual hosts. data types mechanism in MPI, mentioned in the exam-
ple above, is merely an awkward attempt to extract a
small amount of such information from the program-
mer. We have described a software architecture where
the application’s communications are an integral part

Page 9 of 12
A solution to the problem of
parallel programming

of the computation, which is analyzed and mapped We used a small number of basic MPI commands to
by the compiler onto the interconnect fabric. The net- implement a transport library for agents’ communica-
work is not merely a collection of passive data pipes tions, and to launch agents on remote hosts as MPI
between processing nodes, but is a key component, processes. All of the MPI functionality used in the pro-
which together with the processors, makes up a com- totype is encapsulated in the transport library and can
puter. Based on this architecture it is now possible to be easily replaced.
design an operating system, develop new hardware
and build a multiprocessor computer (see section 6). 5.2.2 The PCPP compiler
PCPP is source-to-source translation tool which works
5.2 The prototype with a subset of the C++ grammar. It is built using
We built a prototype compiler, called PCPP, and a run- the Clang library tools. (Clang is the front end of the
time system (see Figure 14) for parallel C++. The com- LLVM compiler infrastructure software [14].)
piler translates parallel C++ into C++ code, which PCPP transforms the main of the input program into
is compiled and linked against the runtime library to a stand-alone class, the application’s main class. It
obtain an executable. generates a new main program which initializes the
runtime system, constructs a virtual host and constructs
the application’s main object on it. Next, the new
main reverses these actions, destroying the applica-
tion’s main object on the virtual host, destroying the
virtual host and shutting down the runtime system.
PCPP translates all pointers to remote pointer ob-
jects. For every class of the application PCPP generates
IR instructions for its object operations (constructors,
destructors and methods). Additionally, PCPP replaces
calls to object operations with code that serializes the
parameters and sends them with the corresponding in-
struction to the destination agent. For example, when a
constructor is invoked, one of the serialized parameters
is a remote pointer containing the address of the result
variable, which is a remote pointer variable that should
be assigned with the result of the constructor. The
PCPP-generated IR instruction is a serializable class,
Figure 14: Prototype compiler PCPP. PCPP translates parallel C++ into C++
code, which is compiled and linked against the runtime library to
derived from the base instruction class defined in the
obtain an MPI executable. runtime library. When this instruction is received by
the destination agent, it is unserialized and its execute
method is invoked. This method constructs a local ob-
ject using the unserialized parameters and generates
5.2.1 The runtime library an IR instruction to copy the object pointer to the result
variable on the source agent.
The runtime library implements virtual hosts as agents For causality enforcement we implemented a simple
that execute IR instructions. All messages between guard object, based on the condition_variable of
agents are serialized IR instructions, and for that pur- the C++11 standard library. PCPP generates a guard
pose the runtime library contains a simple serialization object for every output variable of a remote operation.
layer. An agent is implemented as an MPI process with A wait method on the guard object suspends the exe-
multiple threads: a dispatcher thread and a pool of cuting thread until a release method is called on the
worker threads. The dispatcher thread receives an in- same guard object by another thread. A remote pointer
coming message, unserializes it into an IR instruction to this guard object is sent to the destination agent.
and assigns it to a worker thread for execution. Each When the destination agent completes the operation it
worker thread maintains a job queue of IR instructions, sends an IR instruction to the source agent to release
however the pool of worker threads is not limited, and the guard. The wait call is inserted in the application
can grow dynamically. Every worker thread is either code just before the value of the output variable is used.
processing its job queue, is suspended and waiting to We used PCPP to design the runtime system archi-
be resumed, or is idle and available to work. An exe- tecture and to run experiments with parallel C++.
cution of an IR instruction typically involves execution Substantial work needs to be done to produce a fully
of the application’s code and may result in new IR in- functional compiler. A full-fledged formal IR language
structions being sent over the network. We used one has to be developed and a multiple of compiler back
dedicated worker thread in every agent to serialize and ends implemented for various hardware architectures.
send IR instructions to their destination agents. Because the programming model makes no distinc-

Page 10 of 12
A solution to the problem of
parallel programming

tion between local and remote pointers, compiler op- municating asynchronously, and designed with hard-
timization of the code is needed to make local com- ware support for running virtual hosts. The NoC has
putations efficient. It is likely that this problem can very high bandwidth, but congestion control is still a
also be addressed with an appropriate hardware de- problem. It has been shown that application awareness
sign. Causality enforcement requires implementation significantly improves congestion control in a NoC be-
of control flow graph analysis and loop optimization, cause better throttling decisions can be implemented
such as those implemented in parallelizing compilers. [18]. In present systems such application awareness
A conservative implementation would inhibit parallel is very difficult to implement, but the object-oriented
computation and would alert the user whenever de- framework provides the concept of an application, and
pendence analysis fails, allowing the user to change the operating system can be tasked with tracking ap-
the code accordingly. plication’s objects and throttling the application when
appropriate.
Building a multiprocessor computer is an enormous
6 Conclusion: task. It is not possible to discuss all of the important
aspects of this project in a few pages, but for the first
the road to a multiprocessor time in this paper we have described the key ideas that
computer make it possible.

We have defined a framework for object-oriented com-


puting and have shown that object-oriented languages
can be interpreted in this framework as parallel pro-
gramming languages. Parallel C++, for example, is a
very powerful language, which is accessible and nat-
ural for programmers who are proficient in standard
C++. We believe that implementation and standard-
ization of parallel C++ will drastically reduce the cost
of parallel programming. Furthermore, we have shown
that standard sequential C++ programs can be paral-
lelized automatically, potentially sped up and ported
to more energy efficient parallel computing hardware.
The object-oriented computing framework provides
the foundation for a new computer architecture, which
we call a multiprocessor computer. A parallel C++
compiler with a back end that generates code for the
interconnect fabric is a first step towards developing an
object-oriented operating system and designing new
hardware architecture.
Many object-oriented operating systems have been
built [6], but these projects used standard sequential
object-oriented languages, and were not aimed at de-
signing an operating system for a multiprocessor com-
puter. Files and processes are the fundamental building
blocks of operating systems used today. Both of these
abstractions need to be replaced with the concept of an
object; files should be replaced with persistent objects.
Objects, virtual hosts, virtual devices and applications,
defined in section 2.1, are some of the fundamental
entities an object-oriented operating system needs to
be based on. It is now possible to design an operating
system that enables multiple applications to simulta-
neously share the network, the processors, and all of
the system’s resources. This will significantly improve
hardware utilization and reduce the energy cost of
computations.
Processors with a large number of cores and a net-
work on chip (NoC) are very energy efficient [9], but
are very difficult to program [17]: the typical approach
is to use a communication library with message passing
processes. We propose to use lightweight cores, com-

Page 11 of 12
A solution to the problem of
parallel programming

References [12] Carl Hewitt, Peter Bishop, and Richard Steiger. A


universal modular actor formalism for artificial in-
[1] graph500 benchmark. http://www.graph500. telligence. In Proceedings of the 3rd International
org/. Accessed: 2018-07-30. Joint Conference on Artificial Intelligence, IJCAI’73,
pages 235–245, San Francisco, CA, USA, 1973.
[2] InfiniBand® Roadmap. http://www. Morgan Kaufmann Publishers Inc.
infinibandta.org/content/pages.php?pg=
technology_overview. Accessed: 2018-07-27. [13] Programming Language C++. [Working draft].
Standard, International Organization for Stan-
[3] The Python Language Reference. https://docs. dardization (ISO), Geneva, Switzerland, (2014).
python.org/2/reference/index.html. Ac-
cessed: 2018-11-04. [14] Chris Lattner and Vikram Adve. LLVM: A com-
pilation framework for lifelong program analy-
[4] Wikipedia entry: Actor model. https://en. sis & transformation. In Proceedings of the In-
wikipedia.org/wiki/Actor_model. Accessed: ternational Symposium on Code Generation and
2018-11-04. Optimization: Feedback-directed and Runtime Op-
timization, CGO ’04, pages 75–, Washington, DC,
[5] Wikipedia entry: Object (computer science).
USA, 2004. IEEE Computer Society.
https://en.wikipedia.org/wiki/Object_
(computer_science). Accessed: 2018-11-04. [15] Edward A. Lee. The problem with threads. Com-
puter, 39(5):33–42, May 2006.
[6] Wikipedia entry: Object oriented operating
system. https://en.wikipedia.org/wiki/ [16] M. Li, K. Hamidouche, X. Lu, J. Zhang, J. Lin, and
Object_(computer_science)https://en. D. K. Panda. High performance OpenSHMEM
wikipedia.org/wiki/Object-oriented_ strided communication support with infiniband
operating_system. Accessed: 2018-11-14. umr. In 2015 IEEE 22nd International Conference
on High Performance Computing (HiPC), pages
[7] Maria Avgerinou, Paolo Bertoldi, and Luca Castel-
244–253, Dec 2015.
lazzi. Trends in data centre energy consumption
under the european code of conduct for data cen- [17] Timothy G. Mattson, Rob Van der Wijngaart, and
tre energy efficiency. Energies, 10(10):1470, Sep Michael Frumkin. Programming the intel 80-
2017. core network-on-a-chip terascale processor. In
Proceedings of the 2008 ACM/IEEE Conference on
[8] B. Edwards, D. Wentzlaff, L. Bao, H. Hoffmann, Supercomputing, SC ’08, pages 38:1–38:11, Pis-
C. Miao, C. Ramey, M. Mattina, P. Griffin, A. Agar- cataway, NJ, USA, 2008. IEEE Press.
wal, and J. F. Brown III. On-chip interconnection
architecture of the tile processor. IEEE Micro, [18] George Nychis, Chris Fallin, Thomas Moscibroda,
27:15–31, 10 2007. and Onur Mutlu. Next generation on-chip net-
works: What kind of congestion control do we
[9] Emilio Francesquini, Márcio Castro, Pedro H. need? In Proceedings of the 9th ACM SIGCOMM
Penna, Fabrice Dupros, Henrique C. Freitas, Workshop on Hot Topics in Networks, Hotnets-
Philippe O.A. Navaux, and Jean-François Méhaut. IX, pages 12:1–12:6, New York, NY, USA, 2010.
On the energy efficiency and performance of ACM.
irregular application executions on multicore,
numa and manycore platforms. J. Parallel Distrib.
Comput., 76(C):32–48, February 2015.

[10] Matteo Frigo and Steven G Johnson. The design


and implementation of fftw3. Proceedings of the
IEEE, 93(2):216–231, 2005.

[11] Richard L. Graham, Devendar Bureddy, Pak


Lui, Hal Rosenstock, Gilad Shainer, Gil Bloch,
Dror Goldenerg, Mike Dubman, Sasha Kotchu-
bievsky, Vladimir Koushnir, Lion Levi, Alex Mar-
golin, Tamir Ronen, Alexander Shpiner, Oded
Wertheim, and Eitan Zahavi. Scalable hierarchi-
cal aggregation protocol (sharp): A hardware
architecture for efficient data reduction. In Pro-
ceedings of the First Workshop on Optimization
of Communication in HPC, COM-HPC ’16, pages
1–10, Piscataway, NJ, USA, 2016. IEEE Press.

Page 12 of 12

View publication stats

You might also like