
Scientific Programming

Editors: Konstantin Läufer, [email protected]


Konrad Hinsen, [email protected]

Parallel Programming with Skeletons
By Joel Falcou

Parallel programming is bound to become the main concern of software developers in the coming decades.
Various models aim to solve this tension, trading efficiency for abstraction or vice versa, but how about
getting both?

Back at the dawn of scientific computing, parallel machines were revered as titans that few could approach and even fewer could tame. Today, after decades of progress in both hardware and software development, parallel computing is a mainstream technique for getting things done. However, as this progress spawned increasingly powerful machines, it spelled doom for many developers.

As Herb Sutter [1] stated, the free lunch is over for sequential programming: designers must take concurrency and parallel programming into account at every level of software design to avoid the dreaded “this application doesn’t scale” result. For nonspecialists, this means that writing efficient code for such machines or groups of machines will become less trivial, as it usually involves dealing with low-level APIs such as OpenMP, POSIX threads, and the message-passing interface (MPI). However, years of experience have shown that using such frameworks is difficult and error-prone; deadlocks and other undesired behaviors make parallel software development very slow compared to the classic, sequential approach. As an alternative, software design patterns offer an easy way to build reusable, structured sequential software components. Similarly, various attempts have emerged to provide a structured framework to design and implement nontrivial parallel applications and deliver high performance.

In this article, I present such a programming model, Parallel Algorithmic Skeletons, along with a library called Quaff that implements it in C++ and makes parallel application development easier.

Parallel Skeletons in a Nutshell

The concept of parallel skeletons [2, 3] is based on the observation that many applications express parallelism in the form of a few recurring patterns of computation and communication. An example of such a pattern is a pipeline, which is a model of parallel function composition. Another such pattern is a data-parallel structure, in which a master process slices data along some dimension and distributes them to a pool of slave processing units.

Essentially, a skeleton is a pattern that occurs in a significant number of parallel applications; some skeletons are general, whereas others are specific to an application domain. Although design patterns are rather informal, skeletons have been formalized to the extent that we can express them as concrete language constructs or templates. Recently, commercial users have started to show interest in these methods. Google’s MapReduce model for distributed data mining is, in fact, nothing but a specific parallel skeleton.

Every community or application domain has its own specific skeletons. In computer vision (the example that I use in this article), parallel skeletons mostly involve slicing and distributing regular data. Parallel exploration of a tree-like structure or a randomized walk through some iteration space are common in operational research applications. The main advantage of this parameterization of parallelism is that all low-level, architecture, or framework-dependent code is hidden from the user, who only has to write sequential code fragments and instantiate skeletons. The skeleton approach thus provides a decent level of organization.

Another interesting feature of skeletons is their ability to be nested. If we look at a skeleton as a function that takes functions as arguments and produces parallel code, then any instantiated skeleton is eligible to be another skeleton’s argument. Skeletons are thus higher-order functions in the sense of functional programming, that is, functions whose parameters are themselves functions. To be able to perform the transformation from a combination of skeletons to a working parallelized program, we need an application model capable of representing an arbitrarily complex program as a set of functions. [4]

Copublished by the IEEE CS and the AIP. 1521-9615/09/$25.00 © 2009 IEEE. Computing in Science & Engineering, May/June 2009.
As an example, consider an image-processing application for detecting edges in a video stream. We would build this application from four functions:

• load, which retrieves an image from the video stream;
• thresh, which applies a binary threshold to an image;
• edge, which extracts edges as lists of lines; and
• save, which saves the result to a file on disk.

The sequential version of our application is defined as the sequential composition of these four functions:

A_s = sequence [load, thresh, edge, save],

where sequence represents the sequential composition skeleton.

We can introduce the first level of parallelism by noticing that we can run the four functions in parallel if we apply them to different elements of the data stream. Thus, while load is loading the ith image, we can apply thresh to the (i − 1)th image. In general, at step t the four functions can work on images i_t, i_{t−1}, i_{t−2}, and i_{t−3} in parallel. This parallelization scheme, a pipeline, is a classic yet useful form of parallelism that we can choose as a skeleton. A first parallel version of A is thus

A_1 = pipeline [load, thresh, edge, save].

We can express another level of parallelism by noticing that we can run the thresh function on different image slices in parallel. So, if we define two functions for slicing and merging images (slice and merge, respectively), we can express A using a new parallel construction, map, which slices an image, distributes the slices over a pool of slave processors, and merges the results of a function’s parallel application (which is thresh in our example):

A_2 = pipeline [map [slice, thresh, merge], edge, save].

We thus define the final parallel application in terms of skeleton nesting (map being nested inside a pipeline) and the list of sequential functions. As far as the developer is concerned, the parallelization is done because the skeleton implementation will handle all the low-level communication and marshaling details. Building parallel applications is simplified because the developer only needs to know the skeletons’ operational semantics. Another advantage is that developers can reuse existing sequential functions directly because they don’t need to know about the parallelization process.

The Quaff Library

Quaff is a skeleton-based parallel programming library whose main task is to rely on C++ template metaprogramming to reduce the overhead traditionally associated with object-oriented implementations of such libraries. The basic idea is to use the C++ template mechanism so that skeleton-based programs expand at compile time and generate a new C++ MPI code to be compiled and executed at runtime. This code generation totally removes the overhead associated with runtime polymorphism and function forwarding. To do this, developers transform the skeleton structure extracted from the application definition into a list of executable instructions for a given parallel architecture, as Figure 1 shows.

Figure 1. Quaff code generation process (skeleton tree generation, process network production, and C + MPI code generation). The compile-time system analyzes C++ code using skeleton constructors to build a process network. This network is then used to generate the message-passing interface (MPI) code for compilation.

Other research describes this process and its associated specific language. [5] Whereas languages such as MetaOCaml or Template Haskell natively support such constructions, C++ requires the use (and abuse) of template metaprogramming and operator overloading.

Let’s examine the Quaff interface and which skeletons it supports.

The Quaff Programming Model

Figure 2 presents a simple algorithm parallelized with Quaff. In this example, we want to apply the function comp to a vector in a parallel way.

The actual code is split into four parts, starting with the definition of the user functions. The only limitation is the argument ordering (input first, output last), which is a requirement to enable Quaff to determine how data should be transferred between processes. The library provides communication support for all standard C and C++ types and some standard template library containers, such as vector or list, thus limiting the marshaling code we must write to support custom types. Next, the user initializes the parallel execution environment at line 10 via the initialize function. From this point on, we can evaluate and run skeleton expressions on the underlying MPI-enabled parallel machine.

The application is a combination of skeleton constructors on lines 12 through 14.

    #include <quaff/quaff.hpp>
    using namespace quaff;

    void load(vector<float>& d);
    void comp(vector<float> const& d, vector<float>& r);
    void save(vector<float> const& r);

    int main(int argc, const char* argv[])
    {
        initialize(argc, argv);           // start the parallel environment

        run( ( seq(load)                  // load the data
             , map<16>(seq(comp))         // apply comp over 16 processors
             , seq(save)                  // save the results
             ) );

        finalize();                       // shut the environment down
    }

Figure 2. Quaff sample code. In this example, we want to apply the function comp to a vector in a parallel way.

In this example, we first load data from a file, distribute this data over processors using the map skeleton, perform the computation, and gather the results, which are then saved back on disk. This sample code shows the explicit call to the MAP and SEQ skeleton constructors and the use of the comma operator as the CHAIN skeleton constructor. Note that we can parameterize skeleton constructors via additional information: for instance, map takes an additional template parameter that describes over how many processors the data will be distributed. Finally, the finalize function shuts down the parallel execution environment in line 17. Again, note that the process of structuring the communication and generating MPI primitive calls is mostly done at compile time thanks to metaprogramming, leading to a very small overhead (a few percent) compared to a manual implementation of the same application.

Supported Skeletons

Because we usually define skeletons by generalizing the parallelization patterns that arise from the implementation of specific application classes, no single standard list of skeletons exists. Quaff supports a small subset of skeletons that the community agrees on so far:

• The SEQ skeleton encapsulates user-defined sequential functions to use them as parameters of other skeletons.
• The CHAIN skeleton calls other skeletons in sequence.
• The PARDO skeleton supports ad hoc parallelism; it simply spawns parallel processes with no defined communication scheme.
• The PIPELINE skeleton is functionally equivalent to parallel function composition.
• The FARM skeleton models irregular, asynchronous data parallelism in which a master process dynamically distributes inputs to a pool of slave processors using some parameterizable heuristic.
• The MAP skeleton models regular data parallelism that divides the input data into subsets on which a given function is applied. The MAP skeleton then merges back the subset results to produce the total output.

All these skeletons are directly usable in Quaff via the corresponding function. Some of them are also mapped to operators to make the description of parallel applications easier: for instance, the CHAIN, PIPE, and PARDO skeletons are respectively mapped to the comma operator, the bitwise OR operator (|), and the bitwise AND operator (&).

Application to Computer Vision

To demonstrate Quaff’s expressiveness and efficiency, we can parallelize a realistic application from the domain of computer vision. Computer vision features various complex, time-consuming algorithms and is a field of choice for parallelization. The application I’ve chosen performs object detection and tracking in a stereoscopic video stream using a probabilistic algorithm. In this approach, detecting and tracking is done by computing the a posteriori probability density of a state vector representing the tracked object from a series of observations following a standard Bayesian inference procedure. To solve such a problem, we can either analytically solve the Chapman-Kolmogorov prediction equation or use a Monte Carlo method. The particle filter algorithm models such a probabilistic procedure with a Markov process by estimating the probability density by a weighted, discrete set of observations inside the observation universe. [6]

In our case, we want to track a pedestrian in a 3D world. To do so, we try to estimate the probability density of a set of particles containing the pedestrian’s 3D position and 3D velocity vector. Those particles also have an evolution model that represents how we can model a pedestrian particle between frames. No pedestrian can run faster than the speed of sound, thus capping the velocity vector’s norm, and no pedestrian can teleport
between frames, thus providing a continuity equation on position.

Our implementation of the particle filter algorithm uses four functions:

• generate builds the particle distribution from the last iteration results,
• measure extracts features from the video stream to evaluate each particle’s interest score by using an image descriptor,
• sample resamples the particle set by replicating particles with large weight and trimming particles with small weight, and
• estimate computes the particle set’s average to get the current frame estimation.

This algorithm’s parallelization is based on distributing the particle set over the available processors and then computing the estimated position of the object on the root node, which also runs the application’s GUI parts.

Figure 3 presents the Quaff listing for this application, and Figure 4 shows a sample execution of our 3D tracking application. The upper part of the figure shows the two video streams and the projection of the particle distribution, whereas the lower part shows the estimated pedestrian’s 3D path projected on the ground plane.

    #include <quaff/quaff.hpp>
    using namespace quaff;

    #define NPROC 16
    typedef std::vector<particles> data_t;

    void gui();
    void generate(data_t&);
    void measure(image const&, data_t const&, data_t&);
    void sample(data_t const&, data_t&);
    particles estimate(data_t const&);
    void update_gui(particles const&);

    int main(int argc, const char* argv[])
    {
        initialize(argc, argv);

        run( seq(gui)
           & ( map<NPROC>( ( seq(generate)
                           , seq(measure)
                           , seq(sample)
                           )
                         )
             , seq(estimate)
             , seq(update_gui)
             )
           );
        finalize();
    }

Figure 3. Quaff code for the particle filter application. After starting up the parallel environment, the application is split into two parts: the GUI handling and the actual computation code that follows the algorithm.

Figure 4. Sample tracking session (panels: left image, right image, particles distribution projection, estimated 3D path). The top of the screen shows the right and left image provided by the cameras; a subset of the particle set and the estimated position are projected (the yellow and red boxes). The lower part shows a bird’s-eye view of the reconstructed pedestrian path.

Parallel programming is bound to become an everyday problem for a large part of the scientific computing community. These developers will have to struggle with increasingly complex architectures and will need tools to overcome these difficulties. Parallel skeletons could offer a solution to several problems. Because it’s an easy-to-use and efficient way of building parallel applications, this model can solve many of the common issues associated with parallel programming, such as handling communications and synchronization. Parallel skeletons won’t be the universal solution for everyone, but when they’re applicable, they provide a convenient way to describe computational problems and solve them efficiently.

The Quaff library shows that users can operate such a model in a mainstream language with a high level of abstraction and without efficiency loss. Currently, Quaff targets clusters, multicore machines, and the Cell processor, with only small differences in the interface. The goal is to develop skeleton-based applications on heterogeneous platforms with a single source code, for example on architectures such as the IBM RoadRunner cluster of Cell-enhanced multicore cluster nodes.

Other Parallel Skeleton Libraries

Several implementations of the parallel skeleton paradigm are available for various programming languages. They show that the skeleton paradigm is indeed easy to integrate into mainstream languages.

The Eskel Library by Murray Cole represents a concrete attempt to embed the skeleton-based parallel programming method into the mainstream of parallel programming. It offers various skeletal parallel programming constructs that resemble MPI primitives and are directly usable in C code. However, Eskel’s low-level API requires the programmer to take care of internal implementation details. Eskel and a lot of seminal information about skeletons are available at http://homepages.inf.ed.ac.uk/abenoit1/eSkel/.

Herbert Kuchen based MUESLI, the C++ Münster Skeleton Library, on a platform-independent structure. The main idea is to generate a process topology from the construction of various skeleton classes and use a distributed container to handle data transmission. This polymorphic C++ skeleton library offers a high level of abstraction but stays close to a language that’s familiar to many developers. Moreover, the C++ binding for higher-order functions and polymorphic calls ensures that the library is type safe. The main problem is that the overhead due to dynamic polymorphism is rather high (between 20 and 110 percent for rather simple applications). A prototype is available at www.wi.uni-muenster.de/PI/forschung/Skeletons/index.php.

Finally, Lithium is a Java skeleton library for clusters and grid-like networks of machines whose underlying execution model is a macro data-flow one. The choice of Java is motivated by the fact that it provides an easy-to-use, platform-independent, object-oriented language. Lithium also spawned ASSIST, a skeleton-based framework for grid computing. Lithium is available at www.di.unipi.it/marcod/Lithium/.

Table 1. Timing results for various particle sets.

P     Metric       Np = 1,000    Np = 2,000     Np = 5,000      Np = 10,000
1     Time         672.30 ms     1,986.19 ms    11,783.58 ms    21,142.71 ms
2     Time         409.94 ms     1,115.84 ms    6,074.01 ms     10,898.30 ms
      Speedup      1.64          1.78           1.87            1.94
      Efficiency   82%           89%            93.5%           97%
4     Time         263.65 ms     711.90 ms      3,527.45 ms     6,110.61 ms
      Speedup      2.55          2.79           3.22            3.46
      Efficiency   63.7%         79.5%          80.5%           96.5%
6     Time         187.79 ms     467.34 ms      1,975.37 ms     3,645.29 ms
      Speedup      3.58          4.25           5.75            5.80
      Efficiency   59.7%         70.8%          95.8%           96.7%
12    Time         162.00 ms     291.23 ms      1,046.86 ms     1,828.95 ms
      Speedup      4.15          6.82           10.85           11.56
      Efficiency   34.6%         56.8%          90.4%           96.3%
28    Time         108.61 ms     170.05 ms      512.79 ms       786.27 ms
      Speedup      6.19          11.68          22.15           26.89
      Efficiency   22.1%         41.7%          79.1%           96.1%
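
The Speedup and Efficiency rows of Table 1 can be read with the usual definitions, which are assumed here because the article does not spell them out: speedup is the one-processor time divided by the P-processor time, and efficiency is speedup divided by P. The function names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Speedup of a P-processor run relative to the one-processor time.
double speedup(double t1_ms, double tp_ms) {
    assert(tp_ms > 0.0);
    return t1_ms / tp_ms;
}

// Parallel efficiency: speedup divided by the processor count.
double efficiency(double t1_ms, double tp_ms, int p) {
    assert(p > 0);
    return speedup(t1_ms, tp_ms) / p;
}
```

For the Np = 1,000 column with P = 2, speedup(672.30, 409.94) is about 1.64 and efficiency(672.30, 409.94, 2) is about 0.82, matching the 1.64 and 82% entries in the table.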

References

1. H. Sutter and J. Larus, “The Free Lunch Is Over: A Fundamental Turn toward Concurrency in Software,” Dr. Dobb’s J., vol. 30, no. 3, 2005; www.ddj.com/web-development/184405990.
2. M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation, MIT Press, 1989.
3. D.B. Skillicorn, “Architecture-Independent Parallel Computation,” Computer, vol. 23, no. 12, 1990, pp. 38–50.
4. M. Aldinucci and M. Danelutto, “Stream Parallel Skeleton Optimization,” Proc. 11th Int’l Conf. Parallel and Distributed Computing and Systems, ACTA Press, 1999, pp. 955–962.
5. J. Sérot and J. Falcou, “Functional Meta-Programming for Parallel Skeletons,” Proc. Int’l Conf. Computational Science, Springer-Verlag, 2008, pp. 154–163.
6. J. Falcou et al., “Real Time Parallel Implementation of a Particle Filter Based Visual Tracking,” Workshop on Computation Intensive Methods for Computer Vision, 2006, CD-ROM.

Joel Falcou is an assistant professor at the University Paris-Sud and researcher at the Laboratoire de Recherche d’Informatique in Orsay, France. His work focuses on investigating high-level programming models for parallel architectures (present and future) and providing efficient implementation of such models using high-performance language features. Contact him at [email protected].
