Fundamentals of Multicore Software Development

Chapman & Hall/CRC Computational Science Series
With multicore processors now in every computer, server, and embedded
device, the need for cost-effective, reliable parallel software has never been
greater. By explaining key aspects of multicore programming, Fundamentals of
Multicore Software Development helps software engineers understand parallel
programming and master the multicore challenge.
Accessible to newcomers to the field, the book captures the state of the art
of multicore programming in computer science. It covers the fundamentals
of multicore hardware, parallel design patterns, and parallel programming in
C++, .NET, and Java. It also discusses manycore computing on graphics cards
and heterogeneous multicore platforms, automatic parallelization, automatic
performance tuning, transactional memory, and emerging applications.
Features
• Presents the basics of multicore hardware and parallel programming
• Explains how design patterns can be applied to parallel programming
• Describes parallelism in C++, .NET, and Java as well as the OpenMP API
• Discusses scalable manycore computing with CUDA and programming
approaches for the Cell processor
• Covers emerging technologies, including techniques for automatic extraction
of parallelism from sequential code, automatic performance tuning for
parallel applications, and a transactional memory programming model
• Explores future directions of multicore processors
As computing power increasingly comes from parallelism, software developers
must embrace parallel programming. Written by leaders in the field, this book
provides an overview of the existing and up-and-coming programming choices
for multicores. It addresses issues in systems architecture, operating systems,
languages, and compilers.
SERIES EDITOR
Horst Simon
Deputy Director
Lawrence Berkeley National Laboratory
Berkeley, California, U.S.A.
PUBLISHED TITLES
PETASCALE COMPUTING: ALGORITHMS AND APPLICATIONS
Edited by David A. Bader
PROCESS ALGEBRA FOR PARALLEL AND DISTRIBUTED PROCESSING
Edited by Michael Alexander and William Gardner
GRID COMPUTING: TECHNIQUES AND APPLICATIONS
Barry Wilkinson
INTRODUCTION TO CONCURRENCY IN PROGRAMMING LANGUAGES
Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen
INTRODUCTION TO SCHEDULING
Yves Robert and Frédéric Vivien
SCIENTIFIC DATA MANAGEMENT: CHALLENGES, TECHNOLOGY, AND DEPLOYMENT
Edited by Arie Shoshani and Doron Rotem
INTRODUCTION TO THE SIMULATION OF DYNAMICS USING SIMULINK®
Michael A. Gray
INTRODUCTION TO HIGH PERFORMANCE COMPUTING FOR SCIENTISTS
AND ENGINEERS, Georg Hager and Gerhard Wellein
PERFORMANCE TUNING OF SCIENTIFIC APPLICATIONS, Edited by David Bailey,
Robert Lucas, and Samuel Williams
HIGH PERFORMANCE COMPUTING: PROGRAMMING AND APPLICATIONS
John Levesque with Gene Wagenbreth
PEER-TO-PEER COMPUTING: APPLICATIONS, ARCHITECTURE, PROTOCOLS, AND CHALLENGES
Yu-Kwong Ricky Kwok
FUNDAMENTALS OF MULTICORE SOFTWARE DEVELOPMENT
Victor Pankratius, Ali-Reza Adl-Tabatabai, and Walter Tichy
Fundamentals of
Multicore Software
Development
Edited by
Victor Pankratius
Ali-Reza Adl-Tabatabai
Walter Tichy
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
Contents

Foreword
Editors
Contributors
1 Introduction
Victor Pankratius, Ali-Reza Adl-Tabatabai, and Walter F. Tichy
6 OpenMP
Barbara Chapman and James LaGrone
Index
Foreword
Parallel computing is almost as old as computing itself, but until quite recently it
had been interesting only to a small cadre of aficionados. Today, the evolution of
technology has elevated it to central importance. Many "true believers," including
me, were certain it was going to be vital to the mainstream of computing eventually. I
thought its advent was imminent in 1980. What stalled its arrival was truly spectacular
improvements in processor performance, due partly to ever-faster clocks and partly to
instruction-level parallelism. This combination of ideas was abetted by increasingly
abundant and inexpensive transistors and was responsible for postponing the need to
do anything about the von Neumann bottleneck—the requirement that the operations
in a program must appear to execute in linear order—until now.
What some of us discovered about parallel computing during the backwater period
of the 1980s formed a foundation for our current understanding. We knew that
operations that are independent of each other can be performed in parallel, and so
dependence became a key target for compiler analysis. Variables, formerly a benign
concept thanks to the von Neumann bottleneck, became a major concern for some of
us because of the additional constraints needed to make variables work. Heterodox
programming models such as synchronous languages and functional programming
were proposed to mitigate antidependences and data races. There was a proliferation
of parallel languages and compiler technologies aimed at the known world of appli-
cations, and very few computational problems escaped the interest and zeal of those
of us who wanted to try to run everything in parallel.
Soon thereafter, though, the worlds of parallel servers and high-performance com-
puting emerged, with architectures derived from interconnected workstation or per-
sonal computer processors. The old idea of totally rethinking the programming model
for parallel systems took a backseat to pragmatism, and existing languages like For-
tran and C were augmented for parallelism and pressed into service. Parallel program-
ming got the reputation of being difficult, even becoming a point of pride for those
who did it for a living. Nevertheless, this body of practice became parallel comput-
ing in its current form and dominated most people’s thinking about the subject until
recently.
With the general realization that the von Neumann bottleneck has arrived at last,
interest in parallelism has exploded. New insights are emerging as the whole field of
computing engages with the challenges. For example, we have an emerging under-
standing of many common patterns in parallel algorithms and can talk about parallel
programming from a new point of view. We understand that a transaction, i.e., an
isolated atomic update of the set of variables that comprise the domain of an invari-
ant, is the key abstraction needed to maintain the commutativity of variable updates.
Language innovations in C++, Microsoft .NET, and Java have been introduced to
support task-oriented as well as thread-oriented programming. Heterogeneous paral-
lel architectures with both GPUs and CPUs can deliver extremely high performance
for parallel programs that are able to exploit them.
Now that most of the field is engaged, progress has been exciting. But there is much
left to do. This book paints a great picture of where we are, and gives more than an
inkling of where we may go next. As we gain broader, more general experience with
parallel computing based on the foundation presented here, we can be sure that we
are helping to rewrite the next chapter—probably the most significant one—in the
amazing history of computing.
Burton J. Smith
Seattle, Washington
Chapter 1
Introduction
Victor Pankratius, Ali-Reza Adl-Tabatabai, and Walter F. Tichy
possible, increasing clock speeds would exceed the few hundred watts per chip that
can practically be dissipated in mass-market computers as well as the power available
in battery-operated mobile devices.
The second event is that parallelism internal to the architecture of a processor
has reached a point of diminishing returns. Deeper pipelines, instruction-level par-
allelism, and speculative execution appear to offer no opportunity to significantly
improve performance.
The third event is really a continuing trend: Moore’s law projecting an exponen-
tial growth in the number of transistors per chip continues to hold. The 2009
International Technology Roadmap for Semiconductors (http://www.itrs.net/Links/
2009ITRS/Home2009.htm) expects this growth to continue for another 10 years;
beyond that, fundamental limits of CMOS scaling may slow growth.
The net result is that hardware designers are using the additional transistors to
provide additional cores, while keeping clock rates constant. Some of the extra pro-
cessors may even be specialized, for example, for encryption, video processing, or
graphics. Specialized processors are advantageous in that they provide more per-
formance per watt than general-purpose CPUs. Programmers will thus have to deal
not only with parallelism but also with heterogeneous instruction sets on a single
chip.
1.3 Audience
This book targets students, researchers, and practitioners interested in parallel pro-
gramming, as well as instructors of courses in parallelism. The authors present the
basics of the various parallel programming models in use today, plus an overview of
emerging technologies. The emphasis is on software; hardware is only covered to the
extent that software developers need to know.
1.4 Organization
• Part I: Basics of Parallel Programming (Chapters 2 and 3)
• Part II: Programming Languages for Multicore (Chapters 4 through 6)
• Part III: Programming Heterogeneous Processors (Chapters 7 and 8)
• Part IV: Emerging Technologies (Chapters 9 through 12)
extensions, parallel programming in C++ becomes less error prone, and C++ imple-
mentations become more robust. Boehm’s chapter concludes with a comparison of
the current standard to earlier standards.
In Chapter 5, Judy Bishop describes parallelism in .NET and Java. The chapter
starts with a presentation of .NET. Bishop outlines particular features of the Task
Parallel Library (TPL) and of Parallel Language-Integrated Query (PLINQ), which
allows declarative queries over datasets to execute in parallel. She also
presents examples of how to use parallel loops and futures. Then she discusses parallel
programming in Java and the constructs of the java.util.concurrent library, including
thread pools, task scheduling, and concurrent collections. In an outlook, she sketches
proposals for Java fork-join parallelism and parallel array processing.
In Chapter 6, Barbara Chapman and James LaGrone give an overview of OpenMP. The
chapter starts by describing the basic concepts of how OpenMP directives parallelize
programs in C, C++, and Fortran. Numerous code examples illustrate loop-level
parallelism and task-level parallelism. The authors also explain the principles of how
an OpenMP compiler works. The chapter ends with possible future extensions of
OpenMP.
Chapter 2
Fundamentals of Multicore Hardware and Parallel Programming
Barry Wilkinson
Contents
2.1 Introduction
2.2 Potential for Increased Speed
2.3 Types of Parallel Computing Platforms
2.4 Processor Design
2.5 Multicore Processor Architectures
    2.5.1 General
    2.5.2 Symmetric Multicore Designs
    2.5.3 Asymmetric Multicore Designs
2.6 Programming Multicore Systems
    2.6.1 Processes and Threads
    2.6.2 Thread APIs
    2.6.3 OpenMP
2.7 Parallel Programming Strategies
    2.7.1 Task and Data Parallelism
        2.7.1.1 Embarrassingly Parallel Computations
        2.7.1.2 Pipelining
        2.7.1.3 Synchronous Computations
        2.7.1.4 Workpool
2.8 Summary
References
2.1 Introduction
In this chapter, we will describe the background to multicore processors, describe
their architectures, and lay the groundwork for the remainder of the book on programming
these processors. Multicore processors integrate multiple processor cores on the
same integrated circuit chip (die), which are then used collectively to achieve higher
overall performance. Constructing a system with multiple processors and using them
collectively is a rather obvious idea for performance improvement. In fact, it became
evident in the early days of computer design as a potential way of increasing the
speed of computer systems. A computer system constructed with multiple processors
that are intended to operate together has historically been called a parallel computer,
and programming the processors to operate together is called parallel programming.
Parallel computers and parallel programming have a long history. The term parallel
programming was used by Gill as early as 1958 (Gill 1958), and his definition of parallel
programming is essentially the same as it is today.
In this chapter, we will first explore the previous work and will start by establishing
the limits for performance improvement of processors operating in parallel to satisfy
ourselves that there is potential for performance improvement. Then, we will look at
the different ways that a system might be constructed with multiple processors. We
continue with an outline of the improvements that have occurred in the design of the
processors themselves, which have led to an enormous increase in the speed of individual
processors. These improvements have been so dramatic that the added complexities
of parallel computers have limited their use mostly to very high-performance computing
in the past. Of course, as individual processors improve, parallel computers constructed
from them also improve proportionately. But programming
the multiple processors for collective operation is a challenge, and most demands out-
side scientific computing have been satisfied with single processor computers, relying
on the ever-increasing performance of processors. Unfortunately, further improve-
ments in single processor designs hit major obstacles in the early 2000s, which we
will outline. These obstacles led to the multicore approach. We describe architectural
designs for a multicore processor and conclude with an outline of the methods for pro-
gramming multicore systems as an introduction to subsequent chapters on multicore
programming.
not to perform as well on a single processor. The execution times could be empir-
ical, that is, measured on real computers. It might be measured by using the Linux
time command. Sometimes, one might instrument the code with routines that return
wall-clock time, one at the beginning of a section of code to record the start time, and
one at the end to record the end time. The elapsed time is the difference. However,
this method can be inaccurate as the system usually has other processes executing
concurrently in a time-shared fashion. The speedup factor might also be computed
theoretically from the number of operations that the algorithms perform. The classi-
cal way of evaluating sequential algorithms is by using the time complexity notation,
but it is less effective for a parallel algorithm because of uncertainties such as com-
munication times between cooperating parallel processes.
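A minimal sketch of such wall-clock instrumentation in C, assuming a POSIX system where clock_gettime is available (the compute routine is invented purely for illustration):

#include <stdio.h>
#include <time.h>

/* Stand-in for the section of code being measured. */
static void compute(void) {
    volatile double x = 0.0;
    for (long i = 1; i <= 50000000L; i++)
        x += 1.0 / (double)i;
}

int main(void) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);   /* record the start time */
    compute();                                /* section of code to time */
    clock_gettime(CLOCK_MONOTONIC, &end);     /* record the end time */

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Elapsed wall-clock time: %.3f s\n", elapsed);  /* the difference */
    return 0;
}

As noted above, such a measurement includes interference from other time-shared processes, so it is normally repeated and averaged.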
A speedup factor of p with p processors is called linear speedup. Conventional
wisdom is that the speedup factor should not be greater than p because if a problem
is divided into p parts each executed on one processor of a p-processor system and
tp < ts /p, then the same parts could be executed one after the other on a single pro-
cessor system in time less than ts . However, there are situations where the speedup
factor is greater than p (superlinear speedup). The most notable cases are
• When the processors in the multiprocessor system have more memory (cache or
main memory) than the single processor system, which provides for increased
performance.
• When the multiprocessor has some special feature not present in the single
processor system, such as special instructions or hardware accelerators.
• When the algorithm is nondeterministic, most notably in search problems, where
the parallel version may happen to find the solution after less total work than the
sequential version.
The first two cases are not fair hardware comparisons, whereas the last certainly can
happen but only with specific problems.
In 1967, Amdahl explored what the maximum speedup would be when a sequential
computation is divided into parts and these parts are executed on different processors.
This is not comparing the best sequential algorithm with a particular parallel
algorithm—it is comparing a particular computation mapped onto a single computer
and mapped onto a system having multiple processors. Amdahl also assumed that a
computation has sections that cannot be divided into parallel parts and these must be
performed sequentially on a single processor, and other sections that can be divided
equally among the available processors. The sections that cannot be divided into par-
allel parts would typically be an initialization section of the code and a final section
of the code, but there may be several indivisible parts. For the purpose of the analysis,
they are lumped together into one section that must be executed sequentially and one
section that can be divided into p equal parts and executed in parallel. Let f be the
fraction of the whole computation that must be executed sequentially, that is, cannot
be divided into parallel parts. Hence, the fraction that can be divided into parts is 1−f .
If the whole computation executed on a single computer in time ts , fts is indivisible

FIGURE 2.1: Computation time on a single computer, ts, divided into an indivisible (sequential) fraction f·ts and a divisible fraction (1 − f)·ts; with p processors the divisible part takes (1 − f)·ts/p, giving a parallel execution time of tp = f·ts + (1 − f)·ts/p.

and (1 − f )ts is divisible. The ideal situation is when the divisible section is divided
equally among the available processors and then this section would be executed in
time (1 − f )ts /p given p processors. The total execution time using p processors is
then fts + (1 − f )ts /p as illustrated in Figure 2.1. The speedup factor is given by
    S(p) = ts / [f·ts + (1 − f)·ts/p] = 1 / [f + (1 − f)/p] = p / [1 + (p − 1)·f]        (2.2)
This famous equation is known as Amdahl's law (Amdahl 1967). The key observation
is that S(p) is limited to 1/f, which it approaches as p tends to infinity. For
example, suppose the sequential part is 5% of the whole. The maximum speedup
is 20, irrespective of the number of processors. This is a very discouraging result.
Amdahl used this argument to support the design of ultrahigh-speed single processor
systems in the 1960s.
Later, Gustafson (1988) described how the conclusion of Amdahl’s law might be
overcome by considering the effect of increasing the problem size. He argued that
when a problem is ported onto a multiprocessor system, larger problem sizes can be
considered, that is, the same problem but with a larger number of data values. The
starting point for Gustafson’s law is the computation on the multiprocessor rather than
on the single computer. In Gustafson’s analysis, the parallel execution time is kept
constant, which we assume to be some acceptable time for waiting for the solution.
The computation on the multiprocessor is composed of a fraction that is computed
sequentially, say f , and a fraction that contains parallel parts, (1 − f ).
This leads to Gustafson's so-called scaled speedup fraction, S′(p), given by

    S′(p) = [f·tp + (1 − f)·p·tp] / tp = p + (1 − p)·f        (2.3)
The fraction, f, is here the fraction of the computation on the multiprocessor that cannot
be parallelized. This is different from the f used previously, which is the fraction of the
computation on a single computer that cannot be parallelized. The conclusion drawn from
Gustafson’s law is that it should be possible to get high speedup if we scale up the
problem size. For example, if f is 5%, the scaled speedup computes to 19.05 with 20
processors, whereas with Amdahl’s law with f = 5%, the speedup computes to 10.26.
Gustafson quotes results obtained in practice of very high speedup close to linear on
a 1024-processor hypercube.
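Both laws are easy to evaluate numerically. The short C sketch below reproduces the figures quoted above (a limit of 20 for f = 5% under Amdahl's law, and 10.26 versus 19.05 for p = 20); it is illustrative only:

#include <stdio.h>

/* Amdahl's law: serial fraction f measured on the single processor. */
static double amdahl(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

/* Gustafson's scaled speedup: serial fraction f measured on the multiprocessor. */
static double gustafson(double f, int p) {
    return p + (1 - p) * f;
}

int main(void) {
    double f = 0.05;
    printf("Amdahl,    p = 20:        %.2f\n", amdahl(f, 20));    /* 10.26 */
    printf("Amdahl,    p -> infinity: %.2f\n", 1.0 / f);          /* 20.00 */
    printf("Gustafson, p = 20:        %.2f\n", gustafson(f, 20)); /* 19.05 */
    return 0;
}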
Others have explored speedup equations over the years, and interest in them has reappeared
with the introduction of multicore processors. Hill and Marty (2008) explored Amdahl's
law with different architectural arrangements for multicore processors. Woo and Lee
(2008) continued this work by considering Amdahl’s law and the architectural arrange-
ments in the light of energy efficiency, a key aspect of multicore processors. We will
look at the architectural arrangements for multicore processors later.
add a constant to each data element, or multiply data elements. In a SIMD computer,
there are multiple processing elements but a single program. This type of design is
very efficient for the class of problems it addresses, and some very large SIMD com-
puters have been designed over the years, perhaps the first being the Illiac IV in 1972.
The SIMD approach was adopted by supercomputer manufacturers, most notably by
Cray computers. SIMD computers are sometimes referred to as vector computers as
the SIMD instructions operate upon vectors. SIMD computers still need SISD instruc-
tions to be able to construct a program.
SIMD instructions can also be incorporated into regular processors for those times
when appropriate problems are presented to them. For example, the Intel Pentium series,
starting with the Pentium II in 1996, has SIMD instructions, called MMX (Multi-
Media eXtension) instructions, for speeding up multimedia applications. This design
used the existing floating-point registers to pack multiple data items (eight bytes,
four 16-bit numbers, or two 32-bit numbers) that are then operated upon by the same
operation. With the introduction of the Pentium III in 1999, Intel added further SIMD
instructions, called SSE (Streaming SIMD Extensions), operating upon eight new 128-bit
registers. Intel continued adding SIMD instructions. SSE2 was first introduced
with the Pentium 4, and subsequently, SSE3, SSE4, and SSE5 appeared. In 2008, Intel
announced AVX (Advanced Vector Extensions), operating upon registers extended to
256 bits. Whereas large SIMD computers vanished in the 1990s because they could
not compete with general-purpose multiprocessors (MIMD), SIMD instructions continue
to be used for certain applications. The approach can also be found in graphics cards.
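As an illustration of how these packed operations are reached from C (a sketch, not taken from the chapter), the SSE intrinsics add two four-element float vectors with a single SIMD instruction:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);       /* load four packed floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);    /* one instruction adds all four lanes */
    _mm_storeu_ps(c, vc);

    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);   /* 6.0 8.0 10.0 12.0 */
    return 0;
}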
General-purpose multiprocessor systems (MIMD computers) can be divided into
two types:
Shared memory multiprocessor systems are a direct extension of the single processor
system. In a single processor system, the processor accesses a main memory for program
instructions and data. In a shared memory multiprocessor system, multiple processors
are arranged to have access to a single main memory. This is a very convenient
configuration from a programming perspective, as data generated by one processor
and stored in the main memory is immediately accessible by other processors. As in
a single processor system, cache memory is present to reduce the need to access main
memory continually, commonly with two or three levels of cache. However,
it can be difficult to scale shared memory systems for a large number of processors
because the connection to the common memory becomes a bottleneck. There are sev-
eral possible programming models for a shared memory system. Mostly, they revolve
around using threads, which are independent parallel code sequences within a pro-
cess. We shall look at the thread programming model in more detail later. Multicore
processors, at least those with a small number of cores, usually employ a shared memory
configuration.
Distributed memory is an alternative to shared memory, especially for larger sys-
tems. In a distributed memory system, each processor has its own main memory and
Apart from designs that process a single instruction sequence, increased performance
can be achieved by processing instructions from different program sequences,
switching from one sequence to another. Each sequence is a thread, and the technique
is known as multithreading. The switching between threads might occur after each
instruction (fine-grain multithreading) or when a thread is blocked (coarse-grain mul-
tithreading). Fine-grain multithreading suggests that each thread sequence will need
its own register file. Interleaving instructions increases the distance of related instruc-
tions in pipelines and reduces the effects of instruction dependencies. With the advent of
multiple-issue processors that have multiple execution units, these execution units
can be utilized more fully by processing multiple threads. Such multithreaded processor
designs are called simultaneous multithreading (SMT) because the instructions
of different threads are executed simultaneously using the multiple execution
units. Intel calls its version hyper-threading and introduced it in versions of the
Pentium 4. Intel limited its simultaneous multithreading design to two threads. Performance
gains from simultaneous multithreading are somewhat limited, depending
upon the application and processor, and are perhaps in the region of 10%–30%.
Up to the early 2000s, the approach taken by manufacturers such as Intel was
to design a highly complex superscalar processor with techniques for simultaneous
operation, coupled with the use of state-of-the-art fabrication technology to obtain the
highest chip density and clock frequency. However, this approach was coming to an
end. With clock frequencies reaching almost 4 GHz, technology was not going to
provide a continual path upward because of the laws of physics and increasing power
consumption that comes with increasing clock frequency and transistor count.
Power consumption of a chip has a static component (leakage currents) and a
dynamic component due to switching. Dynamic power consumption is proportional
to the clock frequency, the square of the voltage switched, and the capacitive load
(Patterson and Hennessy 2009, p. 39). Therefore, each increase in clock frequency
will directly increase the power consumption. Voltages have been reduced as a neces-
sary part of decreased feature sizes of the fabrication technology, reducing the power
consumption. As the feature size of the chip decreases, the static power becomes
more significant and can be 40% of the total power (Asanovic et al. 2006). By the
mid-2000s, it had become increasingly difficult to limit the power consumption while
improving clock frequencies and performance. Patterson calls this the power wall.
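A back-of-the-envelope calculation makes the wall concrete. Taking the standard relation P_dyn ≈ A·C·V²·f (activity factor A, switched capacitance C, supply voltage V, clock frequency f), the sketch below compares two invented operating points; the specific numbers are for illustration only:

#include <stdio.h>

/* Dynamic power: activity factor * capacitance * voltage^2 * frequency. */
static double dynamic_power(double a, double c, double v, double f) {
    return a * c * v * v * f;
}

int main(void) {
    /* Illustrative operating points, not measured values. */
    double p1 = dynamic_power(0.1, 1e-9, 1.2, 2.0e9);   /* 2 GHz at 1.2 V */
    double p2 = dynamic_power(0.1, 1e-9, 1.4, 4.0e9);   /* 4 GHz, assumed to need 1.4 V */
    printf("Relative dynamic power: %.1fx\n", p2 / p1); /* about 2.7x for 2x the clock */
    return 0;
}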
Wulf and McKee (1995) identified the memory wall as caused by the increasing
difference between the processor speed and the memory access times. Semiconductor
main memory has not kept up with the increasing speed of processors. Some of this
can be alleviated by the use of caches and often nowadays multilevel caches, but
still it poses a major obstacle. In addition, the instruction-level parallelism wall is
caused by the increasing difficulty of exploiting more parallelism within an instruction
sequence. These walls lead to Patterson's "brick wall":
Power wall + Memory wall + Instruction-Level wall = Brick wall
for a sequential processor. Hence, enter the multicore approach for using the ever-
increasing number of transistors on a chip. Moore’s law originally predicted that the
number of transistors on an integrated circuit chip would double approximately every
year, later revised to every two years, and sometimes quoted as doubling every 18
months. Now, one prediction is that the number of cores will double every two years, or
with every fabrication technology generation.
FIGURE 2.2: A symmetric multicore design in which processor cores, each with separate L1 instruction and data caches, share an on-chip L2 cache and a common main memory, with external connections off the chip.

FIGURE 2.3: A multicore arrangement in which cores, each with their own L1 and L2 caches, are interconnected by buses in a 2D structure on the chip, with external connections at the edges.
This approach offers the possibility of fabricating more cores onto the chip, although
each core might not be as powerful as a high-performance complex superscalar core.
Figure 2.3 shows an arrangement using a 2D bus structure to interconnect the cores.
Using a large number of less complex, lower-performance, lower-power cores is often
targeted toward a particular market. An example is the picoChip, designed for wireless
infrastructure and having 250–300 DSP cores. Another example is the TILE64, with 64
cores arranged in a 2D array for networking and digital video processing.
[Figure: an asymmetric multicore design built around a state-of-the-art superscalar processor core.]
resources (instruction pointer, stack, heap, files, etc.). An operating system will sched-
ule processes for execution, time-sharing the hardware among processes waiting to
be executed. This approach enables processes that are stalled because they are waiting
for some event, such as an I/O transfer, to be descheduled and another process
to be awakened and run on the same processor. On a processor capable of executing
only one program sequence at a time, only a single process can be executing at
any moment. By switching from one process to another, the system will appear to
be executing multiple programs simultaneously, although it is not actually doing so. To
differentiate this from actual simultaneous operation, we say the processes are executing
concurrently. On a multiprocessor system, multiple processes could be executing, one on
each processor. In that case, we could get true simultaneous execution of the processes.
Using the process as the basic unit of simultaneous operation is appropriate
for complete but separate programs running on separate computers, for example, in
message-passing clusters. The programs might communicate and form a larger par-
allel program.
A process can be divided into threads, separate sequences of code that are intended
to be executed concurrently or simultaneously. Being inside the process, these threads
share the resources of the process such as memory allocation and files, but each thread
needs its own instruction pointer and stack. Creating threads will have much less over-
head than creating processes. On a processor capable of executing only one program
sequence at a time, threads in a process would be time-shared on the processor, just as
processes might be time-shared. However, while processes are completely separate
programs doing different tasks, threads within a process are doing tasks associated
with the purpose of the process.
The concepts of both processes and threads are embodied in operating systems
to enable concurrent operation. Operating system threads are called kernel threads.
By using such threads, the operating system can switch out threads for other threads
when they are stalled. Commonly, threads are assigned a priority number to aid appro-
priate scheduling, with higher priority threads taking precedence over lower priority
threads. Note in the thread model, a process will consist of at least one thread.
The thread programming model has been adopted for programming shared mem-
ory multiprocessor systems, including multicore systems. There are two common
ways a programmer might create a thread-based program: by calling a thread API
directly, or by using higher-level language constructs or directives that the compiler
translates into threads.
With thread APIs, it will be up to the user to determine exactly what program
sequences should be present in each thread and call a “thread create” routine to cre-
ate the thread from within the main thread or sub-threads. Using higher-level lan-
guage constructs/directives is easier for the programmer, although it is still up to the
programmer to identify what needs to be parallelized into threads. The programmer
does not have as much control of the threads and still may need to call some thread
variable if not set and enter the critical section. If the lock is already set (i.e., another
thread is in the critical section), the thread has to wait for the critical section to be
free. A spin-lock simply keeps reading the lock variable in a tight loop until the lock
variable indicates an unlocked critical section, that is,

    while (lock == 1) ;   /* spin: do nothing while the lock is set */
    lock = 1;             /* set the lock and enter the             */
      critical section
    lock = 0;             /* reset the lock on leaving              */

A spin lock occupies the thread but is acceptable if it is likely that the critical section
will be available soon; otherwise, a more efficient solution is needed. When a thread
leaves the critical section, it must reset the lock variable to 0.
It may be that more than one thread will reach the critical section at the same
instant. We must ensure that the actions of each thread in reading and setting the lock
variable are not interleaved as this could result in multiple threads setting the lock
variable and entering the critical section together. Preventing interleaving is achieved
by having instructions that operate in an atomic fashion, that is, without interruption
by other processors. Most processors are provided with suitable atomic instructions.
The Intel Pentium processors can make certain instructions, including the bit test-and-set
instruction, atomic by prefixing the instruction with a LOCK prefix instruction.
Apart from using atomic instructions, there are also quite old software algorithms that
achieve the same effect, such as Dekker's algorithm, but nowadays one mostly relies
on processor hardware support rather than on software algorithms.
Thread APIs provide support for critical sections. Locks are implemented in so-called
mutually exclusive lock variables, mutexes, with routines to lock and unlock
named mutexes. Another approach is to use semaphores. Whereas a lock variable can
only be 0 or 1, a semaphore, s, is a positive integer operated upon by two operations,
P(s) and V(s). P(s) will wait until s is greater than 0 and then decrement s by 1 and
allow the thread to continue. P(s) is used at the beginning of a critical section. V(s) will
increment s by 1 to release one of the waiting threads. V(s) is used at the end of a
critical section. A binary semaphore limits the value of s to 0 or 1 and then behaves
in a very similar fashion to a lock, except that semaphores should have a built-in
algorithm in V(s) to select waiting threads in a fair manner, whereas locks may rely upon
additional code.
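A minimal Pthreads sketch of the mutex approach (the shared counter and the thread count are invented for illustration):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                                  /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* mutex guarding it */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);     /* enter the critical section */
        counter++;
        pthread_mutex_unlock(&lock);   /* leave the critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 4000000, because updates never interleave */
    return 0;
}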
Locks and semaphores are very low-level primitives and can make the program-
ming error-prone. Rather than use explicit lock variables, locks can be associated with
objects in an object-oriented language, and this appears in Java. A locking mechanism
can be implicit in so-called monitor routines that can only be called by one thread at
a time. Java has the synchronized keyword to be used on methods or code sequences
to lock them with the associated object lock. In .NET, a section of code can also be
protected against more than one thread executing it by the lock keyword.
Threads often need to synchronize between themselves. For example, one thread
might need to wait for another thread to create some new data, which it will then
consume. A solution is to implement a signaling mechanism in which one thread
sends a signal to another thread when an event or condition occurs. Pthreads uses
so-called condition variables with a signaling mechanism to indicate that a waiting
condition has been achieved. A full treatment of thread APIs in a C/C++ environment
is found in Chapter 4. A full treatment of threads in a Java/.NET environment is found
in Chapter 5.
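A minimal Pthreads sketch of this signaling mechanism, with one producer and one consumer (the data value and flag are invented for illustration):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;
static int data = 0, available = 0;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    data = 42;                      /* create the new data */
    available = 1;
    pthread_cond_signal(&ready);    /* signal the waiting thread */
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    while (!available)              /* wait until the condition is achieved */
        pthread_cond_wait(&ready, &m);
    printf("consumed %d\n", data);
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}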
2.6.3 OpenMP
OpenMP was developed in the 1990s as a standard for creating shared memory
thread-based parallel programs. OpenMP enables the programmer to specify sections
of code that are to be executed in parallel. This is done with compiler directives. The
compiler is then responsible for creating the individual thread sequences. In addition
to a small set of compiler directives, OpenMP has a few supporting routines and
environment variables. The number of threads available is set either by a clause in a
compiler directive, an explicit routine, or an environment variable. OpenMP is very
easy to use but has limitations. Chapter 6 explores OpenMP in detail.
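As a taste of what Chapter 6 covers, a single directive is enough to parallelize a loop; the loop body here is an arbitrary illustration:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* The compiler creates the thread team; the directive marks the parallel loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }
    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}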
parts—they are completely independent of each other. Geoffrey Fox (Wilson 1995)
called this type of computation embarrassingly parallel, a term that has found wide
acceptance, although the phrase naturally parallel is perhaps more apt (Wilkinson
and Allen 2005). The implication of embarrassingly parallel computations is that the
programmer can immediately see how to divide the work up without interactions
between the parts. Usually, there is some interaction at the beginning to start the
parts or when initial data is sent to the separate parts, and there is usually also interaction
at the end to collect results, but if these are the only interactions, we would
still describe the problem as embarrassingly parallel. A number of important applica-
tions are embarrassingly parallel. In low-level image processing, the picture elements
(pixels) of an image can be manipulated simultaneously. Often, all that is needed for
each pixel is the initial image or just values of neighboring pixels of the image. Monte
Carlo methods can be embarrassingly parallel. They use random selections in numer-
ical calculations. These random selections should be independent of each other, and
the calculation based upon them can be done simultaneously. Monte Carlo methods
can be used for numerical integration and are especially powerful for problems that
cannot be solved easily otherwise. A critical issue for Monte Carlo solutions is the
generation of the random selections in a way that can be done in parallel. Traditional
random number generators create pseudorandom sequences based upon previously
generated numbers and hence are intrinsically sequential. SPRNG (the Scalable Parallel
Random Number Generators library) is a library of parallel random number generators
that addresses the issue.
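A minimal OpenMP sketch of an embarrassingly parallel Monte Carlo calculation, estimating pi; rand_r with per-thread seeds stands in here for a proper parallel generator such as SPRNG:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long trials = 10000000L;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + omp_get_thread_num();   /* independent seed per thread */
        #pragma omp for
        for (long i = 0; i < trials; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;                /* inside the quarter circle */
        }
    }
    printf("pi is approximately %f\n", 4.0 * (double)hits / trials);
    return 0;
}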
The embarrassingly parallel classification is usually limited to solving a single
problem. There are also situations in which there is absolutely no interaction, for example,
when a complete problem has to be solved repeatedly with different arguments
(a parameter sweep). Such situations are ideal for using multiple processors.
2.7.1.2 Pipelining
We have already mentioned the use of a pipeline in a processor to achieve higher
execution speed. The same technique can be used to construct parallel programs.
Many problems are constructed as a series of tasks that have to be done in a sequence.
This is the basis of normal sequential programming. Multiple processors could be
used in a pipeline, one for each task. The output of one task is passed onto the input
of the next task in a pipeline. A pipeline can be compared to an assembly line in
a factory in which products are assembled by adding parts as they pass down the
assembly line. Automobiles are assembled in that way and very efficiently. Multiple
automobiles can be assembled at the same time, although only one comes off a single
assembly line at a time. Pipelining as a parallel programming strategy is limited to
certain applications. A pipeline can be used effectively in parallel processing if the
problem can be decomposed into a series of tasks, and there are multiple instances
of the problem that need to be solved (cf. assembling multiple automobiles).
Pipelining can also be used effectively to process a series of data items, each requiring
multiple sequential operations to be performed upon it. In that case, the data
items are fed down the pipeline.
2.7.1.4 Workpool
A workpool describes a collection of tasks to be handed out to computing resources
for execution on a demand basis. It is a very effective way of balancing the load. It can
take into account differences in the speed of the compute resources in completing tasks
and also situations in which the number of tasks varies. The most basic way to construct
a workpool is to start with the tasks that need to be performed in a workpool task queue.
Compute resources are then given tasks to perform from this queue. When a compute
resource finishes its task, it requests a further task. It may be that processing a task will
generate new tasks.
These tasks might be returned by the compute resources and placed in the workpool
task queue for redistribution. Workpools can be centralized or distributed. A single
workpool might hold all the pending tasks, or there could be multiple workpools at
different sites to distribute the communication. A fully distributed workpool would
have a task queue in each compute resource. In the receiver-initiated approach, other
compute resources request tasks from compute resources, typically when they have
2.8 Summary
The move to multicore processors has come after a very long history of computer
design. The primary motive is to increase the speed of the system. Increases
in the speed of a single processor hit the famous Patterson "brick" wall, which
describes the combination of not being able to cope with the increase in power dissipation,
the limited prospects for any further instruction-level parallelism, and limits in main
memory speeds. Hence, manufacturers have moved from developing a single high-performance
processor on one die to placing multiple processor cores on one die.
This development began with two cores on one die and continues with more cores
on one die. The cores may individually be less powerful than might be possible if
the whole die were dedicated to a single high-performance superscalar processor, but
collectively, they offer significantly more computational resources. The problem now
is to use these computational resources effectively. For that, we draw upon the work
of the parallel programming community.
References
Amdahl, G. 1967. Validity of the single-processor approach to achieving large-scale
computing capabilities. Proc 1967 AFIPS, vol. 30, New York, p. 483.
Asanovic, K., R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A.
Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. 2006.
The landscape of parallel computing research: A view from Berkeley. Uni-
versity of California at Berkeley, technical report no. UCB/EECS-2006-183,
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
Fundamentals of Multicore Hardware and Parallel Programming 29
Flynn, M. J. 1966. Very high speed computing systems. Proceedings of the IEEE
54(12):1901–1909.
Gill, S. 1958. Parallel programming. The Computer Journal 1(1): 2–10.
Gustafson, J. L. 1988. Reevaluating Amdahl's law. Communications of the ACM 31(5):
532–533.
Hill, M. D. and M. R. Marty. 2008. Amdahl’s law in the multicore era. IEEE Computer
41(7):33–38.
Patterson, D. A. and J. L. Hennessy. 2009. Computer Organization and Design: The
Hardware/Software Interface, 4th edn. Burlington, MA: Morgan Kaufmann.
Villalobos, J. F. and B. Wilkinson. 2008. Latency hiding by redundant processing:
A technique for grid-enabled, iterative, synchronous parallel programs. 15th
Mardi Gras Conference, January 30, Baton Rouge, LA.
Wilkinson, B. 2010. Grid Computing: Techniques and Applications. Boca Raton, FL:
Chapman & Hall/CRC Computational Science Series.
Wilkinson, B. and M. Allen. 2005. Parallel Programming: Techniques and Applica-
tions Using Networked Workstations and Parallel Computers, 2nd edn. Upper
Saddle River, NJ: Prentice Hall.
Wilson, G. V. 1995. Practical Parallel Programming. Cambridge, MA: MIT Press.
Woo, D. H. and H.-H. S. Lee. 2008. Extending Amdahl’s law for energy-efficient
computing in many-core era. IEEE Computer 41(12):24–31.
Wulf, W. and S. McKee. 1995. Hitting the memory wall: Implications of the obvious.
ACM SIGARCH Computer Architecture News 23(1):20–24.
Chapter 3
Parallel Design Patterns
Tim Mattson
parallel programming; and those that did tended to use low-level message passing
libraries [MPI] that emphasized portability and control over elegance.
New high-level languages were not enough to solve the parallel programming prob-
lem in the 1980s and 1990s, and there is no reason to believe it will be different this
time as we help general purpose programmers adopt parallel programming.
It is my hypothesis that the key to solving this problem is to focus on how experi-
enced programmers think about parallel programming. Experienced parallel
programmers, once they understand how the concurrency in their problem supports
parallel execution, can use any sufficiently complete parallel programming language
to create parallel software. Therefore, to solve the parallel programming problem for
mainstream programmers, we need to capture how expert parallel programmers think
and provide that understanding to mainstream programmers. In other words, we do
not need new languages or sophisticated tools (though they help); we need to help
mainstream programmers learn to “think parallel.”
Learning to reason about parallel algorithms is a daunting task. Fortunately, we
have a few decades of experience to draw on. All we need to do is distill this collective
wisdom into a form that we can put down in writing and communicate to people new
to parallel computing. We believe design patterns are an effective tool to help us
accomplish this task.
• Task parallelism
• Data parallelism
• Pipeline
• Geometric decomposition
The lower-level implementation strategy patterns describe how that algorithm strat-
egy can be supported in software.
• Loop-level parallelism
• Fork-join
• Master-worker
The complete text for these patterns goes well beyond the scope of this survey. Instead,
we use an abbreviated pattern-format that exposes the most essential elements of the
patterns. For each pattern, we provide a statement of when to use it, details of how it
works, known uses, and related patterns.
To solve any problem in parallel, you must decompose the problem into concurrent
tasks AND decompose the data to manage data access conflicts between tasks.
The goal of the algorithm strategy patterns is to define this dual task/data decom-
position. They start with a conceptual understanding of a problem and how it can be
executed in parallel. They finish with a decomposition of the problem into concurrent
tasks and data.
3.3.1.1 Task Parallelism
Use When The concurrency in a problem may present itself directly in terms of
a collection of distinct tasks. The definition of the “tasks” is diverse. They can be a
set of files you need to input and process. They can be the rays of light between a
camera and a light source in a ray tracing problem. Or they can be a set of operations
to be applied to a distinct partition of a graph. In each case, the tasks addressed by
this pattern do not depend on each other. Often the independent tasks are a natural
consequence of the problem definition. In other cases where this pattern can be used,
a transformation of the data or a post-processing step to address dependencies can
create the independent tasks needed by this pattern.
Details Given a set of N tasks

    ti ∈ T(N)

the task parallel pattern schedules each task, ti, for execution by a collection of
processing elements. There are three key steps in applying the task parallelism pattern
to a problem: defining the tasks, managing the dependencies between them, and
scheduling the tasks for execution on the processing elements.
For an effective application of this pattern, there are two criteria the tasks must meet.
First, there must be more tasks (often many more) than processing elements in the
target platform. Second, the individual tasks must be sufficiently compute-intensive
in order to offset the overheads in creating and managing them.
The dependencies between tasks play a key role in how this pattern is applied. In
general, the tasks need to be mostly independent from each other. When tasks depend
on each other in complicated ways, it is often better to use an approach that expresses
the concurrency directly in terms of the data.
There are two common cases encountered when working with the dependencies in
a task parallel problem. In the first case, there are no dependencies. In this case, the
tasks can execute completely independently from each other. This is often called an
“embarrassingly parallel” algorithm. In the second case, the tasks can be transformed
to separate the management of the dependencies from the execution of the tasks. Once
all the tasks have completed, the dependencies are handled as a distinct phase of the
computation. These are called separable dependencies.
A key technique that arises in separable-dependency algorithms is “data replica-
tion and reduction.” Data central to computation over a set of tasks is replicated with
one copy per computational agent working on the tasks. Each agent proceeds inde-
pendently, accumulating results into its own copy of the data. When all the tasks are
complete, the agents cooperate to combine (or reduce) their own copies of the data
to produce the final result.
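A hedged C/OpenMP sketch of this replicate-and-reduce technique: each thread accumulates into its own copy of the result while tasks are handed out on demand, and the copies are combined at the end (the task function is invented for illustration):

#include <stdio.h>
#include <omp.h>

#define NTASKS 1000

/* Illustrative independent task: returns a partial result for task i. */
static double run_task(int i) {
    return 0.001 * i;
}

int main(void) {
    double total = 0.0;

    #pragma omp parallel
    {
        double local = 0.0;                 /* replicated copy, one per computational agent */
        #pragma omp for schedule(dynamic)   /* tasks scheduled onto the processing elements */
        for (int i = 0; i < NTASKS; i++)
            local += run_task(i);

        #pragma omp atomic                  /* reduction: combine the per-thread copies */
        total += local;
    }
    printf("total = %f\n", total);
    return 0;
}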
Known Uses This pattern, particularly for the embarrassingly parallel case, may
be the most commonly used pattern in the history of parallel computing. Rendering
frames in a film, ray tracing, and drug docking studies are just a few of the countless
instances of this pattern in action. MapReduce [Dean04] can be viewed as a
“separable-dependencies” variant of the task parallel pattern. The computations of
short-range forces in molecular dynamics problems often make use of the separable-
dependencies version of the task parallel pattern.
3.3.1.2 Data Parallelism
Use When Problems that utilize this pattern are defined in terms of data with a
regular structure that can be exploited in designing a concurrent algorithm. Graph-
ics algorithms are perhaps the most common example since in many cases, a similar
(if not identical) operation is applied to each pixel in an image. Physics simulations
based on structured grids, a common class of problems in scientific computing but
also in computer games, are often addressed with a data parallel pattern. The shared
feature of these problems is that each member of a data structure can be updated
with essentially the same instructions with few (if any) dependencies between them.
The dependencies are of course the key challenge since they too must be managed
in essentially the same way for each data element. A good example of a data paral-
lel pattern with dependencies is the relaxation algorithm used in partial differential
equation solvers for periodic problem domains. In these problems, a stencil is used to
define how each point in the domain is updated. This defines a dependency between
points in the domain, but it is the same for each point and hence (other than at the
boundaries) can be handled naturally with a data parallel algorithm.
Details This pattern is trivial in the case of simple vector operations. For example,
to add two equal-length vectors together in parallel, we can apply a single function
(the addition operator) to each pair of vector elements and place the result in the
corresponding element of a third vector. Using the vector type in OpenCL, for instance,
we can define four-element vectors and add them with a single expression:
float4 a = (float4) (2.0, 3.0, 4.0, 5.0);
float4 b = (float4) (3.0, 4.0, 5.0, 2.0);
float4 c;
c = a + b;   /* a single vector expression adds all four elements */
The advantage of vector operations is they map directly onto the vector units found
in many microprocessors.
The challenge is how do we generalize this trivial case to more complex situations?
We do this by defining an abstract index space. The data in the problem is aligned to
this index space. Then the concurrency is expressed in terms of the data by running
a single stream of operations for each point in the index space.
For example, in an image processing problem applied to an N by M image, we
define a rectangular “index space” as a grid of dimension N by M. We map our image,
A, onto this index space as well as any other data structures involved in the compu-
tation (in this case, a filter array). We then express computations on the image onto
points in the index space and define the algorithm as
Note that the FFT operations involve complex data movement and operations over
collections of points. But the data parallelism pattern is still honored since these com-
plex operations are the same on each subset of data elements. As suggested by this
example, the data parallelism pattern can be applied to a wide range of problems, not
just those with simple element-wise updates over aligned data.
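A minimal OpenCL kernel sketch of this index-space formulation for an N-by-M image; the kernel name and the brightness operation are invented, and the host code that creates the 2D index space is omitted:

/* One work-item runs at each point (x, y) of the N-by-M index space. */
__kernel void brighten(__global float *A, const int N, const int M,
                       const float gain)
{
    int x = get_global_id(0);    /* column in the index space */
    int y = get_global_id(1);    /* row in the index space */
    if (x < N && y < M)
        A[y * N + x] = gain * A[y * N + x];   /* the same operation at every point */
}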
Known Uses This pattern is heavily used in graphics, so much so that GPU hard-
ware is specifically designed to support problems that use the data parallelism pattern.
Image processing, partial differential equation solvers, and linear algebra problems
also map well onto this pattern. More complex applications of this pattern can be
found in Chapter 7, Scalable Manycore Computing with CUDA.
Related Patterns The geometric decomposition pattern is often described as a form
of data parallelism. It is not quite the same as this pattern in that the functions applied
to each tile in a geometric decomposition pattern usually differ between tiles (i.e., it
is not strictly a single function applied to each of the tiles). But it shares the general
idea of defining parallelism in terms of concurrent updates to different elements of a
larger data structure.
The data parallelism pattern can be implemented in software with the SIMD, loop
parallelism, or even SPMD patterns, the choice often being driven by the target
platform.
A data parallel pattern can often be transformed into a task parallel pattern. If the
single function, t, is applied to the members of a set of data elements, di, then we can
define

    ti = t(di)
and we can address the problem using the task parallelism pattern. Note, however, that
the converse relationship does not hold, that is, you cannot generally transform a task
parallel pattern into a data parallel pattern. It is for this reason that the task parallelism
patterns are considered to be the more general patterns for parallel programming.
3.3.1.3 Recursive Splitting
[Figure: recursive splitting of the integer string 3253567622459853. Split: 32535676 | 22459853,
then 3253 5676 2245 9853, then 32 53 56 76 22 45 98 53. Compute: 5 8 11 13 4 9 17 8.
Recombine: 13 24 13 25, then 37 38, then 75.]
The figure illustrates the use of this pattern to reduce
a string of integers into a single value using summation. Subproblems are generated at
each level by splitting the list into two smaller lists. The subproblem is small enough
to solve directly once there are a pair of integers at which point the “operate phase”
occurs. Finally, the subproblems are recombined (using summations) to generate the
global solution.
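As a concrete sketch of this structure (ours, not a listing from the text), the following
fragment reduces the same sixteen integers shown in the figure using the explicit task
construct of OpenMP 3.0, which is mentioned under Known Uses below; the array
contents and the cutoff of two elements are taken from the figure.

#include <stdio.h>

long sum(const int *a, int n)
{
    if (n <= 2) {                        // small enough: solve directly
        long s = 0;
        for (int i = 0; i < n; ++i) s += a[i];
        return s;
    }
    long left, right;
    #pragma omp task shared(left)        // split: spawn a task for each half
    left = sum(a, n/2);
    #pragma omp task shared(right)
    right = sum(a + n/2, n - n/2);
    #pragma omp taskwait                 // wait for both subproblems
    return left + right;                 // recombine
}

int main()
{
    int a[16] = {3,2,5,3,5,6,7,6,2,2,4,5,9,8,5,3};
    long total;
    #pragma omp parallel
    #pragma omp single                   // one thread starts the recursion
    total = sum(a, 16);
    printf("%ld\n", total);              // prints 75, as in the figure
}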
Known Uses This pattern is heavily used in optimization and decision support
problems. Programmers using Cilk [Blumofe95] and the explicit task construct in
OpenMP 3.0 (see Chapter 6) make heavy use of this pattern as well.
Related Patterns Implementations of this pattern in software often use the fork-
join pattern. To avoid the high costs of explicitly spawning and destroying threads,
implementations of this pattern build schedulers to manage tasks forked during the
recursion and execute them using a thread pool of some sort.
3.3.1.4 Pipeline
Concurrency produced by passing multiple data elements through a sequence of
computational stages.
Use When Concurrency can be difficult to extract from a problem. Sometimes, a
problem is composed of a sequence of tasks that must execute in a fixed order. It
appears in such cases that there just is not a way to make the program run in paral-
lel. When this occurs, however, it is sometimes the case that there are multiple data
elements that must pass through the sequence of tasks. While each of the tasks must
complete in order, it may be possible to run each data element as an independent flow
through the collection of tasks. This only works if the tasks do not hold any state,
i.e., if each task takes input data, computes a result based on the data, and then passes
the result to the next task.
When the computation starts, only the first stage, X1, is active and there is no concur-
rency. As X1 completes, it passes its result to X2, which then starts its own compu-
tation while X1 picks up the next data element. This continues until all the pipeline
stages are busy.
The amount of concurrency available is equal to the number of stages, and this is
only available once the pipeline has been filled. The work prior to filling the pipeline
and the diminishing work as the pipeline drains constrains the amount of concurrency
these problems can exploit.
This pattern, however, is extremely important. Given that each data element must
pass through each stage in order, the options for exploiting concurrency are severely
limited in these problems (unless each of the stages can be parallelized internally).
Known Uses This pattern is commonly used in digital signal processing and image
processing applications where each unit of a larger multicomponent data set is passed
through a sequence of digital filters.
Related Patterns This pattern is also known as the pipe-and-filter pattern [Shaw95].
This emphasizes that the stages are ideally stateless filters.
The pipeline pattern can be supported by lower-level patterns in a number of ways.
The key is to create separate units of execution (i.e., threads or processes) and man-
age data movement between them. This can be set up with the SPMD or the fork-join
patterns. The challenge is to safely manage movement of data between stages and
to be prepared for cases where there is an imbalance between stages. In this case,
the channels between stages need to be buffered (perhaps using a shared queue data
structure).
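To make the buffered-channel idea concrete, here is a small two-stage sketch (our own
illustration, with invented stage computations): the first stage pushes results into a
queue protected by a mutex and condition variable, and the second stage consumes
them; a sentinel value marks the end of the stream.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> channel;                  // buffered channel between the stages
std::mutex ch_mtx;
std::condition_variable ch_nonempty;
const int DONE = -1;                      // sentinel marking the end of the stream

void stage1(int n)                        // first (producer) stage
{
    for (int i = 0; i <= n; ++i) {
        int out = (i == n) ? DONE : i * i;      // some per-element computation
        std::lock_guard<std::mutex> g(ch_mtx);
        channel.push(out);
        ch_nonempty.notify_one();
    }
}

void stage2()                             // second (consumer) stage
{
    for (;;) {
        std::unique_lock<std::mutex> ul(ch_mtx);
        while (channel.empty()) ch_nonempty.wait(ul);
        int v = channel.front(); channel.pop();
        ul.unlock();
        if (v == DONE) break;
        // ... process v in the second stage ...
    }
}

int main()
{
    std::thread t1(stage1, 100), t2(stage2);
    t1.join(); t2.join();
}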
3.3.1.5 Geometric Decomposition
Use When This pattern is used when the data associated with a program plays a
dominant role in how we understand the problem. For example, in an image process-
ing algorithm where a filter is applied to neighborhoods of pixels in the image, the
problem is best defined in terms of the image itself. In cases such as these, it is often
best to define the concurrency in terms of the data and how it is decomposed into
relatively independent blocks.
Details Define the concurrent tasks in terms of the data decomposition. The data
breaks down into blocks. Tasks update blocks of data. When data blocks share bound-
aries or otherwise depend on each other, the algorithm must take that into account;
often adding temporary storage around the boundaries (so-called “ghost cells”) to
hold nonlocal data. In other words, the algorithm is often broken down into three
segments: (1) communicate boundary data into the ghost cells, (2) update the interior
of each block, and (3) update the boundary regions of each block.
The split into two update phases, one for interior points and the other for boundary
regions, is done so you can overlap communication and computation steps.
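A minimal shared-memory sketch of these three segments (our own, with invented
sizes and a trivial relaxation formula) is shown below for a 1D domain split into two
blocks; each block keeps ghost cells at positions 0 and n+1, and the "communication"
step is simply a copy of the neighboring block's edge value.

#include <functional>
#include <thread>
#include <vector>

const int N = 8;                                       // points per block
std::vector<double> blk0(N + 2, 1.0), blk1(N + 2, 2.0);
std::vector<double> new0(N + 2), new1(N + 2);

void relax_interior(std::vector<double>& u, std::vector<double>& un)
{
    for (int i = 2; i <= N - 1; ++i)                   // interior: no ghost data needed
        un[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

int main()
{
    // (1) fill the ghost cells from the neighboring block
    //     (the outermost ghost cells keep their initial values: fixed boundary)
    blk0[N + 1] = blk1[1];
    blk1[0]     = blk0[N];
    // (2) update the interior of each block concurrently, one thread per block
    std::thread t0(relax_interior, std::ref(blk0), std::ref(new0));
    std::thread t1(relax_interior, std::ref(blk1), std::ref(new1));
    t0.join(); t1.join();
    // (3) update the boundary points of each block using the ghost cells
    new0[1] = 0.5 * (blk0[0] + blk0[2]);
    new0[N] = 0.5 * (blk0[N - 1] + blk0[N + 1]);
    new1[1] = 0.5 * (blk1[0] + blk1[2]);
    new1[N] = 0.5 * (blk1[N - 1] + blk1[N + 1]);
}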
Known Uses This is one of the most commonly used patterns in scientific comput-
ing. Most partial differential equation solvers and linear algebra programs are paral-
lelized using this pattern. For example, in a stencil computation, each point in an array
is replaced with a value computed from elements in its neighborhood. If the domain
is decomposed into tiles, communication is only required at the boundaries of each
tile. All the updates on the interior can be done concurrently. If communication of
data on the boundaries of tiles can be carried out during computations in the interior,
very high levels of concurrency can be realized using this pattern.
Related Patterns This pattern is closely related to the data parallelism pattern. In
fact, for those cases where each tile is processed by the same task, this problem trans-
forms into an instance of the data parallelism pattern.
This pattern is commonly implemented with the SPMD and loop-level parallelism
patterns. In many cases, it can be implemented with the SIMD pattern, but inefficien-
cies arise since boundary conditions may require masking out subsets of processing
elements for portions of the computation (for example, at the boundaries of a domain).
3.3.2.1 SPMD
A single program is written that is executed by multiple threads or processes. They
use the ID of each thread or process to select different pathways through the code or
choose which elements of data structures to operate upon.
Use When Concurrent tasks run the same code but with different subsets of key data
structures or occasional branches to execute different code on different processes or
threads.
You intend to use a message passing programming model (i.e., MPI) and treat the
system as a distributed memory computer.
Details The SPMD pattern is possibly the most commonly used pattern in the his-
tory of parallel supercomputing. The code running on each process or thread is the
same, built from a single program. Variation between tasks is handled through expres-
sions based on the thread or process ID. The variety of SPMD programs is too great
to summarize here. We will focus on just a few cases.
In the first case, loop iterations are assigned to different tasks based on the process
ID. Assume for this (and following) examples that each thread or process is labeled
by a rank ranging from zero to the number of threads or processes minus one.
ID = 0, 1, 2, ... (Num_procs − 1)
We can use the techniques described in the “loop parallelism pattern” to create
independent loop iterations and then change the loop to split up iterations between
processes or threads.
for(i=ID; i<Num_iterations; i=i+Num_procs) {...}
This approach divides the work between the threads or processes executing the
program, but it may lead to programs that do not effectively reuse data within the
caches. A better approach is to define blocks of contiguous iterations. For example,
working with the geometric decomposition pattern, we can break up the columns of
a matrix into distinct sets based on the ID and num_procs. For example, given a
square matrix of order Norder, we can break the columns of the matrix into sets, one
of which is assigned to each process or thread.
IStart = ID * Norder/num_procs;
ILast = (ID + 1) * Norder/num_procs;
if (ID == (num_procs - 1)) ILast = Norder;
for (i = IStart; i < ILast; i++) { ... }
Programming models designed for general purpose programming of GPUs (see
Chapter 7, GPU programming or [39]) provide another common instance of the SPMD
programming model. Central to these programming models is a data parallel mode
of operation. The software framework defines an index space. A kernel is launched
for each point in the index space. The kernel follows the common SPMD model; it
queries the system to discover its ID and uses this ID to (1) select which point in the
index space it will handle and (2) choose a path through the code. For example, the
following OpenCL kernel would be used to carry out a vector sum:
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
A keyword identifies this function as a kernel. The arguments are listed as “__global”
to indicate to the compiler that they will be imported from the host onto the compute
device. The SPMD pattern can be seen in the body of the function. The kernel queries
its ID “gid” and then uses it to select which elements of the array to sum. Notice that
in this case, each kernel will execute a single stream of instructions, so in this form,
the pattern is identical to the SIMD pattern. We consider this an SPMD pattern, how-
ever, since a kernel can branch based on the ID, causing each instance of a kernel’s
execution to execute a significantly different set of instructions from the body of the
kernel function.
Known Uses This pattern is heavily used by MPI programmers and OpenMP pro-
grammers wanting more control over how data is managed between different parts
of the program. It is becoming increasingly important as a key pattern in GPGPU
programming.
Related Patterns The SPMD pattern is very general and is used to support soft-
ware designs based on the data parallel, geometric decomposition, and task parallel
patterns.
As mentioned earlier, this pattern is very similar to the SIMD pattern. In fact, the
SIMD pattern can be considered a simple case of the more general SPMD pattern.
3.3.2.2 SIMD
One stream of instructions is executed. The system (usually with hardware sup-
port) applies this stream of instructions to multiple data elements, thereby supporting
concurrent execution.
Use When Use this pattern for problems that are strictly data parallel, i.e., where
the same instruction will be applied to each of a set of data elements. If the application
involves a significant degree of branching, this pattern is difficult to apply. This limits
the applicability of the pattern, but the benefits are considerable since the SIMD pat-
tern makes it so much easier to reason about concurrency. In a multithreaded program,
for example, the programmer needs to think about the program and understand every
semantically allowed way the instructions can be interleaved. Validating that every
possible interleaving produces a correct result can be prohibitively difficult.
With the SIMD pattern, however, there is only one stream of instructions.
Details This pattern is tightly coupled to the features provided by the platform. For
example, the SIMD pattern is used when writing code for the vector units of a micro-
processor. Writing code using a native vector instruction set sacrifices portability.
float4 A, B, C;
A = B + C;
Known Uses SIMD algorithms are common on any system that includes a vector
processing unit. Often a compiler attempts to extract SIMD vector instructions from
a serial program. Graphics processing units traditionally supported the SIMD pat-
tern, but increasingly over time, this is being replaced with closely related but more
flexible SPMD patterns. In particular, an SPMD program that makes restricted use of
branch statements could be mapped by a compiler onto the lanes of a vector unit to
turn an SPMD program into an SIMD program.
Related Patterns This pattern is a natural choice for many data parallel programs.
With a single stream of instructions, it is possible to build SIMD platforms that are
deterministic. This can radically simplify the software development effort and reduce
validation costs.
As mentioned earlier, programs written using the SPMD pattern can often be con-
verted into an execution pathway that maps onto the SIMD pattern. This is commonly
done on graphics processors, and as graphics processors and CPUs become more gen-
eral and encroach on each other’s turf, this overlap between SIMD and SPMD will
become increasingly important.
3.3.2.3 Loop-Level Parallelism
Use When This pattern is used for any problem where the crux of the computation is
expressed as a manageable number of compute intensive loops. We say “manageable”
since a programmer will need to potentially restructure the individual loops to expose
concurrency; hence, it is best if the program spends the bulk of its time in a small
number of loops.
Details This pattern is deceptively simple. All you do is find the compute intensive
loops and direct the system to execute the iterations in parallel. In the simplest cases
where the loop iterations are truly independent, this pattern is indeed simple. But in
practice, it’s usually more complicated.
The approach used with this pattern is to
• Locate compute intensive loops by inspection or with support from program
profiling tools.
• Manage dependencies between loop iterations so the loop iterations can safely
execute concurrently in any order.
• Tell the parallel programming environment which loops to run in parallel. In
many cases, you may need to tell the system how to schedule loop iterations at
runtime.
Typically, the greatest challenge when working with this pattern is to manage depen-
dencies between loop iterations. Of course, in the easiest case, there are no depen-
dencies, and this issue can be skipped. But in most cases, the code will need to be
transformed. The following code fragment has three common examples of dependen-
cies in a loop:
minval = Largest_positive_int;   // large sentinel value so that min() can update it
for (i=0, ind=1; i<N; i++){
    x = (a[ind+1] + b[ind - 1])/2;               // temporary variable
    sum += Z[(int)x];                            // reduction into sum
    if (is_prime(ind)) minval = min(minval,x);   // shared variable
    ind += 2;                                    // induction variable
}
Privatize temporary variables: the variable “x” is a temporary variable with respect
to any given iteration of a loop. If you make sure that each thread has its own copy
of this variable, threads will not conflict with each other. Common ways to give each
thread its own copy of a variable include
• Promotion to an array: Create an array indexed by the thread ID and use that
ID to select a copy of the variable in question that is private to a thread.
• Use a built in mechanism provided with the programming environment (such as
a private clause in OpenMP) to generate a copy of the variable for each thread.
Remove induction variables: The variable “ind” is used to control the specific ele-
ments accessed in the arrays. The single variable is visible to each thread and therefore
creates a dependence carried between loop iterations. We can remove this dependence
by replacing the variable with an expression computed from the loop index. For exam-
ple, in our code fragment, the variable ind ranges through the set of odd integers. We
can represent this as a function of the loop index
ind = 2 * i + 1
Reductions: The expression to sum together a subset of the elements of the Z array
is called a reduction. The name reduction indicates that a higher dimension object
(e.g., an array) is used to create a lower dimension object (e.g., a scalar) through an
associative accumulation operation. Most programming models include a reduction
primitive since they are so common in parallel programs.
Protect shared variables: In some cases, there are shared variables that cannot be
removed. There is no option other than to operate on them as shared variables. In
these cases, the programmer must assure that only one thread at a time accesses the
variables, i.e., access by any thread excludes any other thread from access to the
variable until the variable is released. This is known as mutual exclusion.
Putting these techniques together using OpenMP, we produce the following version
of the above loop with the dependencies removed:
minval = Largest_positive_int;
#pragma omp parallel for private(x, ind) reduction(+:sum)
for (i=0; i<N; i++){
    ind = 2 * i + 1;
    x = (a[ind+1] + b[ind - 1])/2;
    sum += Z[(int)x];
    if (is_prime(ind)){
        #pragma omp critical
        minval = min(minval,x);
    }
}
OpenMP is described in more detail in Chapter 6 of this book. We will discuss enough
to explain this program fragment. The pragma before the loop tells the OpenMP
system to fork a number of threads and to divide loop iterations between them. The
private clause tells the system to create a separate copy of the variables x and ind
for each thread. This removes any dependencies due to these temporary variables. The
reduction clause tells the system to create a separate copy of the variable “sum”
for each thread, carry out the accumulation operation into that local copy, and then at
the end of the loop, combine the “per-thread” copy of “sum” into the single global
copy of sum. Finally, the critical section protects the update to “minval” with only
one thread at a time being allowed to execute the statement following the critical
pragma.
The final step in the “loop parallelism” pattern is to schedule the iterations of the
loop onto the threads. In the aforementioned example, we let the runtime system
choose a schedule. But in other cases, the programmer may want to explicitly control
how iterations are blocked together and scheduled for execution. This is done with a
schedule clause in OpenMP, the details of which are described later in the OpenMP
chapter.
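For illustration only (a hypothetical variant of the pragma above, not code from the
text), a dynamic schedule that hands out chunks of four iterations to threads as they
become free would be requested as follows:

#pragma omp parallel for private(x, ind) reduction(+:sum) schedule(dynamic, 4)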
Known Uses The loop-level parallelism pattern is used extensively by OpenMP
programmers in scientific computing and in image processing problems. In both
cases, the problem is represented in terms of a grid and updates to the grid are carried
out based on a neighborhood of grid points. This maps directly onto nested loops over
the grid points which can be run in parallel.
Related Patterns This pattern is used to implement geometric decomposition and
data parallel algorithms. When task definitions map onto loop iterations, it is used
with task parallel algorithms as well.
Solutions that utilize this pattern may use the SPMD pattern to explicitly parallelize
the loops. This approach gives the programmer more control over how collections of
loops are divided among threads.
3.3.2.4 Fork-Join
Threads are forked when needed, complete their work, and then join back with a
parent thread. The program execution profile unfolds as a series of serial and concur-
rent phases, often with nesting as forked threads themselves fork additional threads.
Use When This pattern is used in shared address space environments, where the
cost of forking a thread is relatively inexpensive. This pattern is particularly useful
for recursive algorithms or any problems composed of a mixture of concurrent and
serial phases.
Details In Figure 3.2, we provide a high-level overview of the fork-join pattern.
Think of the program as starting with a single, serial thread. At points where concur-
rent tasks are needed, launch or fork additional threads to execute those tasks. These threads
FIGURE 3.2: The fork-join pattern as commonly used with programming models
such as OpenMP. Parallelism is inserted as needed incrementally leading to a program
consisting of sequential parts (one thread) and parallel regions (teams of threads).
may execute asynchronously from the original serial or master thread. At some later
point, the master thread pauses and waits for the forked threads to finish their work.
This is called a join. The resulting program consists of serial parts and parallel regions
with parallelism added incrementally.
A solution that utilizes this pattern must
• Expose the concurrency and isolate the task in a form that can be assigned to a
forked thread.
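As an illustrative sketch (ours, not the Cilk listing that the following paragraph
describes), the same fork-join structure can be expressed with C++11 futures, where
std::async plays the role of Cilk's spawn and get() that of its sync:

#include <future>

long fib(int n)
{
    if (n < 2) return n;
    auto x = std::async(fib, n - 1);   // fork: compute fib(n-1) as a separate task
    long y = fib(n - 2);               // the forking thread keeps working
    return x.get() + y;                // join: wait for the forked task's result
}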
The tasks are the calls to the “fib()” function. The function that contains the instruc-
tions implementing these tasks is marked by the keyword “cilk” to indicate to the
compiler that this function may be executed concurrently. Inside the function, addi-
tional tasks are forked (or in Cilk parlance, spawned) to compute different inter-
mediate results. The “sync” statement in Cilk carries out the function of the join,
i.e., it causes the “forking” thread to wait until all “forked” threads complete before
proceeding.
Known Uses Every program that uses OpenMP or a native threads library such
as Pthreads uses this pattern. The explicit forks and joins, however, may be hidden
inside a higher level API. This pattern is particularly important for any program that
utilizes recursion. Hence, this is an essential pattern for graph algorithms. The style
of programming associated with Cilk also makes heavy use of this pattern.
Related Patterns This pattern is very general and hence it is used to support most of
the higher level algorithms patterns. It is particularly important for recursive splitting
algorithms.
3.3.2.5 Master-Worker/Task-Queue
A master process defines a collection of independent tasks that are dynamically
scheduled for execution by one or more workers.
Use When The master-worker pattern is used for problems that are expressed as a
collection of tasks that are independent or that can be transformed into a form where
they are independent. In this case, the challenge is to schedule the tasks so the com-
putational load is evenly distributed among a collection of processing elements.
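A common implementation (a sketch of ours, assuming the tasks are independent
calls to a hypothetical do_task(i) function) replaces an explicit master with a shared
atomic counter: each worker repeatedly claims the next unprocessed task index, which
gives dynamic load balancing with very little code.

#include <atomic>
#include <thread>
#include <vector>

void run_master_worker(int num_tasks, int num_workers, void (*do_task)(int))
{
    std::atomic<int> next(0);                  // index of the next unclaimed task
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w)
        workers.emplace_back([&]() {
            for (;;) {
                int i = next.fetch_add(1);     // atomically claim a task
                if (i >= num_tasks) break;     // no tasks left
                do_task(i);                    // fast workers simply claim more tasks
            }
        });
    for (auto& t : workers) t.join();
}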
Known Uses This is the pattern of choice for embarrassingly parallel problems
where the initial problem definition is expressed in terms of independent tasks. Batch
queue environments used to share a computational resource among many users are
based on this pattern.
Related Patterns The master-worker pattern is frequently used with the task paral-
lel pattern. It is an especially effective pattern to use with an embarrassingly parallel
problem since a good implementation of the master-worker pattern will automatically
and dynamically balance the load among the workers.
References
[Alexander77] C. Alexander, S. Ishikawa, and M. Silverstein, A Pattern Language:
Towns, Buildings, Construction, Oxford University Press, New York,
1977.
[Blelloch96] G. Blelloch, Programming parallel algorithms, Communications of
the ACM, 39, 85–97, 1996.
[Blumofe95] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson,
K. H. Randall, and Y. Zhou, Cilk: An efficient multithreaded
runtime system, in Proceedings of the Fifth ACM SIGPLAN Sympo-
sium on Principles and Practice of Parallel Programming (PPoPP),
Santa Barbara, CA, pp. 207–216, 1995.
[Dean04] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on
large clusters, in Proceedings of OSDI’04: 6th Symposium on Operat-
ing System Design and Implementation, San Francisco, CA, Decem-
ber 2004.
[Gamma94] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns:
Elements of Reusable Object Oriented Software, Addison-Wesley,
Reading, MA, 1994.
[Hwu08] W.-M. Hwu, K. Keutzer, and T. Mattson, The concurrency challenge,
IEEE Design and Test, 25(4), 312–320, 2008.
Threads and Shared Variables in C++
Hans Boehm
The C++ language is currently undergoing revision. This upcoming version of the
language is customarily referred to as C++0x, in spite of the fact that it is not expected
to be finalized until roughly 2011.
C++0x addresses the difficulties with threads in C++ by adding support for
threads directly to the language, giving the language specification a chance to pre-
cisely define their semantics [6]. The next revision of the C standard is taking a
similar route though with significant differences in the API for thread creation and
synchronization.
In this chapter, we describe multi-threaded programming in C++0x, since it
significantly simplifies our task. Aside from significant syntactic differences, the
approach is fundamentally similar to C++03 with the Boost threads [16] library.
However, the basic rules are clearer, simpler, and much less platform-dependent.
double sqrt123()
{
    double x;
    std::thread t([&]() { x = sqrt(1.23); });
    // parent thread might do something else here.
    t.join();
    return x;
}
The function invoked by the thread t first computes sqrt(1.23) and then
assigns the result to the variable x in the creating (parent) thread. The parent thread
then waits for t to finish before returning the final value of x.
Note that if the parent thread throws an exception between the creation of t and
the call to t.join(), the program will terminate since t will be destroyed without
an intervening t.join() call. If there is danger of such a recoverable exception,
the code should ensure that t.join() is also called on the exception path.
int f()
{
    int x = 0;
    std::thread t([&]() { x = 400; });
    x = 200;
    t.join();
    return x;
}
This program can be understood, for example, as the following interleaving of the
two threads' steps:
Main Thread                  Child Thread
int x = 0;
create child;
                             x = 400;
x = 200;
join child;
return x; // returns 200
Such an execution is called sequentially consistent [12] if it can be understood in this
way as the interleaving of thread steps. In a sequentially consistent execution, the
value of a shared object is always the last value assigned to it in the interleaving, no
matter which thread performed the assignment.
Although this is the traditional way to view the execution of a multi-threaded pro-
gram, it is generally viewed as impractical, especially for a systems programming
language, such as C++, for two different but complementary reasons:
1. The meaning of a program depends on the granularity at which memory oper-
ations are performed. To see this, consider a hypothetical machine on which
memory accesses are performed a byte at a time. Thus, the assignment
x = 400 is actually decomposed into at least the two assignments x_high_
bits = 1 and x_low_bits = 144 (1 × 2^8 + 144 = 400). Thus, we could
also get an interleaving
∗ As a special exception, zero-length bit-fields can be used to separate sequences of bit-fields into multiple
memory locations.
struct {char a; int b:25; int c:26; int d:22; short e;} x;
[Figure: the char a occupies memory location 1; the adjacent bit-fields b, c, and d
together form memory location 2; the short e occupies memory location 3.] Each
memory location is a unit of
memory that can be updated without also writing to adjacent memory, which might
interfere with other threads.
Two memory operations conflict if they access the same memory location and
at least one of them is a store operation to that location. A particular execution of a
program (interleaving) contains a data race if two conflicting operations correspond-
ing to different threads occur adjacently in the interleaving, reflecting the fact that
they are not prevented from executing concurrently. For example, Figure 4.1 has a
data race because the two assignments to x conflict, and can be executed simulta-
neously by different threads, as is seen by the fact that they can be adjacent in an
interleaving. Similarly, Figure 4.2 contains two distinct data races: The assignment
and access to x conflict and may be simultaneously executed, and similarly for y.
A program with a possible execution that includes a data race has “undefined
behavior,” that is, allowed to produce any results whatsoever. (This is sometimes
referred to as “catch fire” semantics for data races, though we do not expect real
implementations to actually cause a fire.) A data race is erroneous, though it is diffi-
cult to diagnose such errors, and typical implementations generally will not. Typical
implementations will, however, perform program transformations that assume there
are no data races. When that assumption is violated, very strange behavior may result.
To understand the special treatment for bit-fields in the definition of “memory loca-
tion,” consider how they are typically implemented. Assume x is a struct consist-
ing of two 4 bit wide fields a and b sharing a single byte. Essentially no modern
hardware allows either field to be changed without storing to the entire byte. Thus,
x.a = 1 would be implemented as
tmp = x; tmp.a = 1; x = tmp;
where tmp is stored in a register, and the middle assignment involves some bit manip-
ulation instructions operating on the register value. This means that if both fields are
initially zero, and two threads concurrently assign to the two fields, we could get the
following interleaving of the generated code:
Thread 1                     Thread 2
tmp = x;
                             tmp2 = x;
                             tmp2.b = 1;
tmp.a = 1;
x = tmp;
                             x = tmp2;
This resulting code clearly has a data race on x, and this is reflected in the unex-
pected result here, in which only the x.b field is actually updated, since the second
thread happens to write last. In order to avoid the introduction of such data races, the
draft standard treats the original code as updating the same memory location in both
threads, and hence as already containing the data race.
By effectively outlawing data races, we ensure that legal code cannot test whether
memory operations are reordered and cannot detect the granularity at which they
are performed. Even larger operations, such as operations on STL containers, can be
treated as though they occur in one atomic operation; any code that could tell the
difference is already outlawed by the preceding rule.
Mutexes: Mutexes prevent certain code sections from being interleaved with each
other. Informally, a mutex (class std::mutex) cannot be acquired by a thread, by
invoking the lock() function on the mutex, until all prior holders of the mutex have
released it by calling its unlock() method. More formally, a lock() call is not
allowed to occur in the interleaving until the number of prior unlock() calls on the
mutex is equal to the number of prior lock() calls.
4.7 Mutexes
To implement the shared counter using the preceding locks, we might write the
following:
#include <thread>
#include <mutex>
using namespace std;

mutex m;
int c;

void incr_c()
{
    m.lock();
    c++;
    m.unlock();
}
Although this is occasionally the right coding style for a code that acquires locks,
it introduces a major trap: If the code between the lock() and unlock() calls
throws an exception, the lock will not be released. This prevents any other thread
from acquiring the lock, and would thus normally cause the program to deadlock in
short order.
One could avoid this by explicitly catching exceptions and also unlocking on
the exception path. But the standard provides a more convenient mechanism:
A std::lock_guard object acquires a mutex when it is constructed, and releases
it on destruction. Since the destructor is also invoked in any exception path, the under-
lying mutex will be released even when an exception is thrown.
Lock_guard uses the standard C++ RAII (resource acquisition is initializa-
tion) idiom. Like some other uses of this idiom, the actual lock_guard variable is
not of interest. (We give it a name of “_” to emphasize that.) Its only purpose is to
ensure execution of its constructor and destructor. Thus, the earlier example could be
rewritten as follows:
#include <thread>
#include <mutex>
using namespace std;

mutex m;
int c;

void incr_c()
{
    lock_guard<mutex> _(m);
    c++;
}
Alternatively, if the counter is declared as an atomic variable (for example,
atomic<int> c;), the increment is itself indivisible and no explicit lock is needed:

void incr_c()
{
    c++;
}
Although atomic operations by default continue to provide sequential consistency
for data-race-free programs, there are also versions of these operations, customar-
ily referred to as “low-level” atomics, that violate the simple interleaving semantics
for improved performance. Unfortunately, on current hardware, these performance
improvements can sometimes be dramatic.
The semantics of low-level atomics are quite complex, and the details are beyond
the scope of this chapter.
We expect that code using atomic objects will generally be written initially using
“high-level” sequentially consistent atomics, and hot paths may then be carefully
manually optimized to use low-level atomics, where it is essential to do so. By doing
so, we at least separate subtle memory ordering concerns from the already sufficiently
subtle issue of developing the underlying parallel algorithm. We briefly illustrate this
process here, describing the simplest and probably most common application of low-
level atomics in the process.
Assume we would like to initialize some static data on the first execution of a
function. (We could also do this by simply declaring a function-local static variable,
or by using the standard-library-provided call_once function, in which case the language
implementation might itself use a technique similar to the one described here.)
A simple way to do so explicitly would be to write the appropriate code using
locks as shown earlier. To implement a function get_x() that returns a reference to
T x;
bool x_init(false);
mutex x_init_m;

T& get_x()
{
    lock_guard<mutex> _(x_init_m);
    if (!x_init) {
        initialize_x();
        x_init = true;
    }
    return x;
}
This has the, often substantial, disadvantage that every call to get_x() involves
the overhead of acquiring and releasing a lock. We can improve matters by using the
so-called double-checked locking idiom:
T x;
atomic<bool> x_init(false);
mutex x_init_m;

T& get_x()
{
    if (!x_init) {
        lock_guard<mutex> _(x_init_m);
        if (!x_init) {
            initialize_x();
            x_init = true;
        }
    }
    return x;
}
We first check whether x has already been initialized without acquiring a lock. If it
has been initialized, there is no need to acquire the lock.
Note that in this version, x_init can be set to true in one thread while being
read by another. This would constitute a data race had we not declared x_init as
atomic. Indeed, in the absence of such a declaration, if the compiler knew that the
implementation of initialize_x did not mention x_init, it would have been
perfectly justified in, for example, moving the assignment to x_init to before the
initialization of x, thus effectively breaking the code.
By declaring x_init to be atomic, we avoid these issues. However, we now
require the compiler to restrict optimizations so that sequential consistency is pre-
served and, more importantly, to insert additional instructions, typically so-called
memory fences, that prevent the hardware from violating sequential consistency.
The overhead of these fences can be reduced by using the low-level atomic operations
mentioned earlier; one common form of the idiom is the following:

T x;
atomic<bool> x_init(false);
mutex x_init_m;

T& get_x()
{
    if (!x_init.load(memory_order_acquire)) {
        lock_guard<mutex> _(x_init_m);
        if (!x_init.load(memory_order_relaxed)) {
            initialize_x();
            x_init.store(true, memory_order_release);
        }
    }
    return x;
}
With acquire and release orderings alone, however, loads may be reordered with
earlier stores. Consider, for example, two threads communicating through atomic
variables x and y, both initially zero:

Thread 1
x.store(1, memory_order_release);
res_1 = y.load(memory_order_acquire);

Thread 2
y.store(1, memory_order_release);
res_2 = x.load(memory_order_acquire);
It would be entirely possible to get a final result of res_1 = res_2 = 0, since the
loads could appear to be reordered with the stores.
For our specific get_x() example, this kind of reordering is typically not an
issue, so the faster primitives are safe for this idiom. It appears to be the case that
no caller of get_x() can tell whether the faster primitives are used, and hence the
caller may continue to reason based on interleavings, that is, sequential consistency.
But even in this simple case, we know of no rigorous proof of that claim. In general, it
appears quite difficult to effectively hide the use of low-level atomics inside libraries,
keeping them invisible from the caller.
Note that the second load from x_init in get_x() cannot in fact occur at the
same time as a store to x_init. Thus, it is safe to require no ordering guarantees.
One perfectly safe use of low-level atomics is for such non-racing accesses to vari-
ables that need to be declared atomic because other accesses are involved in races.
4.9.1 Unique_lock
The lock_guard template provides a useful facility for managing the ownership
of mutexes, and ensuring that mutexes are released in the event of an exception. How-
ever, it imposes one important restriction: The underlying mutex is unconditionally
released by the lock_guard’s destructor. This means that the mutex can never be
released within the scope of the lock_guard.
The unique_lock template provides a generalization of lock_guard that
supports explicit locking and unlocking within the scope of the unique_lock. The
unique_lock tracks the state of the mutex, and the destructor unlocks the mutex
only if it is actually held. In this way, the programmer can get essentially the full
flexibility of direct mutex operations, while retaining the exception safety provided
by lock_guard.
For example, a consumer thread that removes elements from a shared queue q,
protected by a mutex q_mtx, can wait on a condition variable until the queue is
nonempty:

#include <condition_variable>

std::condition_variable queue_nonempty;
// a shared queue q and a std::mutex q_mtx protecting it are assumed to be
// declared elsewhere

{
    std::unique_lock<std::mutex> q_ul(q_mtx);
    while (q.empty()) {
        queue_nonempty.wait(q_ul);
    }
    // retrieve element from q
}
Note that it makes no sense to wait for the queue to be refilled while holding
q_mtx, since that would make it impossible for another thread to add anything to the
queue while we are waiting. Hence, the wait() call on a condition_variable
needs a way to release the mutex while it is waiting. This is done by passing it a
unique_lock, which unlike a lock_guard can be released and reacquired. (One
might alternatively have designed the interface to pass a raw std::mutex. That
would have made it much more difficult to ensure proper release of the mutex in case
of an exception.)
When another thread adds something to the queue, it should subsequently invoke
either
queue_nonempty.notify_one();
or
queue_nonempty.notify_all();
The former allows exactly one waiting thread to continue. It is normally more effi-
cient in cases like our example in which there is no danger of waking the wrong thread,
and thus potentially reaching deadlock. The latter allows all of them to continue, and
is generally safer. If there are no waiting threads, neither call has any effect.
Note that condition_variable::wait() calls always occur in a loop. (The
condition_variable interface also contains an overloaded wait function that
expects a Predicate argument, and executes the loop internally.) In most realistic
cases, for example, if we have more than one thread removing elements from q, this
is usually necessary because a third thread may have been scheduled between the
notify call and the wait call, and it may have invalidated the associated condition,
for example, by removing the just added element from the queue. Since the client
code generally needs a loop anyway, the implementation is somewhat simplified by
allowing wait() to occasionally return spuriously, that is, without an associated
notification. Thus, the loop is required in all cases.
Since wait() can wake up spuriously, the correct code should remain correct
if all wait and notify calls are removed. However, this would again be far less effi-
cient. It might conceivably even prevent the application from making any progress at
all, if waiting threads consume all processor resources, though this is not likely for
mainstream implementations.
The implementation of wait() is guaranteed to release the associated mutex
atomically with the calling thread's entrance into the waiting state. If the condition
variable is notified while holding the associated mutex, there is no danger of the noti-
fication being lost because the waiting thread has released the mutex, but is not yet
waiting, a scenario referred to as a “lost wakeup.” Thus, the wait() call must release
the mutex, and condition variables are tightly connected with mutexes; there is no way
to separate condition_variable::wait() and the release of the mutex into
separate calls.
A C++ condition variable always expects to cooperate with a unique_
lock<mutex>. Occasionally it is desirable to use condition variables with other
types of locks. Std::condition_variable_any provides this flexibility.
Private member functions of a class typically do not acquire the class's mutex,
since they would be called from a public function that already owns the mutex.
The use of a recursive_mutex would allow a public function owning the mutex
to call another public function, which would reacquire it. If a regular mutex were
used in this setting, the second acquisition attempt would either result in deadlock or
throw an exception indicating detection of a deadlock condition.
Either kind of mutex can be acquired either by calling its lock() function, which
blocks, that is, waits until the lock is available, or by calling its try_lock() func-
tion, which immediately returns false if the mutex is not available. The try_
lock() function may spuriously return false, even if the mutex is available.
(Implementations will typically not really take advantage of this permission in the
obvious way, but this permission declares certain dubious programs to contain data
races, and it would potentially be very expensive to provide sequential consistency
for those programs. See Ref. [6] for details.)
One use of the try_lock() function is to acquire two (or more) locks at the
same time, without risking deadlock by potentially acquiring two locks in the opposite
order in two threads, and having each thread successfully acquire the first one. We
can instead acquire the first lock with lock() and try to acquire the second with
try_lock(). If that fails, we release the first and try the opposite order. In fact, the
standard supplies both lock() and try_lock() stand-alone functions that take
multiple mutexes or the like, and follow a process similar to this to acquire them all
without the possibility of deadlock, even in the lock() case.
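For illustration, here is a sketch (with mutex names of our own) of acquiring two
mutexes via the stand-alone lock() function; the adopt_lock tag then lets
lock_guard take over ownership so that both mutexes are still released if an
exception is thrown:

#include <mutex>

std::mutex m1, m2;

void update_both()
{
    std::lock(m1, m2);   // acquires both mutexes without risking deadlock
    std::lock_guard<std::mutex> g1(m1, std::adopt_lock);
    std::lock_guard<std::mutex> g2(m2, std::adopt_lock);
    // ... update data protected by both mutexes ...
}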
The draft standard also provides mutex types timed_mutex and recursive_
timed_mutex that support timeouts on lock acquisition. Since the draft standard
generally does not go out of its way to support real-time programming, for example,
it does not support thread priorities, these are expected to be used rarely.
4.9.4 Call_once
The call_once function is invoked on a once_flag and a function f plus its
arguments. The first call on a particular once_flag invokes f . Subsequent calls
simply wait for the initial call to complete. This provides a convenient way to run
initialization code exactly once, without reproducing the code from Section 4.8.
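Using the get_x() example from earlier (a sketch; x and initialize_x are as
before, and x_once is a name of our own), the same one-time initialization becomes:

#include <mutex>

T x;
std::once_flag x_once;

T& get_x()
{
    std::call_once(x_once, initialize_x);  // runs initialize_x exactly once
    return x;
}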
As we also mentioned earlier, often an even easier way to do this is to declare
a static variable in function scope. If such a variable requires dynamic initialization,
it will be initialized exactly once, on first use, using a mechanism similar to
call_once.
• The main (original) thread must ensure that all threads have been joined before
it returns. This implies that any library that starts a helper thread should either
provide an explicit call to shut down those helper threads, and those calls should
be invoked before main() exits, or it should join with the helper threads when
one of the library’s static duration objects is destroyed. The latter option is quite
tricky, since such a thread may not rely on another library that may be shutting
down at the same time. The standard library is guaranteed to be available until
the end of static duration object destruction, but user-level libraries typically do
not share this property, since their static objects may be destroyed in parallel.
• The entire process may be shut down without running static destructors by call-
ing quick_exit. This will simply terminate all other threads still running at
that point. It is possible to perform some limited shutdown actions in the event
of a quick_exit call by registering such actions with at_quick_exit().
However, any such registered actions should be limited so that they do not inter-
fere with any threads running through the shutdown process.
Unfortunately, any use of the second option is not fully compatible with existing
C++ code that relies on the execution of static destructors. Essentially the entire
application has to agree that quick_exit will be used, and static destructors will
not be used. As a result, we expect most applications to use the first option, and to
ensure that all threads are joined before process exit.
The model we have presented here requires that a separate thread be created for
tasks to be performed in parallel. If this is done indiscriminately, it can easily result in
more concurrently active threads than the runtime can efficiently support, and in far
more thread creation and destruction overhead than desired, possibly outweighing
any benefit from performing the tasks in parallel in the first place. Essentially, we
have left it to the programmer to divide an algorithm into tasks of sufficiently small
granularity to keep all available processors busy, but sufficiently large granularity
to prevent thread creation and scheduling overheads from dominating the execution
time.
A variety of library facilities have been proposed to mitigate these issues:
• Parallel algorithms libraries (cf. STAPL [3]) that directly provide for paral-
lel iteration over data structures, and thus the library either simplifies or takes
responsibility for the choice of task granularity. These often also support con-
tainers distributed over multiple machines, an issue we have not otherwise
addressed here.
• Thread pools allow a small number of threads to be reused to perform a large
number of tasks.
• Fork-join frameworks such as Intel Threading Building Blocks [10,11] or
Cilk++ [7,8] allow the programmer to create a large number of potentially
parallel tasks, letting the runtime make decisions about whether to run each
one in parallel, essentially in a thread pool, or to simply execute it as a function
call in the calling thread. This allows the programmer to concentrate largely on
exposing as much parallelism as possible, leaving the issue of limiting paral-
lelism to minimize overhead to the runtime.
Particularly, the latter two often allow the cost of creating a logically parallel task to
be reduced far below typical thread creation costs. The next chapter discusses some
related approaches in other languages in more detail.
Although such facilities were discussed in the C++ committee, none were incor-
porated into C++0x. Such facilities are quite difficult to define precisely and cor-
rectly. Issues like the following need to be addressed:
• How do thread_local variables interact with tasks? Does a task always
see consistent values? Can they live longer than expected, leaking memory, and
potentially outliving some data structures required to execute the destructor?
• What synchronization operations are the tasks allowed to use? What about
libraries that use synchronization “under the covers”? Usually, serializing exe-
cution of logically parallel and unrestricted tasks introduces deadlock possibil-
ities. What rules should the programmer follow to avoid this?
Although there is agreement that we would like to encourage such programming prac-
tice, the prevailing opinion was that it was premature to standardize such features
without a deeper understanding of the consequences, particularly in the context of
C++.
C++0x does include some minimal support targeting such approaches, which
were finalized late in the process:
• Class promise provides a facility for setting and waiting for a result produced
by a concurrently executing function. This provides a set_value member
function to supply the result, for example, in a child thread, and a get_future
function to obtain a corresponding future object, which can be used to wait
for the result, for example, in a parent thread. This allows propagation of excep-
tions from the child to the parent. Class packaged_task can be used to sim-
plify the normal use case, by wrapping a function so that its result is propagated
to an associated future object. (A brief sketch using promise and future appears
after this list.)
• The function template async allows very simple and convenient execution
of parallel tasks. In order to avoid the aforementioned problems, such tasks
are always executed either in their own thread or, by default, at the runtime’s
discretion, sequentially at the point at which the result is required. We could
rewrite our first complete threads example as follows:
#include <future>

double sqrt123()
{
    auto x = std::async([]() { return sqrt(1.23); });
    // parent thread might do something else here.
    return x.get();
}
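As a brief sketch of the promise/future mechanism described in the first bullet (ours,
not code from the text), the earlier sqrt123() example could also be written so that
the child thread delivers its result through a promise:

#include <cmath>
#include <future>
#include <thread>

double sqrt123_with_promise()
{
    std::promise<double> p;
    std::future<double> f = p.get_future();
    std::thread t([&p]() { p.set_value(sqrt(1.23)); });
    // parent thread might do something else here.
    double result = f.get();   // waits until the child calls set_value
    t.join();
    return result;
}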
Pre-C++0x multi-threaded code typically relies on a C-level API such as Posix
threads or the Windows API. The former is closer to what we have presented here than the lat-
ter. It is also possible to use one of several possible add-on C++ threads APIs,
which are usually implemented as a thin layer over the underlying C-level API. The
C++0x threads API is very loosely based on one of these, namely, that provided by
Boost [16].
4.12.2 No Atomics
There is no full equivalent of C++0x atomic<T> variables. There are a number
of vendor-specific extensions that provide related functionality, such as Microsoft’s
Interlocked... functions or gcc’s __sync... intrinsics. However, these pro-
vide primarily atomic read-modify-write operations, and it is unclear how to safely
implement plain loads and stores on such shared values. They also occasionally pro-
vide surprising memory ordering semantics.
Note that it is generally not safe to access variables that may be concurrently
modified by another thread using ordinary variable accesses. Such an attempt may
have results that do not correspond to reading any specific value of the variable. For
example, consider
{
    bool my_x = x; // x is shared variable
    if (my_x) m.lock();
a:  ...
    if (my_x) m.unlock();
}
The compiler may initially load my_x, use that value for the first conditional, then
be forced to “spill” the value of my_x when it runs out of available registers, and
then reload the value of my_x from x for the second condition, since it knows that
my_x was just a copy of x. If the value of x changes from false to true, the net
result of this is a likely runtime error as m.unlock() is called without a prior call
to m.lock(), something that initially appears impossible based on the source code.
Operations on C++ volatiles do put the compiler on notice that the object
may be modified asynchronously, and hence are generally safer to use than ordinary
variable accesses. In particular, declaring x volatile in the preceding example would
prevent my_x from being reloaded from it a second time. However, volatile does
not in general guarantee that the resulting accesses are indivisible; indeed it cannot,
since arbitrarily large objects may be declared volatile. On most platforms, there
are also very weak or no guarantees about memory visibility when volatile is
used with threads. We know of no platform on which Dekker's example
would yield the same behavior with volatile int as it would with
atomic<int>.
In general, there is no fully portable replacement for atomic<T>. The usual solu-
tions are to either use locks instead, to use non-portable solutions, or to use a third-
party library that attempts to hide the platform differences.
a data race, though only if the programmer happened to know that this loop executed
for no iterations, and concurrently accessed count. There are other similar, though
much less likely, transformations that introduce races into much more plausible
code [4].
In practice, programmers rely on the fact that such transformations are very unlikely
to introduce problems. We know of no real work-around other than disabling some
rather fundamental compiler optimizations that only rarely violate the C++0x rules.
Since such transformations are no longer legal in C++0x, we expect them to rapidly
disappear, and a few compilers are already careful to avoid them.
References
1. S.V. Adve. Designing memory consistency models for shared-memory multi-
processors. PhD thesis, University of Wisconsin-Madison, Madison, WI, 1993.
2. S. Adve and H.-J. Boehm. Memory models: A case for rethinking parallel lan-
guages and hardware. Communications of the ACM, 53, 8, pp. 90–101, August
2010.
3. P. An, A. Jula, S. Rus, S. Saunders, T. Smith, G. Tanase, N. Thomas, N. Amato,
and L. Rauchwerger. STAPL: An adaptive generic parallel C++ library. In Work-
shop on Languages and Compilers for Parallel Computing (LCPC), Cumberland
Falls, KY, August 2001.
Judith Bishop
Contents
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.1 Types of Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1.2 Overview of the Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 .NET Parallel Landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3 Task Parallel Library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.1 Basic Methods—For, ForEach, and Invoke . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.2 Breaking Out of a Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3.3 Tasks and Futures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4 PLINQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Java Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.6.1 Thread Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.6.2 java.util.concurrent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.7 Fork-Join Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.7.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.8 ParallelArray Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.1 Introduction
.NET and Java are both platforms supporting object-oriented languages, designed
for embedded systems and the Internet. As such they include libraries for the concur-
rent execution of separate threads of control, and for communication between com-
puters via typed channels and various protocols. Just using any of these libraries in
either language will not, however, automatically cause a program to speed up on a
multicore machine. Moreover, it requires very careful programming to avoid
making the classic errors when trying to maintain mutual exclusion of shared vari-
ables, or to achieve correct synchronization of threads without running into deadlock.
In this chapter, we describe and discuss the advances that have been made in the past
few years on both the .NET and Java platforms in making concurrent and parallel
programming easier for the programmer, and ultimately more efficient.
Concurrent and parallel are terms that are often, incorrectly, used interchangeably.
Since this book is about parallelism, we need to define it and explain why it is different
to concurrency. Concurrency occurs naturally in all programs that need to interact
with their environment: multiple threads handle different aspects of the program and
one can wait for input while another is computing. This behavior will occur even on
a single core. Successful concurrency requires fair scheduling between threads.
Parallelism is defined differently: the threads are intended to execute together and
the primary goal is speedup. The wish is that by adding n cores, the program will
run n times faster. Sometimes this goal can almost be achieved, but there are spec-
tacular failures as well [1], and many books and papers spend time on how to avoid
them [2–4].
Parallelism uses unfair scheduling to ensure all cores are loaded as much as pos-
sible. This chapter describes new libraries incorporated into familiar languages that
help to avoid these pitfalls by representing common patterns at a reasonably high
level of abstraction. The value of using these libraries is that if one has a program
that needs speeding up, the programming is easier, the resulting code is shorter, and
there is less chance of making mistakes. On the other hand, if these techniques are applied
to tasks that are small, the overhead will most probably make the program run slower.
FIGURE 5.1: The .NET Framework parallel extensions. (From MSDN, Parallel programming with .NET, http://msdn.microsoft.com/en-us/library/dd460693%28VS.100%29.aspx, 2010. With permission.)
The two libraries that emerged in the 2007–2010 timeframe for all .NET managed
languages are the Task Parallel Library (TPL) and Parallel LINQ (PLINQ); both are discussed in the sections that follow.
The Parallel class has other methods: ForEach and Invoke. In keeping with
the object-oriented nature of .NET languages, ForEach can be used to iterate over
any enumerable collection of objects, which could be an array, list etc. The basic
structure of the parallel version, as compared to the sequential one, is as follows:
// Sequential
foreach (var item in sourceCollection) {
Process(item);
}
//Parallel
Parallel.ForEach (sourceCollection, item =>
Process(item)
);
There are also overloads of ForEach that provide the iteration index to the loop
body. However, the index should not be used to perform access to elements via cal-
culated indices, because then the independence of the operations, potentially running
on different cores, would be violated.
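For illustration only (this listing is not from the text), one such overload passes the index as a third lambda parameter:

// Illustrative sketch: the third lambda parameter is the iteration index.
// Requires using System; using System.Threading.Tasks;
Parallel.ForEach(sourceCollection, (item, loopState, index) =>
{
    Console.WriteLine("Iteration {0} processed {1}", index, item);
    Process(item);
});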
Invoke is the counterpart of the Par construct found in earlier languages [15,16].
This method can be used to execute a set of operations, potentially in parallel. As with
the bodies of the For and ForEach, no guarantees are made about the order in which
the operations execute or whether they do in fact execute in parallel. This method does
not return until each of the provided operations has completed, regardless of whether
completion occurs due to normal or exceptional termination. Invoke is illustrated in
this example of a generic parallel Quicksort, programmed recursively, with recourse
to Insertion sort when the lists are small:
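A sketch of such a sort, with hypothetical InsertionSort and Partition helpers and an assumed threshold, might look as follows:

// Illustrative sketch of a Parallel.Invoke-based Quicksort.
// InsertionSort and Partition are hypothetical sequential helpers;
// requires using System; using System.Threading.Tasks;
static void ParQuickSort<T>(T[] data, int lo, int hi) where T : IComparable<T>
{
    const int Threshold = 32;                    // assumed cut-over to Insertion sort
    if (hi - lo <= Threshold)
    {
        InsertionSort(data, lo, hi);
        return;
    }
    int pivot = Partition(data, lo, hi);
    Parallel.Invoke(                             // the two halves may run in parallel
        () => ParQuickSort(data, lo, pivot - 1),
        () => ParQuickSort(data, pivot + 1, hi));
}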
5.3.2 Breaking Out of a Loop
Sometimes one iteration finds a result that makes the remaining iterations unnecessary,
and the loop should be stopped early. The following example, from [6], shows how an
iteration can stop the whole loop:
// Example of stopping a parallel loop early
// From [6]
ParallelLoopResult loopResult =
    Parallel.ForEach(Students,
        (student, loop) => {
            // Lots of processing per student here, then
            if (student.grade == 100) {
                loop.Stop();
                return;
            }
            // Checking if any other iteration has stopped
            if (loop.IsStopped) return;
        });
In this example, finding out details of the top student would be difficult, since the
iterations would be competing to write them to a shared variable declared outside the
loop. The break method, on the other hand, will complete all the iterations below
the one where the break occurs, and it can return its iteration number. Moreover, if
there is more than one student with 100%, the earlier snippet will likely give different
answers on different runs—such is non-determinism.
5.3.3 Tasks and Futures
Consider the simple task graph in Figure 5.2 where one arm has the function F1
and the other arm has the functions F2 and F3 in sequence. The two arms join to
provide input to F4 (adapted from [3]).
One example for creating futures for this graph would be as follows:
Task<int> futureB =
Task.Factory.StartNew<int>(() => F1(a));
int c = F2(a);
int d = F3(c);
int f = F4(futureB.Result, d);
return f;
The future created for F1(a) starts first and then on a multicore machine, F2 can
start as well. F3 will start as soon as F2 completes and F4 as soon as F1 and F3
complete (in any order).
FIGURE 5.2: A task graph for futures. (From Campbell, C., et al., Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures, Microsoft Press, 167pp., http://parallelpatterns.codeplex.com/, 2010. With permission.)
futureF.ContinueWith((t) =>
myTextBox.Dispatcher.Invoke(
(Action)(() => {myTextBox.Text = t.Result.ToString();}))
);
The task that invokes the action on the text box is now linked to the result
of F4.
5.4 PLINQ
LINQ was introduced in the .NET Framework 3.5. It enables the querying of collec-
tions or data sources such as databases, files, or the Internet in a type-safe manner.
LINQ to Objects is the name for LINQ queries that are run against in-memory col-
lections such as List <T> and arrays and PLINQ is a parallel implementation of
the LINQ pattern [10].
A PLINQ query in many ways resembles a nonparallel LINQ to Objects query. Like
LINQ, PLINQ queries have deferred execution, which means they do not begin exe-
cuting until the query is enumerated. The primary difference is that PLINQ attempts
to make full use of all the processors on the system. It does this by partitioning the
data source into segments, and then executing the query in each segment on separate
worker threads in parallel on multiple processors. In many cases, parallel execution
means that the query runs significantly faster.
Here is a simple example of selecting certain numbers that satisfy the criteria in an
expensive method Compute and putting them in a collection of the same type as the
source.
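A minimal sketch consistent with that description (illustrative only; Compute is the expensive test mentioned above) is:

// Illustrative sketch; requires using System.Linq;
var source = Enumerable.Range(1, 10000);
var selected = from num in source.AsParallel()
               where Compute(num) > 0
               select num;                // deferred: runs when enumerated
int[] results = selected.ToArray();       // gather the selected numbers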
The mention of AsParallel indicates that the runtime may partition the data
source for computation by different cores. In PLINQ, the var declaration works out
types at compile time.
PLINQ also has a ForAll method that will iterate through a collection in an
unordered way (as opposed to using a normal foreach loop that would first wait
for all iterations to complete). In this example, the results of the earlier query are
added together into a bag of type System.Collections.Concurrent.
ConcurrentBag(Of Int) that can safely accept concurrent bag operations.
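Reusing the query selected from the sketch above, an equally minimal (and again illustrative) ForAll sketch could be:

// Illustrative sketch: ForAll adds results to the bag as they are produced,
// without waiting for the whole query to finish.
var bag = new System.Collections.Concurrent.ConcurrentBag<int>();
selected.ForAll(n => bag.Add(n));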
PLINQ will be of great benefit for graphics and data processing type applications
of the embarrassingly parallel kind. It can of course also work in combination with
the task synchronization provided by TPL. Consider the popular Ray Tracer example
[17]. This method computes the intersections of a ray with the objects in a scene.
private IEnumerable<ISect> Intersections(Ray ray, Scene scene) {
    var things = from obj in scene.Things
                 let inter = obj.Intersect(ray)
                 where inter != null
                 orderby inter.Dist
                 select inter;
    return things;
}
The example was originally written in C# 3.0 syntax and made references to the
methods being called on each object more explicit:
// Example of LINQ
// Luke Hoban’s Ray Tracer [17]
private IEnumerable <ISect> Intersections(Ray ray, Scene scene) {
return scene.Things
.Select(obj => obj.Intersect(ray))
.Where(inter => inter != null)
.OrderBy(inter => inter.Dist);
}
The Intersections method selects each object in the scene that intersects with
the ray, forming a new stream of Things, then takes only those that actually intersect
(the Where method), and finally uses the OrderBy method to sort the rest by the
distance from the source of the ray. Select, Where, and OrderBy are methods
that have delegates as parameters and the syntax used is the new => operator followed
by the expression to be evaluated. It is the clever positioning of the dots that makes this
code so easy to read [18]. Query comprehension syntax, introduced in C# 3.0, was a
natural advance in syntactic “sugar.”
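As a speculative sketch (not the chapter's code), parallelizing this particular query only requires making the source parallel:

// Sketch only: the same query with the data source made parallel.
private IEnumerable<ISect> Intersections(Ray ray, Scene scene) {
    return scene.Things.AsParallel()
                .Select(obj => obj.Intersect(ray))
                .Where(inter => inter != null)
                .OrderBy(inter => inter.Dist);
}

In a full ray tracer it is usually more profitable to parallelize at a coarser level, such as per pixel; the point here is only where AsParallel fits.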
5.5 Evaluation
The TPL library contains sophisticated algorithms for dynamic work distribution
and automatically adapts to the workload and particular machine [6]. An important
point emphasized by Leijen et al. is that the primitives of the library express potential
parallelism, but do not enforce it. Thus, the same code will run as well as it can on
one, two, four, or more cores.
When a parallel loop runs, the TPL partitions the data source so that the loop can
operate on multiple parts concurrently. Behind the scenes, the task scheduler parti-
tions the task based on system resources and workload. When possible, the scheduler
redistributes work among multiple threads and processors if the workload becomes
unbalanced. Provision is also made for the user to redefine the task scheduler.
The TPL runtime is based on well-known work-stealing techniques and the scal-
ability and behavior is similar to that of JCilk [19] or the Java fork-join framework
[20]. In most systems, the liveness of a task is determined by its lexical scope, but
it is not enforced. Tasks in TPL however are first-class objects syntactically and are
allocated on the heap. A full discussion of the implementation, which includes the
novel duplicating queues data structure, is given in [6].
In the presence of I/O blocking, scheduling can also be an issue. By default, the TPL
uses a hill-climbing algorithm to improve the utilization of cores when the threads
are blocked. Customized schedulers can also be written. An example would be a
scheduler that uses an I/O completion port as its throttling mechanism: if one of the
workers blocks on I/O, other tasks are run in its place so that the core is not lost and
forward progress continues on other threads.
In a similar way, PLINQ queries scale in the degree of concurrency based on the
capabilities of the host computer. Through parallel execution, PLINQ can achieve
significant performance improvements over legacy code for certain kinds of queries,
often just by adding the AsParallel query operation to the data source. However,
as mentioned earlier parallelism can introduce its own complexities, and not all query
operations run faster in PLINQ. In fact, parallelization can actually slow down cer-
tain queries if there is too much to aggregate at the end of the loop, or if a PLINQ
query occupies all the cores so that other threads cannot work. This latter problem can
be overcome by limiting the degree of parallelism. In this example, the loop is limited
to two cores, leaving any others for other threads that might also be running.
var source = Enumerable.Range(1, 10000);
// Opt-in to PLINQ with AsParallel
var evenNums = from num in source.AsParallel()
.WithDegreeOfParallelism(2)
where Compute(num) > 0
select num;
In documented internal studies, TPL holds its own against hand-tooled programs
and shows almost linear speedup for the usual benchmarks [6]. We converted the
Ray Tracer example found at [17] to TPL and PLINQ and had the speedups shown
in Figure 5.3.
FIGURE 5.3: Ray Tracer comparison: time (ms) per trial for degrees of parallelism 1, 2, and 4.
The machine was a 4 x 6-core Intel Xeon E7450 @ 2.4 GHz (24 cores in total)
with 96 GB of RAM. The operating system was Windows Server 2008 SP2 64-bit (May
2009) running Visual Studio 2010 with its C# compiler (April 2010).
Now, to run the preceding two tasks simultaneously, we create a Thread object
for each Runnable object, then call the start() method on each Thread:
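Under the assumption that task1 and task2 name the two Runnable objects, a minimal sketch is:

// Sketch only: task1 and task2 are assumed to be the Runnables defined above.
Thread thread1 = new Thread(task1);
Thread thread2 = new Thread(task2);
thread1.start();   // begins executing task1 on a new thread and returns at once
thread2.start();   // likewise for task2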
A call to start() spawns a new thread that begins executing the task that was
assigned to the Thread object at some time in the near future. Meanwhile, control
returns to the caller of start(), and the second thread starts. So there will be at least
three threads running in parallel: the two just started, plus the main. (In reality, the
JVM will tend to have extra threads running for housekeeping tasks such as garbage
collection, although they are essentially outside of the program’s control.)
There are several other ways of programming the earlier solution, and together
with standard synchronization primitives such as wait and notify, threads are
adequate for many concurrency applications. However, the model is at too low a level
for expressing solutions to well-known data-parallel or task-parallel problems that
need to harness the power of multicores, such as array processing involving
even simple synchronization.
5.6.2 java.util.concurrent
The java.util.concurrent package in Java 5 [26] provides classes and
interfaces aiming at simplifying the development of concurrent and parallel appli-
cations by providing high-quality implementations of their common building blocks.
The package includes classes optimized for concurrent access, such as the concurrent
collections ConcurrentHashMap, ConcurrentLinkedQueue, and CopyOnWriteArrayList,
synchronizers such as CountDownLatch and Semaphore, and the executor framework.
The executor framework, for example, lets a program hand Runnable tasks to a fixed
pool of worker threads, as in this fragment:
logger.info("Starting benchmark");
threadExecutor = Executors.newFixedThreadPool(numWorkers);
long elapsedTimeMillis = System.currentTimeMillis();
while (!patient.isEmpty()) {
threadExecutor.execute(patient.poll());
}
// Wait for termination
threadExecutor.shutdown();
ExecutorService threadExecutor =
Executors.newFixedThreadPool(numWorkers);
5.7.1 Performance
Goetz quotes the following results for the performance of the preceding program
[20]. Table 5.1 shows some results of selecting the maximal element of a 500,000
element array on various systems and varying the threshold at which the sequential
version is preferred to the parallel version. For most runs, the number of threads in
the fork-join pool was equal to the number of hardware threads (cores times threads-
per-core) available. The numbers are presented as a speedup relative to the sequential
version on that system.
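As an illustration of the idea being measured, here is a hedged sketch of such a maximum search written against the fork/join API as later standardized in java.util.concurrent; the program benchmarked by Goetz used the earlier prototype library and differs in detail:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch of a fork/join maximum search with a sequential threshold.
class SelectMaxTask extends RecursiveTask<Integer> {
    private static final int THRESHOLD = 5000;    // tuning parameter, cf. Table 5.1
    private final int[] data;
    private final int lo, hi;                     // half-open range [lo, hi)

    SelectMaxTask(int[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Integer compute() {
        if (hi - lo <= THRESHOLD) {               // small enough: solve sequentially
            int max = Integer.MIN_VALUE;
            for (int i = lo; i < hi; i++) max = Math.max(max, data[i]);
            return max;
        }
        int mid = (lo + hi) >>> 1;                // otherwise split the range in two
        SelectMaxTask left  = new SelectMaxTask(data, lo, mid);
        SelectMaxTask right = new SelectMaxTask(data, mid, hi);
        left.fork();                              // run the left half asynchronously
        int rightMax = right.compute();           // compute the right half directly
        return Math.max(left.join(), rightMax);
    }
}
// Usage: int max = new ForkJoinPool().invoke(new SelectMaxTask(array, 0, array.length));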
TABLE 5.1: Results of running selectMax on 500k element arrays on various systems.

                               Threshold   Threshold   Threshold   Threshold   Threshold
                                = 500k       = 50k       = 5k        = 500       = 50
Pentium-4 HT (2 threads)          1.0         1.07        1.02        .82         .2
Dual-Xeon HT (4 threads)          .88         3.02        3.2         2.22        .43
8-way Opteron (8 threads)         1.0         5.29        5.73        4.53        2.03
8-core Niagara (32 threads)       .98        10.46       17.21       15.34

Source: Goetz, B., Java theory and practice: Stick a fork in it, http://www.ibm.com/
developerworks/java/library/j-jtp11137.html, November 2007.
The results show that on two cores, the program actually slows down badly when
it is forced to perform mainly in parallel (the last column with threshold 50). For four
cores, the speedup is about three when the threshold is chosen well.
Lea [22] explains that ForkJoin pays off if either
• there are lots of elements with small or cheap operations, or
• each of the operations is time-consuming and there are not many elements.
In the tests on the right-hand columns, the operations were not sufficiently time-
consuming. However, the principal benefit of using the fork-join technique is that
it affords a portable means of coding algorithms for parallel execution. The program-
mer does not have to be aware of how many cores will be available and the runtime
can do a good job of balancing work across available workers, yielding reasonable
results across the wide range of hardware that Java runs on [20].
5.9 Conclusion
Although in the common mind the .NET and Java platforms seem to be very sim-
ilar, the way they have been able to develop has been completely different, and in
certain critical cases, such as support for parallel processing (or in the past, gener-
ics), this difference has led to one getting ahead of the other.
Java has a community-based development model. Researchers from around the
world propose alternate solutions to problems. The advantage is that once they get
to the stage of adoption by Sun (now Oracle), there is a ready community of users.
.NET is supported more by its own researchers who have a faster track to getting new
ideas adopted. The downside is that a user community is not automatically built at
the same time. The effect of these business factors on technology can be long-lasting.
In making use of concurrency, we still need to take into consideration Amdahl’s
law, which states that the speedup attainable for any code is limited by the fraction
of the code that cannot be run in parallel. A paper that strikes a warning is that of Hill and
Marty [29], where they explain that there are at least three different types of multicore
chip architectures, which they call symmetric, asymmetric, and dynamic, and that
the performance across multiple cores for standard functions was quite different in
each case. It will be a challenge worth pursuing to see how the language features
described in the preceding text stand up to the new machines, not just the test cases
on the desktop.
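For reference, the law in its usual form, with p the parallelizable fraction of the execution time and n the number of cores, is

S(n) = \frac{1}{(1 - p) + p/n}

so a program that is 90% parallelizable, for example, can never run more than 10 times faster no matter how many cores are added.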
Acknowledgments
My thanks to Ezra Jivan, while in the Polelo Group, University of Pretoria, for his
assistance with the programming; to Mauro Luigi Dragos of Politecnico di Milano for
checking and retuning the programs; and to Stephen Toub and Daan Leijen, Microsoft,
for their careful reading of the draft and their expert comments. Section 5.5 relies on
the work done with Campbell et al. [3]. Work for this chapter was initially done when
the author was at the University of Pretoria, South Africa, and supported by a National
Research Foundation grant.
References
1. Holub, A., Warning! Threading in a multiprocessor world, JavaWorld, http://
www.javaworld.com/jw-02-2001/jw-0209-toolbox.html, September 2001.
2. Lea, D., Concurrent Programming in Java: Design Principles and Patterns, 2nd
edn. Addison Wesley, Reading, MA, 1999.
3. Campbell, C., R. Johnson, A. Miller, and S. Toub, Parallel Programming
with Microsoft .NET: Design Patterns for Decomposition and Coordination
on Multicore Architectures, Microsoft Press, 167pp, at http://parallelpatterns.
codeplex.com/, 2010.
4. Toub, S., Patterns of parallel programming—Understanding and applying
patterns with the .NET framework 4 and Visual C#, White Paper, 118pp.,
http://www.microsoft.com/downloads/details.aspx?FamilyID=86b3d32b-ad26-
4bb8-a3ae-c1637026c3ee&displaylang=en, 2010.
5. MSDN, Parallel programming with .NET, http://msdn.microsoft.com/en-
us/library/dd460693%28VS.100%29.aspx, 2010.
6. Leijen, D., W. Schulte, and S. Burckhardt, The design of a task parallel library,
Proc. 24th ACM SIGPLAN Conference on Object Oriented Programming Sys-
tems Languages and Applications (OOPSLA ’09), Orlando, FL, pp. 227–242.
DOI = http://doi.acm.org/10.1145/1640089.1640106
7. Freeman, A., Pro .NET 4 Parallel Programming in C# (Expert’s Voice in .NET),
APress, New York, 328pp. 2010.
8. Campbell, C. and A. Miller, Parallel Programming with Microsoft Visual Stu-
dio C++ : Design Patterns for Decomposition and Coordination on Multicore
Architectures, Microsoft Press, 208pp, at http://parallelpatterns.codeplex.com/,
2011.
9. Duffy, J. and E. Essey, Running queries on multi-core processors. MSDN Maga-
zine, http://msdn.microsoft.com/en-us/magazine/cc163329.aspx, October 2007.
10. MSDN, Parallel LINQ (PLINQ), http://msdn.microsoft.com/en-us/library/
dd460688.aspx, 2010.
11. MSDN, Task Parallel Library overview, http://msdn.microsoft.com/en-us/
library/dd460717(VS.100).aspx, July 2009.
12. Stephens, R., Getting started with the .NET Task Parallel Library, DevX.com,
http://www.devx.com/dotnet/Article/39204/1763/, September 2008.
13. OpenMP, http://openmp.org/wp/, 2010.
14. Gregor, D. and A. Lumsdaine, Design and implementation of a high-performance
MPI for C# and the common language infrastructure, Proc. 13th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, Salt Lake City,
UT, pp. 133–142, 2008.
15. Handel-C Language Reference Manual. Embedded Solutions Limited, 1998.
16. occam 2.1 Reference Manual. SGS-THOMSON Micro-electronics Limited,
1995.
17. Hoban, L., A ray tracer in C# 3.0, blogs.msdn.com/lukeh/archive/2007/04/03/a-
ray-tracer-in-c-3-0.aspx
18. Bierman, G. M., E. Meijer, and W. Schulte, The essence of data access in
Comega: The power in the dot. ECOOP 2005, Glasgow, U.K., pp. 287–311.
19. Lee, I.-T. A., The JCilk multithreaded language. Master's thesis, MIT, also at
http://supertech.csail.mit.edu/jCilkImp.html, September 2005.
20. Goetz, B., Java theory and practice: Stick a fork in it, http://www.ibm.
com/developerworks/java/library/j-jtp11137.html, November 2007.
21. Lea, D., The java.util.concurrent Synchronizer framework, PODC Workshop on
Concurrency and Synchronization in Java Programs, CSJP’04, July 26, 2004, St
John’s, Newfoundland, CA http://gee.cs.oswego.edu/dl/papers/aqs.pdf
22. Lea, D., A java fork/join framework. Java Grande, pp. 36–43, 2000.
23. Kaminsky, A. Building Parallel Programs: SMPs, Clusters, and Java. Cengage
Course Technology, Florence, KY, 2010. ISBN 1-4239-0198-3.
24. Shafi, A., B. Carpenter, and M. Baker, Nested parallelism for multi-core HPC
systems using Java, Journal of Parallel and Distributed Computing, 69(6),
532–545, June 2009.
25. Danaher, J. S., I.-T. A. Lee, C. E. Leiserson, Programming with exceptions in
JCilk, Science of Computer Programming, 63(2), 147–171, 2006.
26. http://download-llnw.oracle.com/javase/6/docs/api/java/util/concurrent/package-
summary.html
27. http://www.javac.info/jsr166z/jsr166z/forkjoin/ParallelArray.html
28. Neward, T., Forking and joining Java to maximize multicore power, DevX.com,
http://www.devx.com/SpecialReports/Article/40982, February 2009.
29. Hill, M. D. and M. R. Marty, Amdahl’s law in the multicore era, http://www.
cs.wisc.edu/multifacet/amdahl/ (for interactive website) and IEEE Computer, pp.
33–38, July 2008.
Chapter 6
OpenMP
Contents
6.1 Introduction
OpenMP (OpenMP 2009), the Open standard for shared memory MultiProcess-
ing, was designed to facilitate the creation of programs that are able to exploit the
features of parallel computers where memory is shared by the individual proces-
sor cores. Widely supported by mainstream commercial and several open source
compilers, it is well suited to the task of creating or adapting application codes to
execute on platforms with multiple cores (Intel 2009, Open64 2005, Portland 2009,
Sun 2005). OpenMP is intended to facilitate the construction of portable parallel pro-
grams by enabling the application developer to specify the parallelism in a code at
a high level, via an approach that moreover permits the incremental insertion of its
constructs. However, it also permits a low-level programming style, where the pro-
grammer explicitly assigns work to individual threads.
In this chapter, we will first give an overview of the major features of OpenMP and
its usage model, as well as remark upon the process by which it has been designed
and is further evolving. In Section 6.2, we will then introduce the basic constructs
of OpenMP and illustrate them using several short examples. In Section 6.3, we then
describe the manner in which OpenMP is implemented. One of the greatest chal-
lenges facing any application developer is learning how to overcome performance
problems. Some understanding of the strategy used to implement OpenMP, and a
basic appreciation of the implications of a shared memory model, will help us to dis-
cuss the most important considerations in this regard; this is the topic of Section 6.4.
We then briefly discuss some performance considerations before concluding with a
recap of the main points.
The API consists of compiler directives, runtime library routines, and environment
variables for specifying the parallelism, and we will give an example of each later in the text. Typically,
both directives and library routines are needed to create an OpenMP
program.
The API provides a means for the programmer to explicitly create, assign work
to, synchronize, and terminate a set of numbered, logical threads. The implemen-
tation will use this information to translate an OpenMP code into a collection of
cooperating tasks that are assigned to the system-level threads on the target platform
for execution. These statically defined tasks are transparent to the programmer; they
will be executed in the order and manner that corresponds to the semantics of the
OpenMP constructs used. However, the API also enables the application developer
to explicitly specify tasks that will be dynamically created and scheduled for exe-
cution in an order that is not predetermined. The specification further provides a
means to synchronize the actions of the various logical threads in order to ensure,
in particular, that values created by one thread are available before another thread
attempts to use them. It is also possible to ensure that only one thread accesses a
specific memory location, or performs a set of operations, at a time without the
possibility of interference by other threads. This is sometimes required in order to
avoid data corruption. Last, but not least, the specification provides a number of so-
called data attributes that may be used to state whether data is to be shared among
the executing threads or is private to them. The logical threads may cooperate by
reading and writing the same shared variables. In contrast, each thread will have its
own local instance of a private variable, with its own value. This makes it relatively
easy for the threads to work on distinct computations and facilitates efficiency of
execution.
An OpenMP implementation typically consists of two components. The first of
these is the compiler, created by extending a preexisting compiler that implements
one or more of the base languages Fortran, C, or C++. It will translate the OpenMP
constructs by suitably modifying the program code to generate tasks and manage
the data. As part of this translation, it will insert calls to routines that will manage the
system threads at run time and assign work to them in the manner specified by the
application programmer. The custom runtime library that comprises these routines is
the second component in an OpenMP implementation. Commercial compilers will
use the most efficient means possible to start, stop, and synchronize the executing
threads on the target platforms they support. Open source compilers will typically
place a higher value on portability and may rely on Pthreads routines to accomplish
this functionality.
Initially, an OpenMP program’s execution begins with a single thread of control,
just like a sequential program. When this initial thread encounters code that is to
be executed in parallel, it creates (and becomes a member of) a team of threads to
execute the parallel code; the original thread becomes the master of the team. The
team of threads will join at the termination of the parallel code region, so that only the
original master thread continues to carry out work. Hence OpenMP is directly based
upon the fork-join model of execution. An OpenMP program may contain multiple
code regions that are outside of any parallel constructs and tasks; collectively, they
are known as the sequential part of the program and they will be executed by the
initial thread. Since the members of a team of threads may encounter code that is to
be executed in parallel, they may each create and join a new team of threads to execute
it. This can lead to nested parallelism, where hierarchies of thread teams are created
to perform computations (Jost et al. 2004, Chapman et al. 2006). As a result, several
different teams of threads may be active at the same time. Most OpenMP constructs
are applied to the current team, which is the particular team that is currently executing
the region of code containing it.
/***************************************************************
C code with a parallel region where each thread
reports its thread id.
****************************************************************/
#pragma omp parallel
{
printf("Hello from thread %d.\n",omp_get_thread_num());
}
Possible output:
Hello from thread 3.
Hello from thread 6.
Hello from thread 2.
Hello from thread 7.
Hello from thread 0.
Hello from thread 5.
Hello from thread 1.
Hello from thread 4.
Example 6.1
This example shows each thread executing the same statement. Possible output
with eight threads is shown.
The most frequently used worksharing directive is the loop directive. This direc-
tive distributes the iterations of the associated countable loop nest among the threads
in the team executing the enclosing parallel region. Note that, to be countable, the
number of iterations of the loop must be known at the start of its execution (in con-
trast to, say, a loop nest that traverses a linked list). The application programmer may
optionally describe the method to be used for assigning loop iterations to threads.
Another worksharing directive, the sections directive, enables the specification
of one or more sequential sections of code that are each executed by a single
thread. The single directive marks code that should be executed by precisely one
thread. Finally, the workshare directive is provided for Fortran programmers to
specify that the computations in one or more Fortran 90 array statements should be
distributed among the executing threads.
!*******************************************************
! Fortran code with workshare and atomic constructs.
!*******************************************************
!$OMP PARALLEL
!$OMP WORKSHARE
A = B - 2*C
B = A/2
C = B
!$OMP ATOMIC
R = R + SUM(A)
!$OMP END WORKSHARE
!$OMP END PARALLEL
Example 6.2
This workshare directive directs the parallel execution of each statement of array
operations. The ATOMIC directive synchronizes writes to R to prevent a data race.
Each of these directives except the workshare directive may have one or more
clauses appended to control its application. The applicable clauses differ depending
on the kind of directive, but all of them allow for the specification of attributes for the
data used in the construct. Additional clauses that may be added to the parallel
directive can be used to state how many threads should be in the team that will
execute it and to describe any conditions that must hold for the parallel region to
be actually executed in parallel. If the specified conditions do not hold, the associ-
ated code will be executed by a single thread instead. Additional information that
may be given via clauses in conjunction with the loop directive includes stating how
many loops in the loop nest immediately following the directive will be parallelized,
and specifying a loop schedule, which is a strategy for assigning loop iterations to
threads.
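As an illustration (the routine work and the bounds n and m are placeholders), a parallel region with an if condition, a requested team size, a collapsed loop nest, and a dynamic schedule might be written as:

/* Illustrative sketch only: work(), n, and m are placeholders. */
#pragma omp parallel if(n > 10000) num_threads(4)
{
    #pragma omp for collapse(2) schedule(dynamic, 100)
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
            work(i, j);
}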
One of the most powerful innovations in the OpenMP API is that it does not require
the worksharing directives to be in the same procedure as the parallel directive.
In other words, one or more procedure calls may occur within a parallel construct
and any of the invoked procedures may contain OpenMP directives. Directives that
are not in the same procedure as the enclosing parallel directive are known as
orphaned directives. Note, too, that a procedure containing orphaned directives may
also be invoked outside of any parallel region, i.e., in the sequential part of the code.
In such a situation, the directive is simply ignored.
Since the threads in a team perform their work independently, there can be minor
variations in the speed at which computations progress on the different threads. Thus,
the relative order in which results are produced is in general not known a priori. More-
over, since the results of computations are often initially stored in memory that is local
to the thread where they were computed, new values may not be immediately accessi-
ble to threads running elsewhere on the machine. In order to provide some guarantees
on the availability of data that are independent of any hardware or operating system
consistency mechanisms, both parallel regions and worksharing constructs are termi-
nated by a barrier, at which point the threads wait until all of them have completed
execution of the construct and results are made available to all threads. Subsequent
computations may therefore safely assume their availability and exploit them. Since
a barrier may introduce inefficiencies if some threads must wait for slower ones to
complete their assigned portion of work, it is possible to override its insertion at the
end of worksharing directives (but not at the end of parallel regions, where the bar-
rier is mandatory). For this, the nowait clause is provided. However, if the barrier
is omitted, there are no guarantees on the availability of new values of shared data.
Note that it is part of the job of the application developer to ensure that there is no
attempt to use data before it is known to be available. The implementation does not
test for this kind of error in the use of OpenMP directives.
In addition to the barrier, there are a few other features in the API that trigger the
updating of shared data so that all threads will have a consistent view of their values.
In addition to writing back any new values to main memory, threads will then retrieve
any new values for shared variables that they have copied into local memory. Such
places in the code are synchronization points. Between synchronization points, code
executing on different threads may temporarily have different values of shared data.
As a result, OpenMP has a relaxed consistency model (Adve and Gharachorloo 1996,
Bronevetsky and de Supinski 2007).
Example 6.3
Use of OpenMP directives with Fortran.
Our first code example shows how OpenMP can be used to parallelize a loop nest
in a Fortran program. A parallel region has been identified via the insertion of two
directives, one at the start marking its beginning and one at the end. Note that here the
parallel directive and the loop directive, needed to specify that the following loops’
iterations should be shared among the threads, have been combined in a single line.
Since statements beginning with an exclamation mark are considered to be com-
ments in Fortran 90, a compiler that does not implement OpenMP will consider these
directives to be comments and will simply ignore them. This will also happen if the
user does not invoke the compiler with any options required to instruct it to translate
OpenMP. The directives are clearly recognizable as a result of the omp prefix which
is used to mark them.
Here, the parallel do directive will result in the formation of a team of threads:
the programmer has not specified how many threads to use, so that either a default
value will be applied or a value that has been set prior to this will be utilized. Each
thread in the team will execute a portion of the parallel loop that follows. Since the
collapse clause has been specified with 2 as an argument, both of the following
two loops will be collapsed into a single iteration space and then divided as directed
by the schedule clause. Thus, the work in the iterations of the i and j loops, a
total of np∗ nd iterations of the loop body, will be distributed among the threads in
the current team. The static schedule informs the compiler that each thread should
receive one contiguous chunk of iterations. For example, if the current team has 4
threads and np = nd = 8, there are 64 iterations to be shared among the threads.
Each of them will have 16 contiguous iterations. Thus, thread 0 might be assigned
the set of all iterations for which i is 1 or 2. If a different mapping of iterations to
threads is desired, a chunk size can be given along with the kind of schedule. With a
chunk size of 8, our thread 0 will instead execute two contiguous sets of iterations.
These might be all iterations for which i is either 1 or 5.
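To make the combined directive concrete, a minimal sketch (the array names and the loop body here are placeholders, not the chapter's listing) is:

!*******************************************************
! Sketch only (placeholder arrays): a combined parallel
! loop directive with collapse(2) and a static schedule.
!*******************************************************
!$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(STATIC)
do i = 1, np
   do j = 1, nd
      a(i,j) = a(i,j) + b(i,j)    ! hypothetical loop body
   end do
end do
!$OMP END PARALLEL DO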
/*******************************************************
Basic matrix multiply using OpenMP.
*******************************************************/
#pragma omp parallel private(i,j,k)
{
#pragma omp for schedule(static)
for (i = 0; i < N; i++)
for (k = 0; k < K; k++)
for (j = 0; j < M; j++)
C[i][j] = C[i][j] + A[i][k]*B[k][j];
} /* end omp parallel */
Example 6.4
This shows a naïve matrix multiply. Arrays A, B, and C are shared by default. Loop
counters are kept private.
/*******************************************************
Orphaned OpenMP worksharing directive (sections) with
a nowait clause.
*******************************************************/
int g;
void foo(int m, int n) {
int p, i;
#pragma omp sections firstprivate(g) nowait
{
#pragma omp section
{
p = foo(g);
for (i = 0; i < m; i++)
do_stuff;
}
#pragma omp section
{
p = bar(g);
for (i = 0; i < n; i++)
do_other_stuff;
}
}
return;
}
Example 6.5
Use of OpenMP directives with C.
Since each task will perform work related to the node that was current at the time of
execution, the value of the variable currentNode must also be saved for retrieval
when the work is performed. For this reason, variables passed to a task are firstprivate
by default.
/*******************************************************
Explicit OpenMP tasks to parallelize code traversing a
linked list in C. Though currentNode is firstprivate by
default, it is good practice to use explicit data
scoping attributes.
*******************************************************/
void processList(Node * list)
{
#pragma omp parallel
#pragma omp single
{
Node * currentNode = list;
while (currentNode) {
#pragma omp task firstprivate(currentNode)
doWork(currentNode);
currentNode = currentNode->next;
}
}
}
Example 6.6
OpenMP task directive.
6.2.4 Synchronization
When a new value is computed for a shared variable during program execution, that
value is initially stored either in cache memory or in a register. In either case, it may
be accessible only to the thread that performed the operation. At some subsequent,
unspecified time, the value is written back to main memory from which point on it
can be read by other threads. The time when that occurs will depend partly on the
way in which the program was translated by the compiler and in part on the policies
of the hardware and operating system. Since the programmer needs some assurances
on the availability of values so that threads may cooperate to complete their work,
OpenMP provides its own set of rules on when new values of shared data must be
made available to all threads and provides several constructs to enforce them.
Many OpenMP programs rely on the barriers that are performed at the end of work-
sharing regions in order to ensure that new values of shared data are available to all
threads. While this is often sufficient to synchronize the actions of threads, there is
also a barrier directive that may be explicitly inserted into the code if needed.
It is then essential that all the threads in the current team execute the barrier, since
otherwise the subset of threads that do so will wait for the remaining threads indefi-
nitely, leading to deadlock. Note that all waiting tasks will be processed at a barrier,
so that the program can safely assume that this work has also been completed when
the threads proceed past this synchronization point in the program.
There is also a directive that can be used to enforce completion of the execution
of waiting tasks. Whereas a barrier will require all threads to wait for the comple-
tion of all tasks that have been created, both implicit and explicit, the taskwait
directive applies only to specific explicit tasks. The taskwait directive stipulates
that the currently executing parent task wait for the completion of any child tasks
before continuing. In other words, it applies only to child tasks created and not to any
subsequently generated tasks.
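A small illustrative fragment (traverse is a hypothetical recursive routine) shows the typical pattern:

/* Illustrative fragment: traverse() is a hypothetical recursive routine. */
int left_count, right_count;
#pragma omp task shared(left_count)
left_count = traverse(node->left);
#pragma omp task shared(right_count)
right_count = traverse(node->right);
#pragma omp taskwait                 /* wait for the two child tasks only */
result = left_count + right_count;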
Many computations do not require that shared data accesses occur in a specific
order, but do need to be sure that the actions of threads are synchronized to pre-
vent data corruption. For example, if two different threads must modify the value
of a shared variable, the order in which they do so might be unimportant. But it is
necessary to ensure that both updates do not occur simultaneously. Without some
form of protection, it is conceivable that a pair of threads both access the variable
being modified in quick succession, each obtaining the same “old” value. The two
threads could then each compute their new value and write it back. In this last step,
the value of the first update will simply be overwritten by the value produced by
the slower thread. In this manner, one of the updates is simply lost. To prevent this
from happening, the programmer needs a guarantee that the corresponding compu-
tations are performed by just one thread at a time. In order to ensure this mutual
exclusion, the OpenMP programmer may use the critical directive to ensure the
corresponding critical region may be executed by only one thread at a time. When a
thread reaches such a region of code at runtime, it must therefore first check whether
another thread is currently working on the code. If this is the case, it must wait until
the thread has finished. Otherwise, it may immediately proceed. A name may be asso-
ciated with a critical construct. If several critical directives appear in the code with
the same name, then the mutual exclusion property applies to all of them. This means
that only one thread at a time may perform the work in any critical region with a
given name.
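For illustration, updates to two unrelated shared variables can be protected by differently named critical regions so that they do not exclude one another (compute_dx and compute_dy are placeholders):

#pragma omp parallel
{
    #pragma omp critical (update_x)
    x += compute_dx();               /* hypothetical helper */
    #pragma omp critical (update_y)
    y += compute_dy();               /* may proceed while update_x is busy */
}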
Locks offer an alternative means of specifying mutual exclusion. They are dis-
cussed briefly in Section 6.2.5. Note that for some simple cases, the atomic direc-
tive suffices. For example, if the operation that must be protected in this fashion is a
simple addition or subtraction of a value, then prefixing this by the atomic directive
ensures that the fetching, updating, and writing back of the modified value occurs as if
it were a single indivisible operation. The value must also be flushed back to memory
so that another thread will access the new value when it performs the operation.
One of the least understood features of the OpenMP API is the flush directive,
which has the purpose of updating the values of shared variables for the thread that
encounters it. In contrast to the features introduced earlier, this directive does not
synchronize the actions of multiple threads. Rather, it simply ensures that the thread
executing it will write any new shared values that are locally stored back to shared
memory and will retrieve any new values from the shared memory for shared data that
it is using. However, if two different threads encounter a flush directive at different
times, then they will not share their new values as a result. The first of the threads will
make its data available but will not have access to those created by the thread that has
not yet reached that point in the program. As a result, it can be tricky to synchronize
the actions of multiple threads via flushing, although it can offer an efficient way of
doing so once the principle is understood.
Most existing shared memory systems provide support for cache coherency, which
means that once a cache line is updated by a thread, any other copies of that line
(including the corresponding line in main memory) are flagged as being “dirty”
(Bircsak et al. 2000). This means that the remaining threads know that the data on
other copies is invalid. As a result, any attempt to read data from that line will result
in the updated line being shared. In other words, the thread will indeed get the new
values—in contrast to the description mentioned earlier. Thus, on many of today’s
systems, regular flushing has little if any impact on performance. As soon as data
is saved in cache, the system will ensure that values are shared. (So there is only a
problem if it is still in a register.) Some performance problems result from this, as
discussed in the following. Also, not all systems guarantee this and, in the future, it
is likely to be a bigger problem, as it can be expensive to support cache coherency
across large numbers of threads.
/*******************************************************
These two loops achieve the same end. The first shows a
reduction operation using an OpenMP for loop and a
critical region, without which data corruption might
otherwise occur. The second uses an OpenMP for loop with
a reduction clause to direct the implementation to take
care of the reduction.
*******************************************************/
#pragma omp parallel shared(array,sum)
firstprivate(local_sum)
{
#pragma omp for private(i,j)
for(i = 0; i < max_i; i++){
for(j = 0; j < max_j; ++j)
local_sum += array[i][j];
}
#pragma omp critical
sum += local_sum;
} /*end parallel*/
/*** Alternate version ***/
sum = 0;
#pragma omp parallel shared(array)
{
#pragma omp for reduction(+:sum) private(i,j)
for(i = 0; i < max_i; i++){
for(j = 0; j < max_j; ++j){
sum += array[i][j];
}
}
} /*end parallel*/
Example 6.7
Reductions in OpenMP.
thread ID of its ancestor in any of those levels, and the maximum number of threads
available to a program.
Many of these values can be set prior to execution by assigning a value to the
corresponding environment variable. Environment variables can be also used to set
the stack size and a wait policy. We discuss these briefly in Section 6.3.
Another set of routines is available for the explicit use of locks to control threads’
access to a region of code. In addition to providing the same kind of control as a
critical construct, it is possible to test the value of a lock prior to an attempt to acquire
it. If the test shows that it is set, the thread may be able to perform other work rather
than simply waiting. The OpenMP header should be included in C/C++ programs via
the preprocessor directive #include <omp.h>.
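An illustrative fragment (do_other_work and update_shared_state are placeholders) shows how omp_test_lock can be used to keep a thread busy instead of blocking:

#include <omp.h>

omp_lock_t lck;

void update_with_lock(void)
{
    omp_init_lock(&lck);
    #pragma omp parallel
    {
        while (!omp_test_lock(&lck)) {
            do_other_work();          /* placeholder: useful work while waiting */
        }
        update_shared_state();        /* placeholder: code protected by the lock */
        omp_unset_lock(&lck);
    }
    omp_destroy_lock(&lck);
}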
/*******************************************************
SPMD code style using threadids and using a barrier
*******************************************************/
#pragma omp parallel private(my_id, local_n) shared(a,p)
{
p = omp_get_num_threads();
my_id = omp_get_thread_num();
if (my_id == 0){
printf("I am the master of %d thread(s).\n", p);
}else{
printf("I am worker #%d\n", my_id);
}
if (my_id == 0){
a = 11;
local_n = 33;
}else{
local_n = 77;
}
#pragma omp barrier
if (my_id != 0){
local_n = a*my_id;
}
#pragma omp barrier
printf("Thread %d has local_n = %d\n", my_id, local_n);
}
printf("a = %d\n", a);
Example 6.8
Explicit assignment of work to threads in SPMD style.
needed by active threads. So when the wait can be longer, it might be preferable to
put them to sleep.
A parallel loop will typically be replaced by, first, a routine that each thread invokes
to determine the set of loop iterations assigned to it. Then the iterations are performed
and a barrier routine invoked. For static schedule kinds, each thread may indepen-
dently determine its share of the work. If a dynamic or guided schedule has been
specified, then a thread may need to carry out multiple sets of iterations. In such
cases, once it has completed one set, it will again invoke a routine to get another por-
tion of the work. In this case, more coordination is required behind the scenes and
the corresponding routine is invoked several times. As a result, overheads are slightly
higher. However, if the loop iterations contain varying amounts of work, it is often
faster than a static schedule. Parallel sections are often converted to a parallel loop
that has as many iterations as there are sections. Branches to the individual sections
ensure that the ith iteration will perform exactly the work of the ith section.
A single directive is typically executed by the first thread that reaches it. To
accomplish this, each thread will test the value of a shared variable that indicates
whether or not the code has been entered. The first thread to reach it will set the vari-
able and thus ensure that other threads do not attempt to perform the computations.
As with any access to a shared variable, this test and set should be done atomically
by the implementation.
There are several ways in which OpenMP tasks may be implemented (Addison
et al. 2009). A simple strategy is to create a queue of tasks, which holds all tasks that
have been generated but whose execution has not been completed. More sophisticated
strategies may implement one or more queues of tasks for each thread and enable
threads to steal tasks from other threads’ queues if they have no more tasks in their
own queue (Frigo et al. 1998, Duran et al. 2008). This might, for instance, enable some
amount of data locality. Since tied tasks may be suspended, and thus put back onto a
queue, there might be an additional queue per thread to hold these. Such tasks cannot
be stolen by other threads so this will separate them from those that could be executed
by any thread. Implementations must also consider whether to prefer to continue to
generate tasks until a synchronization point is encountered or some threshold has been
reached (which may depend on the overall number of tasks generated, the number
already in a queue, or some additional criteria), or whether they should use some
other strategy to decide when to start executing tasks that are ready.
Critical regions are usually implemented with low-level locks managed by the run-
time. Threads that are waiting to enter a critical region may do so actively, i.e., by
frequently checking to find out if the region is still being worked on. They can also
do so passively, in which case the thread simply sleeps and is woken up when the
computation is ready for it to proceed. Both modes can typically be supported by an
implementation. Whereas the former case usually allows the thread to begin its work
faster, it also causes the thread to use resources that may be needed by other threads,
thereby interfering with their progress. Nevertheless, if the wait time is expected to be
very short, it may be a good choice. If the waiting time is long, the latter may be better.
Since the implementation may not be able to determine which policy is preferable,
the OpenMP API allows the application developer to influence its strategy.
The OpenMP (user-level) library routines are often simply replaced by a corre-
sponding runtime procedure. Since some of runtime routines are frequently invoked—
this applies particularly to the barrier implementation and to the function that retrieves
a thread’s identifier—they are typically very carefully crafted. There are, for instance,
a variety of algorithms for performing barriers and efficient versions can signifi-
cantly outperform naïve barrier implementations (Nanjegowda et al. 2009). Locks
and atomic updates might also have straightforward implementations. The under-
lying thread package chosen for thread management operations will depend on the
options available. Where portability is required, Pthreads is often selected. But most
systems provide more efficient alternatives. Note that OpenMP is beginning to be
implemented on some kinds of systems that do not provide a fully fledged operating
system and thus sometimes have minimal support for thread operations. For these
systems too, the implementation will choose the best primitives available (Chapman
et al. 2009).
Although OpenMP has most often been implemented on cache-coherent shared
memory parallel computers, including multicore platforms, it has also been imple-
mented on non-cache-coherent platforms, large distributed shared memory machines,
and there are a few implementations for distributed memory machines also (Huang
et al. 2003, Marowka et al. 2004, Hoeflinger 2006). OpenMP has begun to be used as
a programming model for heterogeneous multicore architectures (Liu and Chaudhary
2003, Chapman et al. 2009), where different kinds of cores are tightly coupled on a
chip or board. An early implementation of this kind targeted the Cell, which combines
a general purpose core (PPE) with multiple special purpose cores (SPEs); the SPEs
do not share memory with the PPE, which makes the translation considerably tougher
(O’Brien et al. 2008). A similar approach was taken to enable the convenient use of
ClearSpeed’s accelerators in conjunction with general purpose multicore hardware
(Gaster and Bradley 2007). Given the growth in platforms that provide some kind of
accelerator and need to facilitate programming across such systems, we expect to see
more attempts to provide such implementations in the future (Ayguadé et al. 2009a).
Our example gives a flavor of the code that is generated to execute the loop in
Example 6.7. It begins with the outlined procedure that encapsulates the work of the
parallel region. The compiler has generated a name for the procedure that indicates
that it is the second parallel region of the main procedure. Here, too, there are no
standards: in some tools in the programming environment, these names may be made
visible to the user. Next the compiler has generated a private variable in which each
thread will store its local portion of the reduction operation. It then saves the original
bounds of the loop that will be distributed among threads before passing these as
arguments to the procedure that will determine static schedule. Note that each thread
independently invokes this routine, and that it will use the bounds as well as the
thread number to determine its own set of iterations. A call to the barrier routine
has been inserted to ensure that the threads wait until all have completed their share
of the parallel loop. Next, a critical region is used to combine the local sums, in order
to complete the reduction. Note that reduction operations are another feature that
permits a variety of implementations, some of which are much more efficient than
others.
/*******************************************************
Possible implementation of the code in the 2nd parallel
region in Example 6.7
*******************************************************/
static void __ompregion_main2(thrdid)
{
/* var declarations omitted */
local_sum = 0;
limit = max_i + -1;
do_upper = limit;
do_lower = 0;
last_iter = 0;
__ompc_static_init(thrdid, 2, &do_lower,
&do_upper, &do_stride, 1, 1);
if(do_upper > limit)
{
do_upper = limit;
}
for(local_i = do_lower; local_i <= do_upper;
local_i = local_i + 1)
{
local_j = 0;
while(local_j < max_j)
{
local_sum = array[local_i][local_j] + local_sum;
local_j = local_j + 1;
}
}
__ompc_reduction(thrdid, &lock);
sum = local_sum + sum;
__ompc_end_reduction(thrdid, &lock);
__ompc_barrier();
return;
} /* __ompregion_main2 */
Example 6.9
OpenMP implementation strategy.
The manner in which memory is used by a program has a great impact on per-
formance, namely in its use of the cache (Jin et al. 1999). As mentioned previously,
cache lines are flagged in their entirety when “dirty.” It is not just the needed variable
that is updated but the entire cache line that contains it! This is expensive! Therefore
it is prudent to minimize cache misses whenever possible. One way to avoid this is
to make sure loops are accessing array elements in the proper order, either by row
or column, for the language being used. Fortran arrays are stored in column-major
order and C/C++ are row-major, so loops should be modified if needed to access array
elements accordingly.
One side effect of cache coherent systems is false sharing. This is the interference
among threads writing to different areas of the same cache line. The write from one
thread causes the system to notify other caches that this line has been modified, and
even though another thread may be using different data it will be delayed while the
caches are updated. False sharing often prevents programs from scaling to a high
number of threads. It may require careful scrutiny of the program’s access patterns if
it is to be avoided.
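As an illustration (the 64-byte line size and the names here are assumptions), adjacent per-thread counters are a classic source of false sharing, and padding each counter to its own cache line is one common remedy:

#include <omp.h>
#define NTHREADS   8
#define LINE_BYTES 64                       /* assumed cache-line size */

long hits[NTHREADS];                        /* adjacent counters: prone to false sharing */

struct padded { long value; char pad[LINE_BYTES - sizeof(long)]; };
struct padded hits_padded[NTHREADS];        /* one counter per cache line */

void count(int n)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < n; i++)
            hits_padded[id].value++;        /* writes stay on this thread's own line */
    }
}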
It is also important to achieve a balanced workload. The time required for a thread
to carry out its portion of the computation should be, as far as possible, the same as
the time required for each of the other threads between any pair of synchronization
points. This is generally accomplished with a suitable loop schedule for the given
algorithm. Some experimentation may be necessary to determine the schedule that
provides the best performance for any given use of a loop directive.
OpenMP provides an easy model for incrementally writing parallel code, and it
is particularly easy to obtain an initial parallel program, but special considerations
are needed to ensure that a program performs well. With due diligence, an OpenMP
programmer can address these common performance obstacles and obtain a parallel
application that is both efficient and scalable.
6.5 Summary
With the rapid proliferation of multicore platforms and the growth in the number
of threads that they may support, there is an urgent need for a convenient means to
express the parallelism in a broad variety of applications. OpenMP began as a vehicle
for parallelizing (primarily) technical computations to run on small shared memory
platforms. Since that time, it has evolved in order to provide a means to express a
number of parallel programming patterns that can be found in modern computations
and to support the parallelization of applications written in Fortran, C, and C++. The
latest version supports the expression of multilevel parallelism, both loop- and task-based parallelism, and dynamic adjustment of the manner of the program's execution (including the ability to modify the number of threads that will be used to execute a parallel region); it includes features for fine-grained load balancing and offers the ability to write high-level, directive-based code as well as low-level code that explicitly specifies the instructions to be executed by the different threads. The growth in the number of features has been relatively modest.
The challenges for OpenMP are therefore to support the programming of systems
with large numbers of threads and to ensure that it is able to express the diverse
patterns of parallelism that occur in modern technical and nontechnical application
codes alike. Current work is exploring means to provide for a coordinated mapping
of work and data to threads, to enable error handling, and to enhance the task inter-
face. With the introduction of systems based on heterogeneous cores, the complexity of the application development process has once more increased significantly, and it will be interesting to see how well OpenMP can target such systems as well. Early work has already begun to address this topic (Ayguadé et al. 2009a, Huang and Chapman 2009). The OpenMP ARB is actively considering a variety of strategies and features for addressing these challenges, as well as for providing additional help to deal with errors and for enhancing several of its existing features.
References
Addison, C., J. LaGrone, L. Huang, and B. Chapman. 2009. OpenMP 3.0 task-
ing implementation in OpenUH. In Open64 Workshop in Conjunction with the
International Symposium on Code Generation and Optimization. http://www.
capsl.udel.edu/conferences/open64/2009/ (accessed October 5, 2009).
Adve, S. V. and K. Gharachorloo. 1996. Shared memory consistency models:
A tutorial. Computer, 29(12), 66–76.
Ayguadé, E., R. M. Badia, and D. Cabrera. 2009a. A proposal to extend the OpenMP
tasking model for heterogeneous architectures. In International Workshop on
OpenMP, Dresden, Germany, pp. 154–167.
Ayguadé, E., B. Blainey, A. Duran et al. 2003. Is the schedule clause really necessary
in OpenMP? In Workshop on OpenMP Applications and Tools, Toronto, Ontario,
Canada, pp. 147–159.
Ayguadé, E., N. Copty, A. Duran et al. 2009b. The design of OpenMP tasks. IEEE
Transactions on Parallel and Distributed Systems, 20(3), 404–418.
Bircsak, J., P. Craig, R. Crowell et al. 2000. Extending OpenMP for NUMA machines.
Scientific Programming, 8(3), 163–181.
Blikberg, R. and T. Sørevik. 2005. Load balancing and OpenMP implementation of
nested parallelism. Parallel Computing, 31(10–12), 984–998.
Bronevetsky, G. and B. R. de Supinski. 2007. Complete formal specification of the
OpenMP memory model. International Journal of Parallel Programming, 35(4),
335–392.
Sun Microsystems, Inc. 2005. OpenMP support in Sun Studio compilers and tools.
http://developers.sun.com/solaris/articles/studio_openmp.html (accessed October
5, 2009).
Weng, T.-H. and B. Chapman. 2003. Toward optimization of OpenMP codes for syn-
chronization and data reuse. In The 2nd Workshop on Hardware/Software Sup-
port for High Performance Scientific and Engineering Computing (SHPSEC-03),
in conjunction with the 12th International Conference on Parallel Architectures
and Compilation Techniques (PACT-03), New Orleans, LA.
Part III
Programming Heterogeneous
Processors
Chapter 7
Scalable Manycore Computing with CUDA
7.1 Introduction
The applications that seem most likely to benefit from major advances in computa-
tional power and drive future processor development appear increasingly throughput
oriented, with products optimized more for data or task parallelism depending on
their market focus (e.g., HPC vs. transactional vs. multimedia). Examples include
the simulation of large physical systems, data mining, and ray tracing. Processor designs aimed at such throughput-oriented workloads emphasize many small cores because they eliminate most of the hardware needed to speed up the performance of an individual thread. These
simple cores are then multithreaded, so that when any one thread stalls, other threads
can run and every core can continue to be used to maximize the application’s overall
throughput. Multithreading in turn relaxes requirements for high performance on any
individual thread. Small, simple cores therefore provide greater throughput per unit
of chip area and greater throughput within a given power or cooling constraint. The
high throughput provided by “manycore” organizations has been recognized by most
major processor vendors.
To understand the implications of rapidly increasing parallelism on both hardware
and software design, we believe it is most productive to look at the design of modern
GPUs (Graphics Processing Units). A decade ago, GPUs were fixed-function hard-
ware devices designed specifically to accelerate graphics APIs such as OpenGL and
Direct3D. In contrast to the fixed-function devices of the past, today’s GPUs are fully
programmable microprocessors with general-purpose architectures. Having evolved
in response to the needs of computer graphics—an application domain with tremen-
dous inherent parallelism but increasing need for general-purpose programmability—
the GPU is already a general-purpose manycore processor with greater peak perfor-
mance than any other commodity processor. GPUs simply include some additional
hardware that typical, general-purpose CPUs do not, mainly units such as rasterizers
that accelerate the rendering of 3D polygons and texture units that accelerate filtering
and blending of images. Most of these units are not needed when using the GPU as
a general-purpose manycore processor, although some can be useful, such as texture
caches and GPU instruction-set support for some transcendental functions. Because
GPUs are general-purpose manycore processors, they are typically programmed in
a fashion similar to traditional parallel programming models, with a single-program,
multiple data (SPMD) model for launching a large number of concurrent threads, a
unified memory, and standard synchronization mechanisms.
High-end GPUs cost just hundreds of dollars and provide teraflop performance
while creating, executing, and retiring literally billions of parallel threads per sec-
ond, exhibiting a scale of parallelism that is orders of magnitude higher than other
platforms and truly embodies the manycore paradigm. GPUs are now used in a wide
range of computational science and engineering applications, and are supported by
several major libraries and commercial software products.
the performance of large parallel workloads. Since any particular program will consist
of both latency-sensitive sequential sections and throughput-sensitive parallel sec-
tions, it is advantageous to have processors optimized for both types of workload
(Figure 7.1).
From a programmer’s perspective, the key ingredients of the GPU architecture
can be broken into three categories. The chip itself consists of a collection of mul-
tithreaded multiprocessors called SMs, each of which can run a large collection of
threads. The hardware provides a unified memory model with fast local on-chip mem-
ories and global off-chip memory visible to all threads. These memory spaces pro-
vide a relaxed memory-consistency model [1], meaning that memory operations from
one thread might appear to other threads out of order. Thread barrier, memory fence,
and atomic read-modify-write primitives provide synchronization and ordering guarantees where they are required.
Each of the multiprocessors in the GPU is designed to manage and execute a large
population of threads—up to 1536 in current generation hardware. Every thread rep-
resents an independent execution trace in that it possesses its own program counter,
stack, and scalar register set. Threads communicate through shared memory spaces
and synchronize at barriers. In short, they are fundamentally similar to user-level CPU
threads.
In reality, a multiprocessor will contain far fewer physical processing elements
than the total number of threads it can support. Therefore, the virtual processors
represented by the threads are time multiplexed in a fine-grained fashion onto the
physical processing elements, with a new thread running every cycle. Each scalar
processor hosts dozens of threads, providing high latency tolerance (e.g., for accesses
to off-chip memory). Correct thread execution does not depend on which processing
element hosts the thread’s virtual processor.
The processing cores themselves support a fully general-purpose, scalar instruction
set. They provide full support for both integer and IEEE floating point—at both single
and double precision—arithmetic. They provide a standard load/store model of mem-
ory, which is organized as a linear sequence of bytes. And they support the various
other features typical of modern processors, including normal branching semantics,
virtual memory and pointers, function calls, etc. The GPU also offers some instruction
set support for important transcendental functions such as trigonometric functions
and reciprocal square root that other commodity architectures do not. These can
dramatically improve performance over conventional ISAs and are important across
a wide range of applications.
A device function may only be called within the device program and will execute
on the device. A host function may only be called within the host program and will
execute on the host. Functions marked with both specifiers may be called in either
the host or device program and will execute on the processor where they are called.
Functions without any placement annotation are assumed to be host functions.
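In CUDA C these placement specifiers are written __device__ and __host__; the function bodies below are ours and only illustrate where each function may be called and where it executes:

// Callable only from device code; executes on the GPU.
__device__ float square(float x) { return x * x; }

// Callable only from host code; executes on the CPU.
// (This is also the default for unannotated functions.)
__host__ float host_square(float x) { return x * x; }

// Compiled for both sides; executes wherever it is called from.
__host__ __device__ float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}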
The components of these thread and block coordinates are consecutive integers ranging from zero up to one less than the corresponding dimension. All of these special variables are set by the runtime environment and cannot be changed by the program.
The threads of a kernel start executing at the same entry point, namely, the kernel
function. However, they do not subsequently need to follow the same code sequence.
In particular, most threads will likely make different decisions or access different
memory locations based on their unique thread/block coordinates. This style of exe-
cution is often referred to as “single program, multiple data” or SPMD. By default,
when a host program launches multiple kernels in sequence, there will be an implicit
barrier between these kernels. In other words, no thread of the second kernel may be
launched until all threads of the first have completed. CUDA provides additional APIs
for launching independent kernels whose threads may be potentially overlapped, both
with other kernels and memory transfers.
All threads of a kernel have their own local variables, which are typically stored in
GPU registers. These are private and are not accessible to any other thread. Threads
also have direct access to any data placed in device memory. This data is common
to all threads. When accessing device memory, the programmer must either use the
unique thread/block coordinates to guarantee that all threads are accessing sepa-
rate data, or use appropriate synchronization mechanisms to avoid race conditions
between threads.
Each thread block will have its own private copy of z, which will be allocated in
the on-chip shared memory when the thread block is launched.
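As a hedged sketch, a per-block variable z of this kind might be declared as follows; the element type, the block size, and the surrounding kernel are our assumptions, not a listing from this chapter:

#define BLOCKSIZE 256

__global__ void stage_and_scale(const float *in, float *out)
{
    // One copy of z per thread block, allocated in on-chip shared
    // memory when the block is launched; visible to all of its threads.
    __shared__ float z[BLOCKSIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    z[threadIdx.x] = in[i];        // stage data in fast shared memory
    __syncthreads();
    out[i] = 2.0f * z[threadIdx.x];
}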
Shared memory is much faster than global memory access—by roughly two orders
of magnitude on current hardware. Consequently, device programs often copy data
into shared memory, process it there, and then write results back to global memory.
In this case, shared memory functions like a local scratchpad memory, and making
good use of it can lead to large performance gains in practice.
Threads within a thread block run in parallel and can share data via shared mem-
ory. They may also synchronize using an explicit barrier primitive, which is exposed
in CUDA C as a __syncthreads() function call. This brings all threads to a
common execution point and also ensures that all outstanding memory operations
have completed. The barrier itself is implemented directly in hardware and is
extremely efficient. On current hardware it compiles to a single instruction; thus,
there is essentially no overhead and the cost of the barrier is simply the time required
for all threads to physically reach the barrier.
It is the programmer’s responsibility to ensure that all threads will eventually exe-
cute the same barrier. Barriers at textually distinct positions within the program are
considered different barriers. The program behavior is undefined if some threads exe-
cute a barrier and others either skip that barrier or execute a textually distinct barrier.
A barrier within conditional code, such as:
if( P ) { ....; __syncthreads(); ....; }
else { ....; __syncthreads(); ....; }
is only well-defined if every thread of the block takes the same branch of the condi-
tional. Similarly, a loop containing a barrier:
while( P ) { ....; __syncthreads(); ....; }
is only safe if every thread evaluates the predicate P in the same way.
CUDA’s block-oriented programming model allows a large variety of applications
to be written by allowing threads to cooperate closely within a thread block, while also
allowing a set of thread blocks to work independently and coordinate across kernel
launches. This basic model is designed to allow very efficient low-level programs
to be written for the GPU while allowing higher level application frameworks to be
built on top using the abstraction features of C and C++ language, e.g., functions
and templates.
In this example, the host variable dA will hold the address of the allocated
memory in the device address space. It may be passed as an argument to host
and kernel functions like any other parameter, but may only be dereferenced
on the device.
• cudaMemcpy: Performs a data transfer between the host and the device.
float *hA = ..., *dA = ...;
cudaMemcpy(dA, hA, N*sizeof(float), cudaMemcpyHostToDevice);
This example copies N floating point values starting at address hA in the host
memory to the address dA in device memory.
• cudaFree: Deallocates device memory allocated with cudaMalloc.
cudaFree(dA);
int main()
{
    // The n-vectors A, B are in host (CPU) memory
    float *hA = ..., *hB = ..., *hC = ...;
    int n = ...;

    // Allocate device (GPU) memory
    int nbytes = n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc((void**) &dA, nbytes);
    cudaMalloc((void**) &dB, nbytes);
    cudaMalloc((void**) &dC, nbytes);

    // Copy host memory to device
    cudaMemcpy(dA, hA, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, nbytes, cudaMemcpyHostToDevice);

    // Copy results back to the host
    cudaMemcpy(hC, dC, nbytes, cudaMemcpyDeviceToHost);

    return 0;
}
the barrier must complete before the host thread is allowed to proceed. Performing
any memory transfer, such as by calling cudaMemcpy(), implicitly introduces a
barrier that waits for all previous kernels to complete.
FIGURE 7.3: As a host program launches kernels, these are delivered to the hard-
ware which schedules blocks onto the SMs.
    // (1) Make sure all outstanding memory operations complete
    __threadfence();
    __syncthreads();

    // (2) Punch ticket counter and determine if we're last
    if( threadIdx.x==0 )
    {
        amLast = (gridDim.x-1) == atomicInc(&counter, gridDim.x);
        if( amLast ) counter=0;
    }

    // (3) Entire block must wait for result
    __syncthreads();
    return amLast;
}
FIGURE 7.4: A simple procedure to determine whether the calling block is the
last block to have reached this point.
It is also important to note that this simple example implementation makes several
assumptions that keep it from being useful in all situations. First, its use of a global
counter implicitly assumes that only one kernel is running at any one time. Second,
it assumes that this grid and its blocks are both one-dimensional. Third, it assumes
that it will only be called once per kernel.
Atomic memory operations can also be used to implement many other more com-
plicated shared data structures, such as shared queues or search trees. However, it is
vital that, as with our simple counter example, no block ever waits for another block
to insert something into the shared data structure. It might be tempting to imagine
implementing producer-consumer queues shared by multiple blocks, but this is dan-
gerously susceptible to deadlock if any block ever waits (e.g., rather than exits) when
the queue is empty.
return values[size-1];
}
FIGURE 7.5: Sequential procedure for scan with a generic operator op.
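Only a hedged sketch of such a sequential routine is given here, assuming an in-place inclusive scan whose return value is the last element, i.e., the reduction of the whole array:

// In-place inclusive scan: values[i] becomes op(values[0], ..., values[i]).
template<typename T, typename Op>
T scan(Op op, T *values, int size)
{
    for (int i = 1; i < size; ++i)
        values[i] = op(values[i-1], values[i]);
    return values[size-1];
}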
    // Update partial result for thread i
    if(active) values[i] = op(partial, values[i]);
    __syncthreads();
}
FIGURE 7.6: Procedure to perform parallel prefix (or scan) with a generic operator
op within a thread block.
if(P) { ... }
__syncthreads();
if(P) { ... }
__syncthreads();
It might at first seem that this code could be written more compactly by removing
the duplicate if(P) statements and placing everything inside the body of a single
conditional. However, this would violate the requirement that barriers can only exist
inside conditionals when all threads of the block will evaluate the conditional in the
same way. Since the active predicate depends on the thread index, it will be eval-
uated differently by different threads. The condition of the loop, on the other hand, is
evaluated identically by every thread and can therefore safely contain barriers within
its body.
    // Place partial sums in scratch space
    scratch[i] = sum;
    __syncthreads();

    // Perform parallel reduction of per-thread partial results
    return scan_block(op, scratch, width, i);
}
Reductions with common operators, such as sum, are provided as built-in array primitives in systems like Fortran 90 and as special collective operations by parallel programming frameworks like OpenMP and MPI.
Figure 7.7 demonstrates how to reduce an array of arbitrary size with a single thread
block. It accepts a generic operator op and a corresponding identity value. It
expects pointers to the beginning and end of the input array, a convention we adopt
following the C++ Standard Template Library, and a pointer to a “scratch” space.
The input array bounded by begin and end can live in either global or shared on-
chip memory and be arbitrarily long. The scratch array should contain space for
one value per thread and, while not required for correctness, should be in shared on-
chip memory for better performance.
The first part of this procedure is a loop during which the block iterates over block-
sized tiles of the input array. In the first iteration, the blocksize threads of the
block will load elements 0 through blocksize-1 with each thread accumulating
its corresponding value into the sum variable that holds its running total. Each thread
offsets its reading location by blocksize and repeats. This is essentially the pattern
of strip mining commonly used on vector machines when looping over arrays. This
access pattern guarantees that contiguously numbered threads i and i + 1 always
access contiguous memory locations in the input array. In turn, this allows the GPU’s
hardware to coalesce these adjacent loads into a minimal number of transactions with
external memory.
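A hedged sketch of this access pattern as a device-side helper follows; the function name is ours and the chapter's own listing in Figure 7.7 differs in its details:

template<typename T, typename Op>
__device__ T accumulate_tiles(Op op, const T *begin, const T *end, T identity)
{
    T sum = identity;
    // Thread t reads elements t, t+blockDim.x, t+2*blockDim.x, ...,
    // so threads t and t+1 always touch adjacent addresses and the
    // hardware can coalesce their loads into few wide transactions.
    for (const T *p = begin + threadIdx.x; p < end; p += blockDim.x)
        sum = op(sum, *p);
    return sum;
}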
After each thread has accumulated a partial result for its slice of the input array,
every thread writes its partial result into the scratch space. They then collectively exe-
cute the scan_block procedure that we described earlier. The return value of this
function is the final combination of all the partial results which it was given, which is
precisely the result of the reduction we are seeking to compute. This performs a bit
more work than necessary, since we require only a reduction rather than a full prefix
sum. We use scan_block simply to avoid the need to give block-level implemen-
tations of both scan and reduce routines.
    if( threadIdx.x==0 )
        results[blockIdx.x] = sum;
}

    // Each block needs a B-element scratch space
    int nbytes = B*sizeof(int);

    // Produce partial sums for P blocks ...
    sum_kernel<<<P, B, nbytes>>>(values, size, results);

    // ... and combine P partial sums together into final sum
    sum_kernel<<<1, B, nbytes>>>(results, P, results);
}
FIGURE 7.8: Code for summing entire arrays using parallel reduction spread
across an entire grid of thread blocks.
The following example uses the Thrust template library to sum a vector of randomly generated integers. The vector containers provide an interface like that of the STL std::vector container, while transparently managing movement of data between host and device memory.
#include <cstdio>
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int main()
{
    // Generate random data on the host
    thrust::host_vector<int> x(1000000);
    thrust::generate(x.begin(), x.end(), rand);

    // Transfer to device and reduce
    thrust::device_vector<int> dx = x;
    int sum = thrust::reduce(dx.begin(), dx.end(), 0, thrust::plus<int>());

    // Print result and exit
    printf("Sum=%d\n", sum);
    return 0;
}
(Figure: block diagram of a Fermi-generation GPU. Each streaming multiprocessor (SM) contains two warp schedulers, a 128 kB register file, scalar cores, special function units (SFUs), and load/store units; the SMs share a global L2 cache and a memory interface to off-chip DRAM, and the GPU connects to the host over the PCIe bus.)
The scalar cores support fused multiply-add (FMA), which combines a multiply and an add using a single rounding step to avoid any loss of precision. For Fermi-generation Tesla products, floating-point double-precision throughput is one half of single-precision throughput, giving 515 GFLOP/s peak throughput for double precision and 1.03 TFLOP/s for single precision. Tesla products also provide ECC protection on all memory structures, including the register file, caches and shared memory, and external DRAM. The GDDR5-based memory interface delivers very high bandwidth to external memory; for example, the Tesla C2050 provides 144 GB/s peak bandwidth.
One of the most important architectural differences in Fermi compared to other
GPU architectures is that it adds a cache hierarchy to the global address space, with
64 kB first-level capacity per SM and a large (768 kB) global L2. The L1 is divided
between cache and per-block shared memory, with either 48 kB of L1 and the con-
ventional 16 kB shared memory, or vice versa. It is important to note that the cache
hierarchy does not support hardware cache coherence. If coherence is required, it
must be supported in software by flushing writes to L2.
All CUDA-capable architectures also support cached texture memory. Texture mem-
ory has been a fixture of 3D graphics hardware for many generations, because it pro-
vides high-bandwidth support to multiple, neighboring cells within an array. This
makes it useful in general-purpose computing for some kinds of read-only array
accesses. Even given the data caches provided by Fermi, using these texture caches
can provide performance benefits since using them increases the aggregate on-chip
cache capacity available to the program.
Another important property of the memory system is that accesses to global mem-
ory are wide, possibly as wide as a warp, and coalesced. Full bandwidth is achieved
when all threads in a warp load from a single cache line. This allows the loads from all
threads in a warp to be serviced in a single transaction to the global memory. This pre-
ferred “SIMD-major” order differs somewhat from the situation on multicore CPUs
which typically prefer that individual threads access contiguous memory locations
themselves.
Atomic read-modify-write operations do not fetch the memory location into the SM, as conventional CPUs do. Instead, the atomic memory
operation is sent to the memory subsystem and waits there until it can complete. This
means that SM compute resources are not tied up waiting for completion, although the
specific warp that issued the atomic may be delayed. Fermi dramatically improves the
bandwidth of atomic and memory-fence operations compared to prior generations.
Among all these aspects of the memory hierarchy, the ones that are most impor-
tant will vary from application to application, and it is always important to identify
the specific bottleneck. In general, for applications that are memory bound, min-
imizing contention for the memory channels will be most important. This can be
achieved by effective use of the per block shared memory and the caches, thus reduc-
ing memory bandwidth utilization; and by maximizing coalescing when memory is
accessed. Maximizing the number of concurrent threads is often helpful too, because
this increases latency tolerance; but only when increasing the thread count does not
increase the cache miss rate. Optimizing memory access patterns will often be more
important than execution divergence within a warp.
References
1. S. V. Adve and M. D. Hill. A unified formalization of four shared-memory mod-
els. IEEE Trans. Parallel Distrib. Syst., 4(6):613–624, 1993.
2. G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT Press,
Cambridge, MA, 1990.
3. W. J. Bouknight, S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh,
and D. L. Slotnick. The Illiac IV system. Proc. IEEE, 60(4):369–388, April
1972.
4. CUDPP: CUDA data-parallel primitives library. http://www.gpgpu.org/
developer/cudpp/, July, 2009.
5. Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. Fast
scan algorithms on graphics processors. In Proceedings of the 22nd Annual Inter-
national Conference on Supercomputing, Island of Kos, Greece, pp. 205–213.
ACM, New York, June 2008.
6. W. Daniel Hillis and G. L. Steele, Jr. Data parallel algorithms. Commun. ACM,
29(12):1170–1183, 1986.
7. J. Hoberock and N. Bell. Thrust: A parallel template library. http://www.
meganewtons.com/, March 2010. Version 1.2.
8. D. Kirk and W. Hwu. Programming Massively Parallel Processors: A Hands-On
Approach. Morgan Kaufmann, San Francisco, CA, 2010.
Chapter 8
Programming the Cell Processor
Christoph W. Kessler
8.1 Introduction
Cell Broadband EngineTM ,∗ often also just called “the Cell processor,” “Cell/B.E.”
or shortly “Cell,” is a heterogeneous multicore processor architecture that was intro-
duced in 2005 by STI, a cooperation of Sony Computer Entertainment, Toshiba, and
IBM. With its special architectural design, Cell can speed up gaming, graphics, and
scientific computations by up to two orders of magnitude over contemporary stan-
dard general-purpose processors, and this at a comparable power consumption. For
instance, in a comparative simulation experiment [58], a Cell processor clocked at 3.2
GHz achieved for a single-precision dense matrix-matrix multiplication (SGEMM) a
performance of 204.7 Gflops† at approximately 40 W, while (at that time) an AMD
OpteronTM processor only reached 7.8 Gflops and an Intel Itanium2TM only 3.0
Gflops on this benchmark.
Most prominently, Cell is currently used as the computational core in the Sony
PlayStation 3 (PS3), a high-performance game console that was released by Sony
in 2006 and has, by September 2010, been sold in more than 47 million units world-
wide [52], but also in the world's fastest supercomputer of 2008–2009, RoadRunner at Los Alamos National Laboratory in New Mexico, USA, which combines 12,960 PowerXCellTM 8i processors and 6,480 AMD OpteronTM dual-core processors for a peak performance of 1.026 Petaflops. Details about the
architecture of Cell will be given in Section 8.2.
A drawback is that Cell is quite hard to program efficiently. In fact, it requires
quite some effort in writing code that comes halfway close to the theoretical peak
performance of Cell. In Section 8.3 we will present some of the techniques that are
necessary to produce code that utilizes the hardware resources efficiently, such as
multithreading, SIMD computing, and multiple buffering of DMA communication.
Our example codes in C are based on the Cell software development kit (SDK) from
IBM [28], which we briefly summarize in Section 8.4.
The programmability issue has spawned various efforts in both academic research
and software industry for developing tools and frameworks for more convenient
programming of Cell. We report on some of these in Section 8.5.
A selection of algorithms tailored for Cell is listed in Section 8.6.
Note that a single chapter like this cannot replace a comprehensive tutorial to
Cell programming with all necessary technical details; we can here just scratch the
surface. We therefore focus on the most important system parts and programming techniques for high-performance programming on Cell, and refer the reader to the relevant literature for further details.
∗ A list of trademarks mentioned in this chapter is given at the end of this chapter.
† IBM confirmed 201 Gflops for SGEMM by measurements [58].
registers of 128 bit size. The PPU has a level-1 (L1) instruction cache and L1 data
cache of 32 KB each.
The PPU supports two hardware threads, which look to the operating system like
two independent processors in a multiprocessor with shared memory. While each
thread has its own instance of the registers, the program counter, and further resources
holding its state, most other resources are shared.
Only the PPE runs a full-fledged operating system. For SDK-level Cell program-
ming, this is usually a version of Linux. For instance, the IBM SDK 3.1 [28] is avail-
able for Red Hat Enterprise Linux 5.2 and Fedora 9.
Virtual memory is organized in segments and these in turn in pages. The PPU
has a memory management unit (MMU) that translates virtual to physical memory
addresses. A translation look-aside buffer (TLB) caches page table entries to speed
up translation between virtual and physical memory addresses.
All values and instructions are stored in big-endian byte order.
∗ SIMD (single instruction stream, multiple data streams) computing is a special control paradigm of par-
allel computing where the same instruction is concurrently applied to multiple elements of an operand
vector; see Section 8.3.4 for further explanation. Most desktop and server processors feature SIMD
instructions in some form today. Cell implements SIMD instructions in both PPU and SPU. SIMD par-
allelism should not be confused with the more general concept data parallelism which also allows more
complex operations, not just individual instructions, to be applied element-wise.
size each. All arithmetic SPU instructions operate on 128-bit words in 128-bit SPU
registers; SPU load instructions move 128-bit words from the SPE’s local store to the
SPU registers, and store instructions work vice versa. A 128-bit register can contain
so-called vector datatypes that consist of either two adjacent 64-bit double-precision
floating-point or long-integer words, or four 32-bit single-precision floating-point or
integer words, or eight 16-bit short integers, or 16 eight-bit characters, where the inte-
gral data types are available in both signed and unsigned form (see also Figure 8.8).
In addition, scalar data types are also supported, but these occupy only a part of a
128-bit register and leave the rest unused (see also Figure 8.7). A computation on,
for instance, a single 32-bit single-precision floating-point word is actually more
expensive than the same operation applied to a vector of four 32-bit floating-point
words stored consecutively in the same SPU register: That one word may first need
to be extracted by bitwise mask operations from its current position in its 128-bit
register, then, for binary operations, possibly shifted to obtain the same relative posi-
tion in a 128-bit word as the other operand, and finally the result needs to be shifted
and/or inserted by bitwise masked operations into its final position in the destination
register.
It should also be noted that the SPU implementation of floating-point arithmetics
deviates from the IEEE-754 standard in a few minor points, see, for example,
Scarpino [50, Ch. 10.2].
The SPU also has certain support for instruction-level parallelism: The seven func-
tional units of the SPU are organized in two fully pipelined execution pipelines,
such that up to two instructions can be issued per clock cycle. The first pipeline
(called “even”) is for instructions of the kind arithmetic, logical, compare, select, byte
sum/difference/average, shift/rotate, integer multiply-accumulate or floating-point
instructions; the second pipeline (called “odd”) is for instructions such as shift/ro-
tate, shuffle, load, store, MFC channel control, or branch instructions. Instruction
issue is in-order, which (in contrast to current out-of-order issue superscalar proces-
sors) relies more on the compiler’s instruction scheduler to optimize for a high issue
rate by alternating instructions of “odd” and “even” kind. The latency for most arith-
metic and logical instructions is only 2 clock cycles, for integer multiply-accumulate
7 cycles, and for load and store 6 cycles. Branches on the SPU can be quite costly;
depending on whether a branch is taken or not taken, its latency is between 1 and
18 clock cycles. There is no hardware support for automatic branch prediction. The
compiler or assembly-level programmer can predict branches and insert branch hint-
ing instructions in an attempt to minimize the average case latency. Fortunately, the
high-level programmer is hardly concerned with these low-level details of instruction
scheduling, but when tuning performance it may still be useful to be aware of these
issues.
The SPEs have only a very simple operating system that takes care of basic func-
tionality such as thread scheduling. All I/O has to be done via the PPE. For instance,
certain I/O functionalities such as printf that are supported in the C library of the
SPEs call the PPE for handling.
In the early versions of Cell, the SPEs’ double precision performance was far
below that of single precision computations. In 2008, IBM released an updated Cell
version called PowerXCell 8i, where the accumulated SPE double-precision peak per-
formance has improved from 14 to 102 Gflops. This Cell variant is used, for example,
in the most recent version QS22 of IBMs dual-Cell blade server series.
Main differences between PPE and SPE: While the general-purpose PPE is intended
to run the top-level program control, I/O, and other code that cannot effectively uti-
lize the SPE architecture, the SPEs are generally faster and more power-efficient on
vectorizable and/or parallelizable computation-intensive code. Hence, the SPEs are
intended to work as programmable accelerators that take over the computationally
heavy phases of a computation to off-load the PPE.
PPU and SPU have different address sizes (64 and 32 bit, respectively) and different
instruction sets and binary formats; so PPU and SPU codes each need to be produced
with their own processor-specific toolchain (compiler, assembler, linker).∗ Also the
APIs for DMA memory transfers and for vector/SIMD operations differ (slightly) for
PPU and SPU. This heterogeneity within the system additionally contributes to the
complexity of Cell programming.
∗ A single executable file holding a Cell program can be obtained by embedding the SPU binary code and
data segments as a special data segment into the PPU binary file, see Section 8.3.1.
FIGURE 8.2: The four unidirectional rings and bus arbiter of the Element Inter-
connect Bus (EIB) network of Cell.
Several data transfers can be in progress concurrently as long as their assigned ring segments do not overlap. Actual EIB band-
widths between 78 and 197 GB/s have been observed in an experiment that varied the
positions of four communicating pairs of SPEs [9].
process (provided by the operating system) and vice versa is done by a memory man-
agement unit (MMU) in each MFC. Also, the MMU takes care of access permission
control. For these mappings, the SPE MMUs use the same translation mechanism as
the PPE MMU, that is, with page and segment tables as in the PowerPC architecture.
In particular, the 36-bit segment IDs in effective addresses are translated to the 37-bit
segment IDs in virtual addresses, using a segment look aside buffer in memory.
The PPE can query the effective addresses, for example, of the local store of an
SPE on which it has created an SPE context, using special API functions such as
spe_ls_area_get, and pass them to SPE threads, for example, in order to enable
SPE-to-SPE DMA communication. In principle, the PPE could then even directly
access these addresses, as they have been mapped into its virtual address space; how-
ever, DMA transfers to and from local store are more efficient.
The DMA controller in the MFC executes DMA commands in parallel to the SPU.
DMA transfers work asynchronously to the SPU computation: The SPU issues a
request to the MFC and continues with the next instructions; it may then poll the
status of the DMA transfer, which can be necessary to resynchronize with the issuing
SPU thread.∗
In order to keep track of all pending DMA requests issued by the SPU, each one
is given a tag, an integer between 0 and 31. DMA requests with same tag belong to
the same tag group; for instance, the multiple requests into which the transfer of a
large data block is to be split will all have the same tag and belong to the same tag
group. Synchronization of the SPU with pending DMA transfers can be in terms of
tag group masks, that is bit vectors where each bit position corresponds to one of the
32 tag groups. In this way, it is possible to selectively wait for completion of any or
all from an arbitrary subset of DMA transfers.
The MFC has two queues for buffering DMA requests to be performed: The MFC
SPU command queue for DMA requests issued by the SPU, and the MFC proxy com-
mand queue for DMA requests issued by the PPE or other devices to this SPE, which
are entered remotely by appropriate load and store instructions to memory-mapped
I/O (MMIO) registers in the MFC. Up to 16 issued DMA transfer requests of an SPE
can be queued in the MFC SPU command queue. If the MFC command queue is full
and the current SPU instruction wants to issue another DMA request, that SPU DMA
instruction will block until there is a free place in the queue again. Within a single tag
group, the MFC queueing mechanism implements not necessarily a FCFS (first come
first served) policy, as requests may “overtake” each other. For this reason, the DMA
API provides variants of mfc_put and mfc_get that enforce a partial ordering of
DMA requests in the queue.
Special atomic DMA routines are provided for supporting concurrent write accesses
to the same main memory location. These use mutex locking internally, where the unit
of locking (and thereby atomicity) are (L2-)cache line blocks in memory, that is, 128
bytes that are 128-byte aligned.
∗ Programmers familiar with the Message Passing Interface (MPI) will recognize the similarity to the non-
blocking (incomplete) message passing operations in MPI, such as MPI_Isend, and corresponding
probe and wait operations, such as MPI_Test and MPI_Wait, respectively.
8.2.5 Channels
Channels are a fast message-passing communication mechanism for sending
32-bit messages and commands between different units on Cell. For instance, a SPU
communicates with its local MFC via SPU channel instructions, which are accessible
to the programmer as functions in a special SPE Channel API, such as spu_readch(ch) (read from channel ch) and spu_writech(ch, intval) (write integer value intval to channel ch).
Channel messages are intended for signaling the occurrence of specific events, for
synchronization, for issuing DMA commands to the MFC, for querying MFC com-
mand status and command parameters, and for managing and monitoring tag groups.
For instance, the DMA operations mfc_put and mfc_get each are implemented
as 6 subsequent channel write operations that write the DMA parameters (addresses,
size, tag) and the MFC command code itself.
An MFC provides 32 channels. Each channel is for either read or write access
by the SPU. A channel is a buffer for channel messages and has a limited capacity
(number of 32-bit entries), where the number of remaining free entries is given by a
counter indicating its remaining capacity at any time. Most channels have a capacity
of one; the channel for writing DMA commands issued by the SPU has a capacity of
16 entries (see above), and the channel for reading from the SPU’s in-bound mailbox
(see below) has capacity 4.
∗ These should not be confused with the collective communication operations MPI_Scatter and
MPI_Gather in MPI, which involve a group of several processors. Here, scatter and gather apply
to a single SPE.
A channel is either blocking or nonblocking: If the SPE attempts to, for instance,
write a blocking channel whose capacity counter is 0 (i.e., the channel is full), the
SPE will stall (i.e., switch to low-power state) until at least one pending command
in that channel has been executed by the MFC to free an entry, and a corresponding
acknowledgment is received from the channel. If this behavior is undesirable, the
capacity counter should be checked by spu_readchcnt( ch ) before writing.
For example, the channel for issuing MFC DMA commands is a blocking channel.
For a non-blocking channel, writing will always proceed immediately.
Reading from and writing to channels are atomic transactions. Channel operations
are performed in program order.
PPU-initiated DMA commands to access an SPE’s local store do not use MFC
channels but write via the EIB into the MMIO registers in the MFC, which are made
globally accessible by mapping them to effective addresses.
8.2.6 Mailboxes
Mailboxes are message queues for indirect communication of 32-bit words between
an SPE and other devices: The recipient or sender on the other end of a mailbox
communication is not directly addressed or predefined, as opposed to DMA channel
communication; for instance, it may be other SPEs or the PPE who fetch a message
from an SPE’s out-bound mailbox or post one into an SPE’s in-bound mailbox.
An SPE has four in-bound mailboxes that the SPU accesses by read-channel instruc-
tions, and one out-bound mailbox and one out-bound interrupt mailbox that the SPU
accesses by write-channel instructions.
8.2.7 Signals
Signal communication is similar to mailbox communication in that both transfer
messages that consist of a single 32-bit word. In contrast to mailbox messages, signals
are often used in connection with DMA communication for notification purposes,
and especially for direct communication between different SPEs. Also, broadcasting
a single message to many recipients is possible with signals.
Signals are implemented on top of MFC channel communication, using the two
channels SPU_RdSigNotify1 and SPU_RdSigNotify2 dedicated to this ser-
vice. Signals can be received via these channels by the API functions spu_read_
signal1 and spu_read_signal2, respectively. One can probe for expected
signals by spu_stat_signal1 and spu_stat_signal2, respectively. These
functions do not block but return 1 if a signal is present in the respective channel,
and 0 otherwise.
Signals are sent by an SPE using the API function spu_sndsig, which also has
variants with fence and barrier effects, see above. The parameters are similar to DMA
get and put, but the redundant size parameter is omitted.
The PPU can send signals to an SPE either by using the API function spe_
signal_write or by accessing the memory-mapped I/O registers in that SPE’s
MFC directly.
#include <stdlib.h>
#include <stdio.h>
#include <libspe2.h>
#include <pthread.h>
#define NUM_SPU_THREADS 6
int main()
{
int i, nspus;
spe_context_ptr_t ctxs[NUM_SPU_THREADS];
pthread_t pputhreads[NUM_SPU_THREADS];
FIGURE 8.3: PPU code for the “Hello World” program for Cell. For brevity, all
error-handling code was omitted.
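A hedged sketch of how the remaining parts of such a PPU program can be written with the libspe2 API is given below; the thread function name and the exact structure are our assumptions, error handling is again left out, and the names NUM_SPU_THREADS, ctxs, and pputhreads are those declared above:

extern spe_program_handle_t spuhello;   /* embedded SPU program image */

void *ppu_pthread_function(void *arg)
{
    spe_context_ptr_t ctx = *(spe_context_ptr_t *)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* Run the SPE context; this call blocks until the SPE program exits */
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    pthread_exit(NULL);
}

/* Inside main(): create one context and one controlling PPU thread
   per SPE, then wait for the threads and clean up. */
for (i = 0; i < NUM_SPU_THREADS; i++) {
    ctxs[i] = spe_context_create(0, NULL);
    spe_program_load(ctxs[i], &spuhello);
    pthread_create(&pputhreads[i], NULL, ppu_pthread_function, &ctxs[i]);
}
for (i = 0; i < NUM_SPU_THREADS; i++) {
    pthread_join(pputhreads[i], NULL);
    spe_context_destroy(ctxs[i]);
}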
#include <stdio.h>
FIGURE 8.4: SPU code for the “Hello World” program for Cell.
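The SPU side can be as simple as the following sketch; the main signature with three 64-bit parameters is the standard SPE entry point, and the message text is ours:

#include <stdio.h>

int main(unsigned long long speid,
         unsigned long long argp,
         unsigned long long envp)
{
    (void)argp; (void)envp;           /* unused here */
    /* printf on the SPE is serviced by the PPE behind the scenes */
    printf("Hello World from SPE 0x%llx!\n", speid);
    return 0;
}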
spuhello: spuhello.c
spu-gcc -Wall -O3 -o spuhello spuhello.c
FIGURE 8.5: Makefile rules for building the “Hello World” program for Cell.
After getting back control from its SPE (8), the controlling PPU thread terminates
itself (9). The main PPU thread waits for each PPU thread to terminate (10), after
which it can safely deallocate the context objects (11).
Now, the PPU passes cb to the SPE main function as argp parameter:
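A hedged sketch of what this can look like; the control block type, its fields, the 128-byte alignment, and the arrays A and B are illustrative assumptions, and ctx is an SPE context created as before:

/* Control block in main memory; aligned for efficient DMA transfers */
typedef struct {
    unsigned long long addrA;    /* effective address of operand array A */
    unsigned long long addrB;    /* effective address of operand array B */
    unsigned int       n;        /* number of elements */
    unsigned int       pad[3];   /* pad the struct to a multiple of 16 bytes */
} control_block_t;

control_block_t cb __attribute__((aligned(128)));

/* ... inside main(), after creating the SPE context ctx ... */
cb.addrA = (unsigned long long)(unsigned long)A;
cb.addrB = (unsigned long long)(unsigned long)B;
cb.n     = n;

unsigned int entry = SPE_DEFAULT_ENTRY;
/* The address of cb becomes the argp argument of the SPE main function */
spe_context_run(ctx, &entry, 0, &cb, NULL, NULL);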
The started SPE program has allocated space for a local copy of the control block
in its local store:
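Continuing the sketch on the SPE side (again with illustrative names), the local copy can be declared as follows; alignment is required for DMA, and 128-byte alignment gives the best transfer performance:

/* Local copy of the control block in the SPE's local store */
volatile control_block_t mycb __attribute__((aligned(128)));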
The SPE program can then fetch the control block from main memory by a DMA
transfer, using the effective address of cb obtained via argp:
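A hedged sketch of the DMA transfer and the subsequent wait, reusing the tag name tag0 from the text (any tag number between 0 and 31 would do):

#include <spu_mfcio.h>

unsigned int tag0 = 0;

/* Issue the get: copy sizeof(mycb) bytes from the effective address
   passed in argp into the local copy mycb; control returns at once. */
mfc_get(&mycb, argp, sizeof(mycb), tag0, 0, 0);

/* Wait until all DMA requests with this tag have completed */
mfc_write_tag_mask(1 << tag0);
mfc_read_tag_status_all();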
Note that the mfc_get call only issues the DMA get command to the SPE’s
MFC; control returns to the SPE immediately. To make sure that the communica-
tion has properly terminated, we have to wait for it, using the tag as a reference:
mfc_read_tag_status_all waits for all DMA requests encoded by the MFC
tag mask, a bitvector, which is set up by the call to mfc_write_tag_mask and
here includes just one DMA tag to monitor, namely, tag0.
Now, the SPE has access to addresses mycb.addrA etc., and can use these in
subsequent DMA communication, such as
FIGURE 8.6: Double buffering of an operand array with two buffers a1 and a2,
containing size bytes each, in the local store.
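A hedged sketch of double buffering on the SPE follows; MAXSIZE and the routine compute are hypothetical, and size is assumed to be a multiple of 16 bytes and at most 16 KB, the maximum size of a single DMA transfer:

#include <spu_mfcio.h>

#define MAXSIZE (16*1024)            /* one DMA transfer is at most 16 KB */

volatile char a1[MAXSIZE] __attribute__((aligned(128)));
volatile char a2[MAXSIZE] __attribute__((aligned(128)));

extern void compute(volatile char *buf, int size);   /* hypothetical kernel */

void process_stream(unsigned long long ea, int nblocks, int size)
{
    volatile char *cur = a1, *next = a2, *tmp;
    unsigned int tag_cur = 0, tag_next = 1, tmptag;
    int b;

    /* Prefetch the first block */
    mfc_get(cur, ea, size, tag_cur, 0, 0);

    for (b = 0; b < nblocks; b++) {
        /* Start fetching block b+1 into the other buffer */
        if (b + 1 < nblocks)
            mfc_get(next, ea + (unsigned long long)(b + 1) * size,
                    size, tag_next, 0, 0);

        /* Wait for block b, then compute on it while the DMA for
           block b+1 is still in flight */
        mfc_write_tag_mask(1 << tag_cur);
        mfc_read_tag_status_all();
        compute(cur, size);

        /* Swap buffers and tags for the next iteration */
        tmp = cur;  cur = next;  next = tmp;
        tmptag = tag_cur;  tag_cur = tag_next;  tag_next = tmptag;
    }
}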
FIGURE 8.7: Scalar data types for the SPU with their sizes and mapping to
preferred slots in SPU registers.
#define N (1<<14)
float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
...
for (i=0; i<N; i+=4)
a[i] += b[i];
for adding 2^14/4 = 4096 scalar floats stored at addresses that are 16-byte aligned,
takes 2460 ticks of the SPU high-precision step counter on my PlayStation 3,∗ where
79.8 ticks correspond to 1 μs, thus 1 tick is 12.53 ns or 40 clock cycles. In contrast,
accessing floats not aligned at a 16-byte boundary costs extra time for shifting: The loop
for (i=1; i<N; i+=4)
    a[i] += b[i];
takes 2664 ticks. (A closer look at the assembler code reveals that the compiler generated three more instructions per loop iteration.) Worse yet, a misalignment of the two operands, here a[i] at offset 1 and b[i+1] at offset 2,
for (i=1; i<N; i+=4)
    a[i] += b[i+1];
causes an additional overhead of two more instructions per iteration, and takes 2766 ticks.
∗ These measurements on PS3 were done using SDK v3.0 with GCC version 4.1.1.
FIGURE 8.8: Vector data types for the SPU. The element data types are SPU scalar data types (see Figure 8.7). All vector data types are 16 bytes wide.
Vector data types refer to a set of 16 consecutive bytes in the SPE local store that
are interpreted as a sequence of scalar data types packed into one 128-bit quadword,
and will be placed together in a 128-bit register for computations. Figure 8.8 shows
the SPU vector data types. Figure 8.9(a) gives an example of a SIMD operation on
vector float operands, doing four floating-point additions at a time.
Continuing on our example, vectorization gives an enormous speedup: While
adding 2^14 = 16,384 consecutive floats by scalar computations
for (i=0; i<N; i++)
a[i] += b[i];
takes only 1640 ticks, which is about 6 times faster. This also demonstrates that
speedup by vectorization is not limited by the packing factor (here, 4 floats per
FIGURE 8.9: (a) SPU vector addition (spu_add) of two vector float
values op1 and op2. (b) The shuffle operation rd = spu_shuffle(ra,
rb,rc), where the selection vector rc indicates the source vector (0 for ra, 1 for
rb) and index position of every byte of the result vector rd.
vector float), but additional advantage can be drawn from the removal of the
masked insertion operations (explained further below) used to store scalars, as now
whole 128-bit words can be written in one instruction. In particular, even if we only
were interested in every fourth result, it is still 50% faster to compute them all.
The above results were obtained with the spu-gcc compiler using the -O3
optimization level. Additional performance improvements can be obtained, for exam-
ple, by selecting specific optimization flags in the spu-gcc compiler, such as
-funroll-loops to unroll loops with a number of iterations that is known at com-
pile time or on entry to the loop. Unrolling a loop by a certain number of iterations
increases code size but is often profitable, as it reduces the loop trip count and thereby
loop control overhead, and it yields more opportunities for subsequent low-level opti-
mizations such as common subexpression elimination and instruction scheduling. In
the above example, the scalar loop runs in 8807 ticks and the vectorized version in
1372 ticks after unrolling by a factor of 8. Here, spu-gcc apparently does not yet exploit the full optimization potential, as much better performance can be obtained after
manual unrolling by a factor of 8:
vector float *va = (vector float *)a;
vector float *vb = (vector float *)b;
for (i=0; i<N/4; i+=8) {
va[i] = spu_add( va[i], vb[i] );
va[i+1] = spu_add( va[i+1], vb[i+1] );
va[i+2] = spu_add( va[i+2], vb[i+2] );
va[i+3] = spu_add( va[i+3], vb[i+3] );
va[i+4] = spu_add( va[i+4], vb[i+4] );
va[i+5] = spu_add( va[i+5], vb[i+5] );
va[i+6] = spu_add( va[i+6], vb[i+6] );
va[i+7] = spu_add( va[i+7], vb[i+7] );
}
processes 32 floating-point additions per loop iteration and does the job in just 450
ticks. Moreover, applying the -funroll-loops compiler transformation to this
already manually unrolled code brings the time further down to 386 ticks, which
corresponds to slightly more than one float addition result per clock cycle.
Even with a perfectly (modulo-) scheduled loop hiding all instruction latencies
and with neglecting all loop control overhead, the SPU could do at most one local
store access per clock cycle. Here, two loads and one store instruction are required
per quadword, which have to share the SPU’s load/store unit (even though they may
overlap with vector additions after modulo scheduling, as the floating-point unit and
the load/store unit can operate in parallel). Hence, the throughput is here limited by
at most four additions per three clock cycles, i.e., (3 · 16384/4)/40 = 307 ticks
is a lower bound for the execution time. Coming closer to this limit requires more
aggressive code optimizations or even assembler-level programming.
There exist many other vector operations for the SPU. A very useful one is the
shuffle operation, see Figure 8.9b. The operation rd = spu_shuffle(ra, rb, rc) assigns the bytes of the result vector rd from arbitrary bytes of the source vectors
ra and rb, guided by the entries in the selection vector rc. Bit 5 in byte i of rc
specifies the source vector of byte i of rd, while the four least significant bits in byte
i of rc specify the byte position in that source vector that byte i of rd is copied
from. Entries > 127 in the selection vector set byte i of rd to special constants; more
details can be found in the Cell programming handbook [27]. The selection vector
can be determined by other SPU operations such as spu_mask.
A restricted form of the shuffle operation is the select operation rd = spu_sel(ra, rb, rm), where the byte positions remain the same (i.e., byte i of rd comes from byte i of either
ra or rb) and thus a single bit (bit i) in the mask vector rm is sufficient to select the
source vector for each byte position in rd. Again, such bit masks can be created
by special vector operations, such as spu_maskb, or by elementwise vector com-
pare operations such as spu_cmpeq (elementwise test for equality) or spu_cmpgt
(elementwise test for greater-than). For instance, the following code snippet
// a and b are arrays of floats
vector float ra = * (vector float *) a, // load first 4 elements of a
rb = * (vector float *) b, // load first 4 elements of b
rd;
vector unsigned int rm; // mask vector
rm = spu_cmpgt ( ra, rb );
rd = spu_sel ( ra, rb, rm );
8.4.1 Compilers
Gschwind et al. [18] describe the open-source development environment for Cell
based on the GNU toolchain, which forms the basis of the Cell SDK. This includes
a port of the GCC compiler, assembler, and linker, of the GDB debugger, but also
many other useful tools such as the embedding tool for SPE executables in PPE
executables.
XL C/C++ is a proprietary compiler by IBM that was ported to Cell by Eichen-
berger et al. [14]. XL C provides advanced features such as auto-vectorization, auto-
parallelization, and OpenMP support (see also Section 8.5.1). The XL C compiler
can be used together with the other, open-source based development tools for Cell,
such as the GDB debugger.
A recent GCC (v4.3) based compiler for Cell that includes improved support for a
shared address space, auto-vectorization, and automatic code partition management
is described by Rosen et al. [48].
OpenCL [39] is a recently developed open standard API for heterogeneous and
hybrid multicore systems. Since 2009, IBM offers an implementation of the XL com-
piler for OpenCL [29] for the most recent generation (QS22) of its Cell blade servers.
The code for the master processor (PPE) contains API calls that set up task descriptors for kernel invocations to be launched on the accelerators (SPEs). Each task
descriptor is filled with information about the code units and kernels to be invoked,
lists of input and output parameters of a work block to be transferred to the SPE, SPE
buffer sizes for input and output parameters, SPE stack size, and some other meta-
data. Next, the tasks are registered with the ALF runtime system and enqueued in a
global ALF task queue, from where they are scheduled to local work queues of the
SPEs. The default task scheduling policy of ALF can be reconfigured.
The programmer can also register dependences between tasks of different code
units, which constrain the dynamic task scheduler in the ALF runtime system.
The code units themselves (i.e., SPE code) need also some calls to the ALF API
to control the life cycle of the code unit and its kernel invocations with their param-
eters and runtime data. There are up to five different code parts (stages) in a code
unit to be marked up by the programmer: setup of the computational kernel code, setup of DMA communication lists and buffer management for the input parameters of an invocation (task), control of the actual computational kernel, setup of DMA communication lists and buffer management for output parameters, and postprocessing
in the code unit. Each stage is marked by an ALF macro.
ALF then takes care of the details of task management, splitting work blocks into
packets that fit the specified buffer sizes in local store, and multi-buffered DMA data
transfers of operand packets to and from the executing SPEs automatically.
In general, ALF code is less complex and error-prone than plain-SDK code, espe-
cially because DMA communication and multi-buffering need no longer be coded by
hand. However, the SPE code for the computational kernel still needs to be vectorized
and optimized separately. On-chip (SPE-SPE) pipelining for tasks of different code
units is not supported in ALF.
8.5.1 OpenMP
The Octopiler compiler by Eichenberger et al. [14], based on the IBM XL C/C++
compiler for the PowerPC architecture, provides an OpenMP implementation for Cell
that couples the SPEs and the PPE together into a virtual shared memory multiprocessor with a uniform OpenMP programming interface. It automatically creates PPU
and SPE code files from a single OpenMP source program, where the SPE code only
contains those code parts that should be executed in OpenMP parallel regions on
the SPEs. By embedding the SPE code into a PPE data segment linked with the PPE
code, a monolithic executable file is produced.
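As a minimal illustration (not code from [14], and with arbitrary names), a data-parallel loop of the following kind can be written in a single OpenMP source file; the body of the parallel loop is the part that would end up in the generated SPE code:
#include <omp.h>

/* Illustrative only: scale an array in parallel; the OpenMP runtime on Cell
   would execute the loop body on the SPEs, accessing a and b through the
   software cache described below. */
void scale(float *a, const float *b, float s, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = s * b[i];
}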
A major part of the local store of each SPE is set up to act as an automatically
managed software cache to provide an abstraction of a global shared memory, which
has its home residence in main memory. The software cache functionality is pro-
vided by a library for shared memory accesses that can be linked with the SPE code.
Accesses that cannot be handled by the cache (i.e., cache misses) lead to DMA oper-
ations that access the corresponding locations in main memory. There exist two vari-
ants of the software cache: one variant that only supports single-threaded code, where
any cache line is written back completely to memory when evicted at a cache miss,
and a second variant that supports multithreaded shared memory code (as in OpenMP
applications). The latter variant keeps track of modified cache lines such that, either
on eviction or at write-backs enforced by the consistency mechanism (e.g., at
OpenMP flush operations), only the modified bytes are written back by DMA
operations.
The software cache implementation uses SIMD instructions for efficient lookup
of shared memory addresses in the local store area dedicated to holding software
cache lines. By default, the Octopiler instruments all accesses to shared memory (in
code intended for SPE execution) to go via the software cache’s lookup function. The
resulting code is usually not efficient yet. In order to save some lookup overhead for
accesses that are statically expected to yield a cache miss, the compiler can choose to
bypass the cache lookup and access the value directly from its home location in main
memory. Prefetching and coalescing of memory accesses are other optimizations to
improve the software cache performance. Such optimizations have been shown to
boost the speedup achievable with the software cache considerably [14].
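To make the mechanism concrete, the following is a minimal sketch of a direct-mapped software cache lookup of the general kind described; it is not the Octopiler's actual implementation, and the line size, the associativity, and the DMA helper calls (shown only as comments) are assumptions.
#define SWC_LINES     128
#define SWC_LINE_SIZE 128                      /* bytes per software cache line */

typedef struct {
    unsigned long long tag;                    /* main-memory line number; initialize to an invalid value */
    int dirty;
    char data[SWC_LINE_SIZE];
} swc_line_t;

static swc_line_t swc[SWC_LINES];

/* Return a local-store pointer for the effective (main-memory) address ea. */
char *swc_lookup(unsigned long long ea) {
    unsigned long long line = ea / SWC_LINE_SIZE;
    unsigned idx = (unsigned)(line % SWC_LINES);
    if (swc[idx].tag != line) {                /* miss: evict and refill via DMA */
        /* if (swc[idx].dirty) dma_put(...);      write back modified data (placeholder) */
        /* dma_get(swc[idx].data, line * SWC_LINE_SIZE, SWC_LINE_SIZE);   (placeholder) */
        swc[idx].tag = line;
        swc[idx].dirty = 0;
    }
    return &swc[idx].data[ea % SWC_LINE_SIZE];
}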
The Octopiler performs automatic vectorization of loops to exploit SIMD instruc-
tions, and performs optimizations of the alignment of scalar values and of entire data
streams to improve performance.
Large SPE executables constitute a problem for Cell because of the very limited
size of the local store (or, respectively, the part of the local store that is not reserved for,
e.g., the software cache and other data). The Octopiler supports automatic splitting
of larger SPE programs into partitions and provides a runtime partition manager that
(re)loads partitions into the local store when they are needed. Only a few partitions can be held in the local store at a time. Cross-partition calls and branches are therefore replaced by stubs that invoke the runtime partition manager to check whether the target partition is currently resident in the local store and, if not, load it before the SPE program can continue execution. If necessary, another loaded partition
must be evicted from the local store instead.
With selected benchmarks suitable to the Cell platform, average speedups of 1.3
for SPE-specific optimizations (such as branch optimizations and instruction schedul-
ing), 9.9 for auto-vectorization, and 6.8 for exploiting thread-level parallelism in
OpenMP programs were obtained with Octopiler [14].
DBDB [42] and Cellgen [51] are implementations of OpenMP and of an OpenMP subset, respectively, that do not rely on a software cache for the shared memory abstraction but instead use a combination of compiler analysis and runtime management to manage data locality.
Extensions of the OpenMP (3.0) task model intended to better address heterogeneous multicore architectures such as Cell have been proposed [5]. For instance,
device-specific (e.g., SPE) code for an OpenMP (3.0) task can be integrated as
an alternative implementation variant to the portable default (i.e., CPU-based)
implementation.
8.5.2 CellSs
Cell superscalar (CellSs) [6] provides a convenient way of exploiting task-level
parallelism in annotated C programs automatically at runtime. Starting from an ordi-
nary C program that could run on the PPU only, the programmer marks certain com-
putationally intensive functions as candidates for execution on SPEs and declares
their interface with parameter types, sizes, and directions. CellSs builds, at runtime, a task graph of tasks that could run on SPEs in parallel with the PPU main thread, keeps track of data dependences, and dynamically schedules data-ready tasks to available SPEs.
CellSs is implemented by a source-to-source compiler and a runtime system. The
source-to-source compiler generates SPE code for these functions, which is further
processed by the SPE compiler tool chain, as well as special stub code for all (PPU)
calls to such functions. When, during program execution, the PPU control hits such
a stub, the runtime system creates a new task node in the task graph to represent the
call with its parameters, detects data dependences of the new task’s input operands
from previously created (and yet unfinished) tasks, and represents these as edges in
its task graph structure. The stub call is not blocking, i.e., PPU control continues
in parallel to this. The runtime system schedules the data-ready tasks dynamically
to SPUs. Dynamic task graph clustering for reducing synchronization overhead by
increased granularity, parameter renaming to remove write-after-read and write-after-
write dependences, locality-preserving mapping heuristics, and task stealing for bet-
ter load balancing have been added as optimizations to the basic mechanism. Overall,
this method is similar to how a superscalar processor dynamically tracks dependences among instructions and schedules them for out-of-order execution, which explains the name.
8.5.3 Sequoia
Sequoia [16] is a programming language for Cell and other parallel platforms that
takes a slightly more declarative approach than most other programming environ-
ments discussed here. For instance, Sequoia explicitly exposes the main structure
and cost of communication between the various memory modules in a system in the
form of a tree representation of the memory hierarchy.
Sequoia provides tasks, which have a private address space; communication of data between tasks (and thus possibly between cores) is restricted to invocations of subtasks by parent tasks, i.e., operand transfer at remote procedure calls.
Sequoia offers three skeleton-like primitives to create subtasks, namely, a parallel
multidimensional forall loop, a sequential multidimensional loop, and mapreduce, a
combination of a parallel loop and subsequent tree-like parallel reduction over at least
one operand.
Sequoia defines special operators to split and concatenate arrays, which support,
for instance, parallel divide-and-conquer computations on arrays such as recursive
matrix-matrix multiplication or FFT.
The programmer may define multiple algorithmic variants for solving the same
task. Moreover, these can be specialized further to expect their operands to be present
in certain memory modules. Selecting among these variants (if necessary, including
moving data between memory modules) is a way to optimize execution time. Also,
certain variables (e.g., those that control how to split a problem into subproblems) can
be defined as tunable parameters so that the computation can be optimized for a given
target machine. The selected values for these parameters are specified externally for
each algorithmic variant in a machine-specific mapping specification.
An implementation of Sequoia, developed at Stanford University, is available for the Cell SDK 2.1 (2007).
8.5.4 RapidMind
RapidMind Multicore Development Platform by RapidMind Inc. [44] (which was
bought by Intel in 2009) is a stream programming language that provides a data par-
allel programming extension for C++. RapidMind defines special data types for val-
ues, arrays, and functions. Data parallelism is specified by applying a componentized
scalar function elementwise to its input arrays and values. If the function is so simple
that its execution time can be assumed to be independent of the input data, the data
parallelism can be scheduled statically across the SPEs. For more complicated functions where execution time may vary, dynamic load balancing can be added. Also, the data parallel function specification enables the automatic selection of SPE SIMD instructions and automatic application of multi-buffering to the operand arrays.
From the same RapidMind source program, code for Cell as well as for GPUs and other data parallel platforms can be generated without modification, thanks to the simplicity and universality of the data parallel programming model.
Special data types for parallel iteration variables and parallel reduction variables allow the programmer to convey additional information about the independence of computations to the compiler.
8.5.5 Sieve C++
Sieve C++ [47] extends C++ with sieve blocks, in which side effects on variables declared outside the block are delayed until the end of the block, as in the following example:
int main()
{ ...
  int x = 0; // x is global to the sieve block
  sieve {
    int y = 1; // y is local to the sieve block
    x = y + 1; // writes 2 to x at the end
    y = x;     // sets y to 0 as the new value of x is not visible yet
  } // here the write to x takes effect
  ...
}
Sieve C++ has been implemented by Codeplay [47] for Cell, for x86-based standard multicore systems, and for GPUs. The Sieve C++ compiler analyzes the remaining data dependences (on block-local variables) in each sieve block and splits the sieve body into independent tasks for parallel execution.
8.5.6 Offload C++
Offload C++ [11] extends C++ with offload scopes (blocks and functions) whose code is compiled for execution on the SPEs; a function called from an offload scope is compiled in two versions:∗
void foo( void ) {...} // compiler creates one PPU and one SPU version
∗ There are a few more issues and exceptions from this rule, e.g., regarding function pointers and virtual functions, but we cannot go into details here and refer to [11] for further description.
Variables defined inside an offload scope are by default allocated in the executing
SPE’s local store, while those defined in the outer scope (PPU code) reside in main
memory. Accesses from an offload scope to variables residing in main memory result
in DMA transfer of the value to be read or written; the implementation [11] uses a
software cache to reduce the number of DMA accesses.
Hence, when declaring pointer variables in an offload scope, the memory type of their pointee must be declared as well, by adding the __outer qualifier for pointers to main memory (unless the compiler can deduce this information automatically); see the example in Figure 8.11. The pointee's memory type thus becomes part of the
pointer type and will be checked statically at assignments etc. We refer to [11] for
further details.
By providing preprocessor macros that wrap all new constructs in Offload C++
and that could alternatively expand to nothing, portability to other compilers can be
achieved. The option of SPU-specific restructuring of offload code makes it possible to trade higher performance on Cell for reduced portability. The portability problem can be
avoided, though, by conditionally overloading a portable routine with an SPU-specific
version that is masked out when compiling for a non-Cell target. Templates can be
used to combine portable data access with special code for optimized DMA transfer
in offload functions [12].
Recent case studies of using Offload C++ for Cell include parallelizing an image
filtering application [13] and porting a seismic wave propagation simulation code that
was previously parallelized for Intel Threading Building Blocks [12].
8.5.7 NestStep
NestStep [35,36] is a C-based partitioned global address space language for exe-
cuting bulk-synchronous parallel (BSP) programs with a shared memory abstraction
on distributed memory systems. Similar to UPC (Unified Parallel C [15]) and its
predecessors, NestStep features, for example, shared variables, blockwise and cycli-
cally distributed shared arrays, and dataparallel iterators. In contrast to UPC, Nest-
Step enforces bulk-synchronous parallel execution and has a stricter, deterministic but
programmable memory consistency model; in short, all modified copies of shared
variables or array elements are combined (e.g., by priority commit or by a global
reduction) at the end of a BSP superstep to restore consistency, in a way that can be
programmed individually for each variable. This combine mechanism is integrated
in the barrier synchronization between subsequent supersteps.
As an example, Figure 8.12 shows a superstep with a dot product computation in
NestStep, where each SPE calculates a local dot product on its owned elements of A
and B and writes its local result to its local copy of the replicated shared variable s. In
the subsequent (implicit) communication phase of the superstep, these written copies
are combined by summing them up (<+>), and the sum is committed automatically
to each copy of s, in order to restore the invariant that all copies of shared variables
have the same value on entry and exit of a superstep.
Both NestStep’s step statements and Sieve C++’s sieve blocks delay the effect
of writes to block-global shared variables and thereby guarantee absence of block-
local data dependences on shared variables. The difference is that parallelism in Nest-
Step is explicit (SPMD) while it is implicit (automatic parallelization) in Sieve C++.
Also, write accesses are committed locally in program order in NestStep.
NestStep was originally developed and implemented for cluster systems on top of
MPI [35]. More recently, the NestStep runtime system was ported to Cell to coor-
dinate BSP computations on a set of SPEs, where the data for each SPE, including
its owned partitions and mirrored elements of distributed shared arrays, are kept in
a privatized area of main memory, and the PPU is used for SPE coordination and
synchronization [32].
NestStep for Cell only addresses thread-level parallelism across SPEs and provides
a global shared address space abstraction, while support for SIMDization and multi-
buffered DMA must be provided separately, either as hand-written SPE code, or in a
platform-independent way by using BlockLib skeletons (Section 8.5.8).
8.5.8 BlockLib
BlockLib [2] is a skeleton programming library that aims to make Cell program-
ming simpler by encapsulating memory management, doubly-buffered DMA com-
munication, SIMD optimization and parallelization in generic functions, so-called
skeleton functions, that are parameterized in problem-specific sequential code. Block-
Lib provides skeleton functions for basic computation patterns. Two of these are
map and reduce, which are well known. BlockLib also implements two variants
of map and reduce: a combined mapreduce, where a reduce operation is applied
immediately to the result of a map, and a map-with-overlap for calculations on
array elements that also access nearby elements. The library consists of compiled
code and macros and requires no extra tools besides the C preprocessor and com-
piler.
BlockLib is implemented on top of the NestStep run-time system for Cell [32],
from which it inherits the data structures for distributed shared arrays and some syn-
chronization infrastructure. However, it could also be used stand-alone with minor
modifications. By default, a call to a BlockLib skeleton function constitutes a Nest-
Step superstep on its own, but the map skeleton can also be run as part of a larger
superstep.
The parameterization in problem-specific user code can be done in different ways
that, on Cell, differ very much in performance and ease of use. For instance, generic
functions with fine-grained parameter functions (i.e., one call per element computa-
tion) are convenient but incur too much overhead on the SPEs and are not amenable
to SIMD code generation either. Hence, BlockLib expects user functions with larger
granularity here. BlockLib also provides the user with the power of SIMD optimiza-
tion without the drawbacks of doing it by hand. A simple function definition lan-
guage, implemented as C preprocessor macros, makes it possible to generate SIMD-optimized inner loops. This macro language provides primitives for basic and advanced math
operations. It is easy to extend by adding definitions to a header file. Many of these
macros have a close mapping to one or a few Cell SIMD instructions, and some are
mapped to functions in the IBM SIMD Math library [25].
Using BlockLib does not tie the user code to Cell. The same interface could be
implemented by another library for any other NestStep platform in an efficient way,
with or without SIMD optimization.
BlockLib synchronization is based on Cell signals. A signal is implemented with a special register in each SPE's MFC. An SPE sends a signal to another SPE by writing to the other SPE's signal register. In BlockLib, the signal register is used in or-mode, which means that everything written to the register is bitwise or-ed together. This way, multiple SPEs can signal the same target SPE without overwriting each other's signals. BlockLib uses one of the two signal registers per SPE. As a signal register is 32 bits wide, each SPE can be assigned its own bit, with exclusive write access, for up to four different kinds of signals in every other SPE's signal register. BlockLib internally uses three kinds of signals (barrier synchronization, message available, and message acknowledge).
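One bit assignment consistent with this scheme (the concrete layout used by BlockLib is not given here, so this is an assumption) would be:
/* With at most 8 SPEs and a 32-bit signal register used in OR mode,
   sender SPE s (0..7) can own bits 4*s .. 4*s+3, one per signal kind (0..3),
   in every other SPE's signal register. */
unsigned int signal_bit(int sender_spe, int kind) {
    return 1u << (4 * sender_spe + kind);
}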
A skeleton library for Cell based on C++ templates, called Skell BE, was recently
proposed by Saidani et al. [49].
• MCF [7], the MultiCore Framework by Mercury Computer Systems Inc. The
MCF programming model is restricted to data parallel computations on
n-dimensional matrices.
• MPI microtask: With its eight SPEs, each having its own local memory module and communicating by DMA transfers, Cell can be regarded as a distributed memory message passing system. IBM's MPI microtask [45] is an implementation of the widely used Message Passing Interface (MPI) that, in principle, allows Cell programmers to write ordinary MPI programs, given that these are broken down into microtasks, small-footprint MPI processes that each fit into the local store completely (including code and all data). The decomposition is done by some additional API functions that allow groups of microtasks to be created dynamically and provide communication between them.
The stream merging in this second phase still requires O(n log(n/m)) memory
accesses to sort n elements from presorted blocks of size m < n, and the reported
speedups for phase 2 are small. For instance, in AAsort [30], mergesort with 4-to-1-
mergers is used in phase 2, where the mergers use bitonic merge locally. The tree of
mergers is processed level-wise in rounds. As each SPE reads from main memory and
writes to main memory (dancehall organization), all n words are transferred from and
to main memory in each round. Speedup still is limited, as the main memory interface
bandwidth is the performance bottleneck.
Keller and Kessler [34] propose on-chip pipelining, a technique to trade a reduced
number of memory accesses for an increased volume of SPE–SPE forwarding via
the EIB, which has considerably higher bandwidth. Instead of processing the merger
tree level-wise to and from main memory, all mergers at all levels run concurrently as
dynamically scheduled tasks on the SPEs and forward merged packets immediately to
their successors in the merger tree. The merger tree root is the bottleneck in merging
and gets an SPE of its own, while the remaining merger tasks are mapped to the
remaining SPEs to maximize throughput and minimize the maximum number (thus
maximize the size) of buffers used per SPE, which translates into improved DMA
efficiency and reduced scheduling overhead. Optimal or optimized mappings can be
computed automatically with different algorithms [34,37]. An example mapping for
a 32-to-1 merger tree is shown in Figure 8.13.
By on-chip pipelining with suitable mappings, the global merging phase, which dominates the time of parallel mergesort for reasonably large (8M or larger) input data sets, achieves a speedup over CellSort of up to 70% on an IBM QS-20 dual-Cell blade server and of up to 143% on a PS3. Implementation and evaluation
details are described by Hultén [20,21].
The on-chip pipelining technique can also be applied to other memory-intensive
dataparallel computations [38].
Image and signal processing: The IBM SDK contains an optimized library for fast
Fourier transform (FFT) on Cell in an extension package. The application of the Spiral
autotuning approach to optimized FFT program generation for Cell is described by
Chellappa et al. [8].
H.264 video coding/decoding on Cell has been described, for example, by Wu et al.
[60,61]. The use of Cell for real-time video quality assessment was presented by Papp
et al. [46].
In a case study, Varbanescu et al. [57] evaluate different approaches for mapping
an application for multimedia analysis and retrieval to Cell, and derive general par-
allelization guidelines, which could extend the process described in Section 8.3.5.
FIGURE 8.13: An example mapping of a 32-to-1 merger tree (merger nodes 1–31) onto five SPEs.
• Instruction-level parallelism
• SIMD parallelism
While tools and high-level programming environments can help with some of
these, writing high-performance code orchestrating all of these nicely together is still
mainly the programmer’s task.
Cell is a fascinating processor. In fact, it has created its own class of moderately
heterogeneous and general-purpose multicore processors that is clearly set apart from
standard multicore processor designs, but also differs from even more domain-specific
chip multiprocessor architectures.
With general-purpose GPUs such as NVIDIA’s CUDA-based devices described in
the previous chapter of this book, Cell shares several architectural features, such as
the concept of programmable accelerators that are optimized for a specific type of
computation (fixed-width SIMD for Cell SPEs, massively data-parallel for GPUs),
and a (mostly) software-managed multilevel memory hierarchy. Both have a more
complicated programming model but also a much higher performance per watt poten-
tial than comparable standard multicore processors. Both have much higher peak
bandwidth for accesses to on-chip than off-chip memory. Both architectures per-
form better at regular, bulk access to consecutive memory locations compared to ran-
dom access patterns; while bulk access is explicit (by DMA) in Cell, it is implicit
in GPUs (by coalescing in hardware in CUDA GPUs). Also, both platforms recently
added hardware support for high-throughput double-precision arithmetic, targeting
the high-performance computing domain. However, there are also significant differ-
ences: While Cell's SPEs are powerful (at least in terms of year-2006 technology when
Cell was new) general-purpose units with moderate-width (128 bit) SIMD support
and a comparably large amount of on-chip local memory, GPUs’ streaming proces-
sors are much simpler scalar devices that, until recently, did not support function
calls in kernel code (unless statically inlined by the compiler); the latter restriction
was relaxed in NVIDIA’s Fermi GPU generation introduced in 2010. With the PPU,
Cell has a general-purpose host processor directly on chip that runs a full-fledged
standard operating system, while the entire GPU is just a complement to an exter-
nal CPU. Both PPU and SPUs can initiate DMA operations to access any off-chip
or on-chip memory location (as far as permitted within the current process), while
data transfer between GPU memory and main memory can, up to now, only be ini-
tiated by the host. In particular, SPEs can forward data to each other, which is not
(yet?) possible on GPUs, making advanced techniques such as on-chip pipelining
not applicable there. Task parallelism is naturally supported by Cell’s MIMD paral-
lelism across its eight SPEs (each of which could even run a different binary code).
Until recently, CUDA GPUs could only execute a single, massively dataparallel ker-
nel across the entire GPU at a time; this has now changed with Fermi, which allows
for limited task-level parallelism. By leveraging massive hardware parallelism across
hundreds of simple streaming processor cores, GPUs still can afford a lower clock
frequency. GPU architectures are designed for high throughput by massive hardware
multithreading, which automatically hides long memory access latencies, while SPEs
require manual implementation of multi-buffering or other latency hiding techniques
in software to achieve high throughput. To keep pace with the recent development of
GPUs but also of standard multicore architectures, a successor generation in the Cell
processor family would have been urgently needed.
In late 2009, IBM decided to cancel further development of the Cell architecture
line (earlier there had been some plans for a 2-PPU, 32-SPU version), but will con-
tinue to manufacture Cell chips for the PS3 for the foreseeable future. Hence, PS3 and
existing installations such as IBM blade servers and the RoadRunner supercomputer
will be in use for quite some time forward. Also, IBM may reuse parts of the Cell
architecture in some form [19].
In any case, Cell teaches us important general lessons about heterogeneous multi-
core systems, and several important concepts of Cell programming will also be with
us for the foreseeable future of power-efficient high-performance multicore program-
ming: A combination and coordination of multiple levels of parallelism will be nec-
essary. Multilevel memory hierarchies, whether explicit as in Cell or implicit as in
cache-based systems, will require program optimizations for increased locality and
overlapping of memory access latency with computation. Trade-offs between locality
of memory accesses and load balancing have to be made. Explicit data transfer
between local memory units over an on-chip network may become an issue even for
future standard processor architectures, as cache-based SMP architectures may not
scale up to very many cores. Instruction-level parallelism remains important but will
fortunately be handled mostly by the compiler. Effective usage of SIMD operations is
mandatory, and memory parallelism needs to be exploited. Finally, a certain amount
of heterogeneity needs to be bridged. Managing this complexity while achieving high
performance will require a joint effort by the programmer, language constructs, the
compiler, the operating system, and other software tools.
Acknowledgments
The research of the author of this chapter was partly funded by EU FP7 (project
PEPPHER, www.peppher.eu, grant 248481), by Vetenskapsrådet, SSF, Vinnova,
CUGS, and Linköping University in Sweden.
The author thanks Duc Vianney from IBM, Ana Varbanescu from TU Delft, Jörg
Keller from FernUniversität in Hagen, Dake Liu and his group at Linköping Univer-
sity, and Markus Schu from Micronas in München, for interesting discussions about
Cell.
On-chip-pipelined mergesort for Cell was implemented by Rikard Hultén in a master's thesis project at Linköping University. BlockLib was mainly developed by Markus Ålind and the NestStep runtime system for Cell by Daniel Johansson in earlier master's thesis projects at Linköping University. We thank Erling Weibust, Carl Tengwall,
Björn Sjökvist, Nils Smeds, and Niklas Dahl from IBM Sweden for commenting on
this work and for letting us use their IBM QS-20 blade server. We thank Inge Gutheil
and her colleagues at Jülich Supercomputing Centre for giving us access to their Cell
cluster JUICE.
The author thanks Mattias Eriksson and Erik Hansson from Linköping University
for discussions, for proof-reading and commenting on this chapter. Also, many thanks
to George Russell and Uwe Dolinsky from Codeplay and to the anonymous reviewers
for their comments.
However, any possibly remaining errors would be solely the author’s own fault.
Trademarks
In the following list, we try our best to acknowledge trademarks and other protected company and product names that occur in this chapter:
Altivec is a trademark of Freescale Semiconductor, Inc.
AMD and AMD Opteron are trademarks of Advanced Micro Devices Inc. (AMD).
Cell Broadband Engine, Cell/B.E., PlayStation, and PlayStation 3 are trademarks
of Sony Computer Entertainment, Inc.
CUDA and NVIDIA are trademarks of NVIDIA.
IBM, BladeCenter, PowerPC, POWER, Power PC, PowerPC Architecture, and
PowerXCell are trademarks of International Business Machines Corporation.
Intel and Itanium2 are trademarks of Intel Corporation.
Linux is a trademark of Linus Torvalds.
Offload is a trademark of Codeplay Software Ltd.
References
1. S.G. Akl. Parallel Sorting Algorithms. Academic Press, San Diego, CA, 1985.
2. M. Ålind, M. Eriksson, and C. Kessler. Blocklib: A skeleton library for Cell
Broadband Engine. In Proceedings of the ACM International Workshop on Mul-
ticore Software Engineering (IWMSE-2008) at ICSE-2008, Leipzig, Germany,
May 2008.
3. C. Augonnet, S. Thibault, and R. Namyst. Automatic calibration of performance
models on heterogeneous multicore architectures. In Proceedings of the Interna-
tional Euro-Par Workshops 2009, HPPC-2009, Springer LNCS 6043, Delft, the
Netherlands, pp. 56–65, 2010.
4. C. Augonnet, S. Thibault, R. Namyst, and M. Nijhuis. Exploiting the Cell/BE
architecture with the StarPU unified runtime system. In Proceedings of the Ninth
International Workshop on Embedded Computer Systems: Architectures, Mod-
eling, and Simulation (SAMOS’09), Springer LNCS 5657, Samos, Greece, pp.
329–339, 2009.
5. E. Ayguade, R.M. Badia, D. Cabrera, A. Duran, M. Gonzales, F. Igual,
D. Jimenez, J. Labarta, X. Martorell, R. Mayo, J. Perez, and E. Quintana-Orti.
A proposal to extend the OpenMP tasking model for heterogeneous architec-
tures. In Proceedings of the Fifth International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism (IWOMP'09), Springer LNCS 5568, Dresden, Germany, pp. 154–167, 2009.
6. P. Bellens, J.M. Perez, R.M. Badia, and J. Labarta. CellSs: A programming model
for the Cell BE architecture. In Proceedings of the 2006 ACM/IEEE Supercom-
puting Conference (SC’06), Tampa, FL, November 2006.
7. B. Bouzas, R. Cooper, J. Greene, M. Pepe, and M.J. Prelle. Multicore framework:
An API for programming heterogeneous multicore processors. Technical report,
Mercury Computer Systems Inc., Chelmsford, MA, 2007.
8. S. Chellappa, F. Franchetti, and M. Püschel. FFT program generation for the
Cell BE. In Proceedings of the International Workshop on State-of-the-Art in
Scientific and Parallel Computing (PARA), Trondheim, Norway, June 2008.
9. T. Chen, R. Raghavan, J.N. Dale, and E. Iwata. Cell broadband engine archi-
tecture and its first implementation—A performance view. IBM J. Res. Dev.,
51(5):559–572, September 2007.
10. T. Chen, Z. Sura, K. O’Brien, and K. O’Brien. Optimizing the use of static buffers
for DMA on a CELL chip. In Proceedings of the International Workshop on
Languages and Compilers for Parallel Computers (LCPC’06), New Orleans,
LA, Springer LNCS 4382, pp. 314–329, 2006.
11. P. Cooper, U. Dolinsky, A.F. Donaldson, A. Richards, C. Riley, and G. Russell.
Offload—Automating code migration to heterogeneous multicore systems.
In Proceedings of the International HiPEAC-2010 Conference, Pisa, Italy,
Springer LNCS 5952, pp. 337–352, 2010.
12. U. Dolinsky, A. Richards, G. Russell, and C. Riley. Offloading parallel code on
heterogeneous multicores: A case study using Intel threading building blocks
on cell. In Proceedings of the Many-Core and Reconfigurable Supercomputing
Conference (MRSC-2010), Rome, Italy, www.mrsc2010.eu, March 2010.
13. A.F. Donaldson, U. Dolinsky, A. Richards, and G. Russell. Automatic offload-
ing of C++ for the Cell BE processor: A case study using Offload. In Pro-
ceedings of the International Workshop on Multi-Core Computing Systems
(MuCoCoS’10), Krakow, Poland, pp. 901–906. IEEE Computer Society,
February 2010.
14. A.E. Eichenberger, J.K. O’Brien, K.M. O’Brien, P. Wu, T. Chen, P.H. Oden,
D.A. Prener, J.C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M.K.
Gschwind, R. Archambault, Y. Gao, and R. Koo. Using advanced compiler tech-
nology to exploit the performance of the Cell Broadband Engine™ architecture.
IBM Syst. J., 45(1), 2006.
15. T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick. UPC Distributed Shared
Memory Programming. Wiley-Interscience, Hoboken, NJ, 2005.
16. K. Fatahalian, T.J. Knight, M. Houston, M. Erez, D.R. Horn, L. Leem,
J.Y. Park, M. Ren, A. Aiken, W.J. Dally, and P. Hanrahan. Sequoia: Program-
ming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE Conference
on Supercomputing, ACM, Tampa, FL, 2006.
17. B. Gedik, R. Bordawekar, and P.S. Yu. Cellsort: High performance sorting on
the cell processor. In Proceedings of the 33rd International Conference on Very
Large Data Bases, Vienna, Austria, pp. 1286–1297, 2007.
18. M. Gschwind, D. Erb, S. Manning, and M. Nutter. An open source environment
for Cell Broadband Engine system software. Computer, 40(6):37–47, 2007.
19. Heise online. SC09: IBM lässt Cell-Prozessor auslaufen, Hannover, Germany,
http://www.heise.de/newsticker/meldung/SC09-IBM-laesst-Cell-Prozessor-
auslaufen-864497.html, November 2009.
20. R. Hultén. Optimized on-chip software pipelining on the Cell BE proces-
sor. Master's thesis, LIU-IDA/LITH-EX-A–10/015–SE, Linköping University,
Sweden, 2010.
34. J. Keller and C.W. Kessler. Optimized pipelined parallel merge sort on the Cell
BE. In Proceedings of the Second Workshop on Highly Parallel Processing on a
Chip (HPPC-2008), Gran Canaria, Spain, August 2008. In E. Luque et al. (Eds.):
Euro-Par 2008 Workshops, Springer LNCS 5415, pp. 127–136, 2009.
35. C. Kessler. Managing distributed shared arrays in a bulk-synchronous parallel
environment. Concurr. Comp. Pract. Exp., 16:133–153, 2004.
36. C.W. Kessler. NestStep: Nested parallelism and virtual shared memory for the
BSP model. J. Supercomput., 17:245–262, 2000.
37. C.W. Kessler and J. Keller. Optimized on-chip pipelining of memory-intensive
computations on the Cell BE. In Proceedings of the First Swedish Workshop on Multicore Computing (MCC-2008), Ronneby, Sweden, 2008. Also in ACM Computer Architecture News, 36(5):36–45, 2009.
38. C.W. Kessler and J. Keller. Optimized mapping of pipelined task graphs on the
Cell BE. In Proceedings of the 14th International Workshop on Compilers for
Parallel Computing (CPC-2009), Zürich, Switzerland, January 2009.
39. Khronos Group. The OpenCL specification, V 1.0. http://www.khronos.org/
opencl, 2009.
40. M. Kistler, M. Perrone, and F. Petrini. Cell multiprocessor communication net-
work: Built for speed. IEEE Micro, 26:10–23, 2006.
41. D. Kunzman. Charm++ on the Cell processor. Master’s thesis, Depart-
ment of Computer Science, University of Illinois, Urbana, IL, 2006.
http://charm.cs.illinois.edu/newpapers/06-16/paper.pdf
42. T. Liu, H. Lin, T. Chen, J.K. O’Brien, and L. Shao. DBDB: Optimizing DMA
transfer for the Cell BE architecture. In Proceedings of the International Confer-
ence on Supercomputing, New York, pp. 36–45, June 2009. ACM.
43. A. Lokhmotov, A. Mycroft, and A. Richards. Delayed side-effects ease multi-
core programming. In Proceedings of the Euro-Par-2007 Conference, Rennes,
France, Springer LNCS 4641, pp. 641–650, 2007.
44. M. Monteyne. Rapidmind multi-core development platform. White paper,
www.rapidmind.com, February 2008.
45. M. Ohara, H. Inoue, Y. Sohda, H. Komatsu, and T. Nakatani. MPI Microtask for programming the CELL Broadband Engine™ processor. IBM Syst. J., 45(1):85–
102, 2006.
46. I. Papp, N. Lukic, Z. Marceta, N. Teslic, and M. Schu. Real-time video quality
assessment platform. In Proceedings of the International Conference on Consumer Electronics 2009 (ICCE '09), Las Vegas, NV, pp. 1–2, January 2009.
47. A. Richards. The Codeplay Sieve C++ parallel programming system. White
paper, Codeplay Software Ltd., Edinburgh, UK, 2006.
58. S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The poten-
tial of the Cell processor for scientific computing. In CF’06: Proceedings of the
Third Conference on Computing Frontiers, Ischia, Italy, pp. 9–20, 2006. New
York: ACM.
59. S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. Scientific
computing kernels on the Cell processor. Int. J. Parallel Prog., 35(3):263–298,
2007.
60. D. Wu, Y.-H. Li, J. Eilert, and D. Liu. Real-time space-time adaptive processing
on the STI CELL multiprocessor. In Proceedings of the Fourth European Radar
Conference, Munich, Germany, October 2007.
61. D. Wu, B. Lim, J. Eilert, and D. Liu. Parallelization of high-performance video
encoding on a single-chip multiprocessor. In Proceedings of the IEEE Inter-
national Conference on Signal Processing and Communications, Dubai, UAE,
November 2007.
Part IV
Emerging Technologies
Chapter 9
Automatic Extraction of Parallelism from
Sequential Code
9.1 Introduction
9.1.1 Background
Previous chapters have discussed many of the tools available to a programmer look-
ing to parallelize code by hand. While valuable, each requires considerable effort by
the programmer beyond that which is required for a typical single-threaded program.
As more power is placed in the hands of the programmer, programmers must concern
themselves with additional issues only posed by parallel programs—race conditions,
deadlock, livelock, and more [1–3]. Furthermore, sequential legacy code presents its own problem: because such code was not designed with parallelism in mind, the programmer may encounter great trouble in transforming it into a parallel form.
Automatic parallelization offers another option to programmers interested in par-
allelizing their code. Automatic parallelism extraction techniques do not suffer the
same limitations as manual parallelization. Indeed, through analysis, a compiler can
often find parallelism in sequential code that is not obvious even to a skilled pro-
grammer. However, automatically parallelizing code poses a significant challenge of
its own: finding means to extract and exploit parallelism in the code. The compiler
must determine first what transformations can be done, and throughout this chapter,
we will see a number of the possible techniques in the compiler’s transformation tool-
box. Simply put, this means the compiler must both find an exploitable region of code
through analysis and transform the code into a parallel form. Analyses and transfor-
mations must work together to find and then exploit parallelization opportunities.
of the speculation. Discussed in more detail later, alias, control, and value speculation
have all been proposed to aid parallelization [4].
1: while (node)
2: {ncost = calc(node);
3: cost += ncost;
4: node = node->next;}
5: ...
parallel?” The answer here is, at first glance, categorically no—each iteration depends
on knowing the node operated on in the previous iteration so that it can advance to
the next node. A parallelizing compiler would infer this property by formalizing the
concept of a data dependence.
When we further consider the code from Figure 9.1, we notice another type of
dependence. The loop will only continue executing so long as the node is not NULL.
Since the execution of the loop body (statements 2, 3, 4) is dependent on whether the
condition tested in the while statement is true, these three statements are said to be
control dependent on statement 1.
Sections 9.2.2 and 9.2.3 formalize the concepts of data dependence and control
dependence. Section 9.2.4 then describes a program structure that represents both
kinds of dependences in a uniform way that is useful for parallelization.
FIGURE 9.2: Control flow graph for the program in Figure 9.1.
• Both A and B either write or read the same program variable, or in general, the
same memory location.
The data dependence graph for the program in Figure 9.1 is shown in Figure 9.3.
The nodes are the same as the nodes in the control flow graph of Figure 9.2 (state-
ment 5 is presumed to have no data dependences for our purposes and thus omitted),
while the edges represent the data dependence relation and are of three kinds: f edges
representing flow dependence, a edges representing anti dependence, and o edges
representing output dependence.
FIGURE 9.3: Data dependence graph for the program in Figure 9.1, with flow (f), anti (a), and output (o) dependence edges.
9.2.2.2 Analysis
In this section, we give an overview of the various analyses that have been pro-
posed by compiler researchers to efficiently construct the data dependence graph with
varying degrees of accuracy. All analyses discussed in this section were proposed in
the past few decades. Traditionally, there has always been a trade-off between the
scalability of a data dependence analysis and the accuracy of the constructed data
dependence relation. First, it should be noted that determining the exact dynamic
data dependence relation between all statements in a program is undecidable in gen-
eral [8]. This means any data dependence relation that is constructed by an analysis
is a safe approximation of the actual dependences manifested at runtime. The run-
ning time and memory requirements of a data dependence analysis are, therefore,
inversely related to the accuracy of the analysis: a fast analysis usually reports more
dependences than a slower, more accurate analysis [9].
The complexity of a data dependence analysis also depends on the properties of
the language being analyzed: usually it is easier to analyze data dependences in lan-
guages like FORTRAN 77 that permit only restricted uses of pointer variables than to
analyze languages like C where the use of pointers and heap allocated data structures
create complicated data dependence relations through multiple levels of indirection.
Languages like Java make the analysis of pointers simpler with features like strong
typing and the absence of pointer arithmetic but still have other features, like polymorphism and dynamic class loading, which introduce other complications that a data
dependence analysis has to contend with. In the following paragraphs, we will discuss
the different types of analyses in increasing order of accuracy (and hence increasing
order of complexity).
The reaching definitions analysis [10] is a simple data-flow analysis that can be
used to determine flow dependences between statements in a program that do not have
pointers (or in case of a low-level intermediate representation, the flow dependences
between register/temporary variables).
Relatedly, liveness analysis [10] is a data flow analysis that is used to determine at
each program point the variables that may be potentially read before their next write.
Applying this definition in the context of loops, a variable v is said to be a live-in into
a loop if v is defined at a program point outside the loop and used at a point inside
the loop and there exists a path from the point of definition to the point of use with no
intervening definitions in between. Similarly, a variable is said to be a live-out of a
loop if it is defined at a point inside the loop and used outside the loop and there exists
a path from the point of definition to the point of use with no intervening definitions
in between.
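For example, in the following fragment (an illustration added here, not from the original figures), s is live-in to the loop and also live-out, while t is redefined before every use and is neither:
int sum_doubled(const int *a, int n) {
    int s = 0;                 /* defined before the loop, used inside: live-in  */
    for (int i = 0; i < n; i++) {
        int t = 2 * a[i];      /* always written before it is read: not live across iterations */
        s = s + t;
    }
    return s;                  /* used after the loop: s is live-out */
}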
Reaching definitions analysis, along with liveness analysis, can be used for deter-
mining flow and anti-dependences in a data dependence graph. Output dependences
for programs with no pointer aliasing (two pointers are aliases of each other if they
may point to the same memory location at runtime) can be determined by inter-
secting the variables defined at program points that are reachable from each other.
In programs written in languages that permit the use of pointer variables, pointer
analysis is employed for determining the data dependences between statements. Given
a pair of pointer variables, a pointer analysis determines whether the two variables
can point to the same memory location at runtime (may alias or may not alias). Pointer
analysis plays an important role in disproving data dependences in pointer-based lan-
guages like C and Java, as can be inferred by the large amount of research literature
describing its various forms [11]. Pointer analysis is further augmented by shape
analysis that tries to determine the form or “shape” of allocated data structures, dis-
ambiguating trees from linked lists, etc. For more details about flow-sensitive analysis
[6], field-sensitive analysis [12] and context-sensitive analysis [13], the interested
reader is referred to the relevant citations. More detailed discussion of shape analysis
and the logic used to describe data structures is also beyond the scope of this book
[14–16].
Both pointer and shape analysis, in their basic form, are not loop and array-element
sensitive; they do not disprove dependences between statements accessing different
array elements that execute on different iterations of a loop (called inter-iteration or
loop-carried dependences). The main reason for this is that the complexity of main-
taining and propagating distinct points to sets for each array element on different
iterations of a loop is prohibitive in a pointer-based language like C. Inter-iteration
dependences are often those with which we are most concerned for our paralleliza-
tions. Furthermore, the prevalence of array-based loops in scientific programs has led
researchers to develop a separate class of dependence analysis called the array depen-
dence analysis [17–19]. Array dependence analysis expresses the array accesses in
a loop in the form of linear functions of the loop induction variables and then goes
on to express the conditions for the existence of dependences between these accesses as an integer linear programming (ILP) problem. If the ILP problem does not admit a feasible solution, this proves the absence of a dependence between the multiple iterations of a loop, and hence they can be executed in parallel. Although solving ILP problems in the general
case is NP-hard, there have been different kinds of tests like the GCD test and Baner-
jee test [17,20] that have been successfully applied for specific commonly occurring
cases. It should be noted that array dependence analysis is easier to apply in lan-
guages like FORTRAN 77 (which is traditionally the language of choice for scien-
tific computation) which have restrictive pointer aliasing than in languages like C and
Java where the use of dynamically allocated arrays and the use of pointer indirection
requires both pointer analysis and array dependence analysis in a single framework
to be effective [21].
• There exists a valid execution path p in the control flow graph from A to B
9.2.3.2 Analysis
Ferrante et al. [22] give an algorithm for computing the control dependence edges
using the idea of dominance frontiers on the reverse control flow graph of a program.
In order to understand the term “dominance frontier,” we need to first define what
we mean by the term “dominance.” A node A is said to dominate a node B in a
control flow graph if every path from start node of the control flow graph to B has
to necessarily pass through A. It is not difficult to see that dominance is transitive,
which means that it is possible to arrange the nodes of a control flow graph in the
form of a dominator tree, where there is an edge from A to B (nodes are the nodes
of a control flow graph) if A is an immediate dominator of B (i.e., there is no other
node C such that A dominates C and C dominates B). For example, the dominator
tree for the control flow graph of the program in Figure 9.1 is shown in Figure 9.5. The dominance
frontier of a node A in the control flow graph is defined as the set of all nodes X such
that A dominates a predecessor of X but not X itself. The iterated dominance frontier
of node A is the transitive closure of the dominance frontier relation for the node A.
The reverse control flow graph is the control flow graph of a program with its edges
reversed. Ferrante et al.’s algorithm constructs the dominance frontiers of every node
in the reverse control flow graph, which is the same as the post-dominance frontier
relation. For every node B in the post-dominance frontier of a node A, an edge from
B to A is created in the control dependence graph. For more details on the efficient
FIGURE 9.5: Dominator tree for the control flow graph of the program in Figure 9.1.
FIGURE 9.6: Program dependence graph for Figure 9.1. The data and control
dependence edges are marked D and C, respectively.
FIGURE 9.7: Bad DOALL candidates. (a) WAR dependence. (b) RAW
dependence. (c) WAW dependence. (d) Loop exit condition.
Figure 9.7b violates the no loop-carried RAW dependence condition because a[i]
is written in the previous iteration. As in the previous case, if iteration i+1 is executed before iteration i, the old a[i+1] value can be read.
Figure 9.7c violates the no loop-carried WAW dependence condition because two
different iterations write their results to the same location c. If this loop is parallelized
using DOALL and c is read after the loop, the wrong value can be read.
Figure 9.7d violates the loop exit condition. If the loop is parallelized using
DOALL, later iterations may be executed even though the loop should have exited
before iteration N.
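The code of Figure 9.7 is not reproduced here; loops with the four kinds of problems just described might look like the following sketch (variable names and bounds are illustrative only):
double bad_doall_candidates(double *a, double *b, double limit, int N) {
    double c = 0.0;
    int i;
    for (i = 0; i < N - 1; i++) a[i] = a[i + 1] + 1.0;  /* (a) loop-carried WAR on a     */
    for (i = 1; i < N; i++)     a[i] = a[i - 1] + 1.0;  /* (b) loop-carried RAW on a     */
    for (i = 0; i < N; i++)     c = a[i] * b[i];        /* (c) loop-carried WAW on c     */
    for (i = 0; i < N; i++)                             /* (d) data-dependent loop exit  */
        if (a[i] > limit) break;
    return c;                                           /* c is read after the loop (c)  */
}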
FIGURE 9.8: DOALL example. (a) Original program. (b) Parallelized program.
analysis can tell whether a variable will be used later in the program execution. If tmp
is not live-out, the WAW dependence can be ignored. The WAR dependence can be
also ignored if tmp is privatized. Therefore, using liveness analysis and privatization,
this loop can be parallelized using DOALL as shown in Figure 9.10b. Note that since
tmp is a local variable in each thread, tmp is privatized.
In all of these cases, the intra-iteration dependences are inevitably respected as any
single iteration is executed on a single thread, and thus the compiler does not need to
handle these dependences.
FIGURE 9.10: DOALL example with WAW and WAR dependences. (a) Original
program. (b) Parallelized program.
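The code of Figure 9.10 itself is not shown here; a loop of the described kind, and one way of privatizing tmp (expressed here with an OpenMP private clause purely for illustration), might look like this:
void square_sums(const double *a, const double *b, double *c, int N) {
    double tmp;
    int i;
    /* (a) tmp causes loop-carried WAW and WAR dependences, but it is not live-out */
    for (i = 0; i < N; i++) {
        tmp = a[i] + b[i];
        c[i] = tmp * tmp;
    }
    /* (b) with tmp privatized, every thread uses its own copy and the loop is a DOALL candidate */
    #pragma omp parallel for private(tmp)
    for (i = 0; i < N; i++) {
        tmp = a[i] + b[i];
        c[i] = tmp * tmp;
    }
}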
an array and counting the number of elements meeting some condition. These latter
two reductions are called max/min reduction and count reduction [24].
We will take a dot product code in Figure 9.11a as an example for sum reduction.
As mentioned earlier, this loop cannot be parallelized with the DOALL technique
because there are loop-carried WAW and WAR dependences on sum. Add operations
are commutative and associative, and so we can actually perform any set of add oper-
ations in any order. What this means is that if the sum operations are divided into
several groups and then accumulate all partial-sums from the groups after computa-
tion, the correct result will be achieved. Therefore, this example can be changed into
code shown in Figure 9.11b.
Partial-sums are initialized in the first loop, and they are calculated in each group in
the second loop. In the last loop, the final sum is calculated by accumulating the partial-sum
results. Unlike the previous code, in this code, L1 can be parallelized using DOALL
because L2 loops for different j can be executed at the same time.
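The code of Figure 9.11 is not reproduced here; in C, the transformation described could look roughly as follows, where NUM_THREADS and the even blocking of the iteration space are assumptions of this sketch:
enum { NUM_THREADS = 4 };

double dot(const double *a, const double *b, int n) {   /* assume n % NUM_THREADS == 0   */
    double psum[NUM_THREADS];
    double sum = 0.0;
    int chunk = n / NUM_THREADS;
    int i, j;

    for (j = 0; j < NUM_THREADS; j++)                    /* initialize the partial sums   */
        psum[j] = 0.0;

    for (j = 0; j < NUM_THREADS; j++)                    /* L1: parallelizable with DOALL */
        for (i = j * chunk; i < (j + 1) * chunk; i++)    /* L2: local dot product         */
            psum[j] += a[i] * b[i];

    for (j = 0; j < NUM_THREADS; j++)                    /* accumulate the partial sums   */
        sum += psum[j];
    return sum;
}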
FIGURE 9.12: Speculative DOALL example. (a) Original program. (b) Paral-
lelized program.
FIGURE 9.13: Loop interchange. (a) Original loop. (b) Interchanged loop.
The loops, however, can be interchanged. That is, their order can be swapped with-
out changing the resultant array. If we make this interchange, we generate the code
seen in Figure 9.13b. In this transformed code, our dependence recurrence is now
across the inner loop while iterations of the outer loop fulfill our criteria for making
a successful DOALL parallelization. The benefit is that we no longer need to syn-
chronize and communicate results at the end of every invocation of the inner loop.
Instead, we can distribute iterations of the outer loop to different threads, thus com-
pleting each inner loop invocation on a different thread, and need only communicate
results once at the end.
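Figure 9.13 is not reproduced here; loops of the kind described might look like the following sketch, in which the recurrence on A is carried by the i loop:
void interchange_example(int N, int M, double A[N][M], double B[N][M]) {
    int i, j;
    /* (a) original: the outer i loop carries the recurrence, so a DOALL over the
           inner j loop would need a synchronization after every inner-loop invocation */
    for (i = 1; i < N; i++)
        for (j = 0; j < M; j++)
            A[i][j] = A[i-1][j] + B[i][j];

    /* (b) interchanged: the recurrence is now carried by the inner loop, and the
           iterations of the outer j loop are independent (a DOALL candidate) */
    for (j = 0; j < M; j++)
        for (i = 1; i < N; i++)
            A[i][j] = A[i-1][j] + B[i][j];
}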
Another useful loop transformation that can obtain DOALL parallelism in scien-
tific programs where none exists in its initial form is loop distribution. Loop distribu-
tion takes what is originally a single loop with multiple statements with inter-iteration
dependences and splits it into multiple loops where there are no inter-iteration depen-
dences. Let us take the code in Figure 9.14a as an example. As we can see, there is an
inter-iteration flow dependence on the array A. We cannot immediately use a DOALL
parallelization on our loop because of this dependence. However, we note that all of
the computations in statement 2 could be completed independently of the calcula-
tions in statement 3 (statement 2 does not depend on statement 3). Statement 3, in
fact, could wait and be executed after all calculations from statement 2 are executed.
The compiler, recognizing that statement 3 depends on statement 2 but that statement
2 has no loop-carried dependences, can distribute this loop.
FIGURE 9.14: Loop distribution. (a) Original loop. (b) Distributed loops.
The code generated by distributing this loop is shown in Figure 9.14b. If we look at
this code, the statements in the first loop (statements 1 and 2) very closely resemble
our first DOALL example from Section 9.3.2; it is absolutely a good candidate for
the DOALL technique. The second loop now depends on the first loop but has no
loop-carried dependences. It, too, may be further transformed by the DOALL tech-
nique. By small changes to the code structure, the compiler is able to take one sequen-
tial loop and transform it into two perfect candidates for the DOALL technique.
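Figure 9.14 is not reproduced here; a sketch of the kind of code and transformation described (with illustrative statements) is:
void distribution_example(const double *B, double *A, double *D, int N) {
    int i;
    /* (a) original: statement 3 reads a value of A written by statement 2 in an
           earlier iteration, so the loop carries an inter-iteration flow dependence */
    for (i = 1; i < N; i++) {
        A[i] = B[i] + 1.0;        /* statement 2: no loop-carried dependences */
        D[i] = A[i-1] * 2.0;      /* statement 3: depends on statement 2      */
    }
    /* (b) distributed: each of the two loops is now free of loop-carried
           dependences and is a DOALL candidate */
    for (i = 1; i < N; i++)
        A[i] = B[i] + 1.0;
    for (i = 1; i < N; i++)
        D[i] = A[i-1] * 2.0;
}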
A more detailed discussion of both loop interchange and loop distribution can be
found in various modern compiler texts such as the text by Allen and Kennedy [24].
While these two straightforward transformations are good examples of code trans-
formations that enable extraction of more parallelism with the DOALL technique
in scientific code, they are only samples of the vast toolbox of transformations that
enable further DOALL parallelism. We have seen a few other important transforma-
tions and techniques with privatization, reduction, and speculative DOALL. A few
others are worthy of mention here but cannot be covered in detail.
We have only covered a few techniques amongst many for creating parallelism
from non-parallel loops. Others include loop reversal and loop skewing. While the
loop transformation techniques discussed here unlock parallelism in well-nested loops,
imperfectly nested loops require other techniques to find parallelism. For this, we
may turn to multilevel loop fusion [27]. While these methods all unlock opportunities
for parallelism, we are generally at least as concerned with how we actually schedule
and distribute the workload. A technique called strip mining is useful in vectorization
techniques for packaging parallelism in the best manner for the hardware [28].
Finally, there is one technique that brings together many of these transformations
and techniques, including loop fusion and loop distribution amongst others, in order
to find parallelism. This transformation, called an affine transform, partitions code
into what are called affine partitions. Affine transforms are capable not only of generating parallelism that fits the DOALL technique but also of finding parallelism in array-based programs with any number of communications [29]. In fact, they can be used to find pipeline parallelism as in DOPIPE, which will be discussed in detail in Section 9.5. This makes the affine transform among the most powerful transformations for extracting parallelism from scientific code.
In this section, DOACROSS is introduced as a widely recognized example, and the TLS technique known as speculative parallel iteration chunk execution (Spice) is then discussed.
DOACROSS schedules each iteration on a processor with synchronization added to
enforce cross-iteration dependences. In Section 9.2, we saw in Figure 9.1 a loop that
contained loop-carried dependences. The corresponding program dependence graph
was formed in Figure 9.6. If we assign different iterations of the loop to different cores, synchronizations must be communicated between those cores. As with DOALL (Section 9.3), a single iteration is executed on a single thread, so the intra-iteration dependences are respected automatically. Figure 9.15a shows the DOACROSS execution model for this loop; because one iteration is executed on a single thread, the intra-iteration dependences are omitted from the figure.
In Figure 9.15a, we assume that the inter-core communication latency is only 1
cycle and we are able to achieve a speedup of 2× over single-threaded execution.
However, since synchronizations between different cores are put on the critical path, DOACROSS is quite sensitive to the communication latency. If we increase the communication latency to 2 cycles, as seen in Figure 9.15b, DOACROSS now takes longer to complete each iteration, and the speedup over single-threaded execution shrinks accordingly.
FIGURE 9.15: DOACROSS example. (a) DOACROSS with latency = 1 cycle. (b)
DOACROSS with latency = 2 cycles.
• test (R, δ): R is the variable to be tested and δ is the dependence distance. R is shared by all of the threads and is always equal to the number of iterations that have finished. The test instruction does not complete execution until the value of the variable is at least equal to the number of the iteration containing the source of the dependence being synchronized. In other words, test (R, δ) in iteration i will keep executing until the value of R becomes at least i − δ.
• testset (R): testset checks whether the testset in the previous iteration has
finished or not. If it has, then testset updates the R value to the present iteration
number.
After transformation, the original loop becomes the one in Figure 9.17.
Here, test (R, 2) makes sure that, before statement 1 can execute in the present iteration i, iteration i − 2 has already finished executing, because only after that will the value of R have been set to that iteration number by the instruction testset (R). Now different iterations can be run in parallel on different cores.
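The transformed loop of Figure 9.17 is not reproduced here; the sketch below shows the same idea on an assumed loop with a cross-iteration dependence of distance 2. The primitives test and testset and the shared counter R are those described above; their implementation, and the distribution of the iterations across cores, are assumed to be provided by the runtime system.

#define N 1024

void doacross_example(double A[N], const double B[N]) {
   for (int i = 2; i < N; i++) {    /* iterations are spread across the cores */
      test(R, 2);                   /* block until iteration i-2 has finished */
      A[i] = A[i-2] + B[i];         /* statement 1: consumes a value produced two iterations earlier */
      testset(R);                   /* record that iteration i has finished */
   }
}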
1: while(node) {
2: ncost = calc(node);
3: cost += ncost;
4: if (cost > T)
5: break;
6: node = node->next;
7: }
1. The value of cost is always less than or equal to the value of T, which means the
branch in statement 4 is not taken.
2. The values of nodes are the same in different invocations of the loop, which
means we can use the node value in former invocations to predict the node
value in the present invocation.
Thread 1:
   mispred = 1;
   while(node) {
      ncost = calc(node);
      cost += ncost;
      if (cost > T)
         break;
      node = node->next;
      if (node == predicted_node) {
         mispred = 0;
         break;
      }
   }
   if (!mispred) {
      receive(thread2, cost2);
      cost += cost2;
   }

Thread 2:
   node = predicted_node;
   cost = 0;
   while(node) {
      ncost = calc(node);
      cost += ncost;
      if (cost > T)
         break;
      node = node->next;
   }
   send(thread1, cost);
Figure 9.20 shows the PDG after speculation. The dashed lines mark the dependences that have been speculated out. Now if we have two threads, we will transform
the example code into the code seen in Figure 9.21.
Examining this code, the reader may have one significant question: how do we get
this value for “predicted node” used by thread 2 as a starting position? One possi-
ble answer is a technique known as speculative parallel iteration chunk execution
(Spice) [33]. Spice determines which loop-carried live-ins require value prediction.
It then inserts code to gather values to be predicted. On the first invocation of the
loop, it collects the values—in our case, this is a pointer to a location somewhere in
the middle of our list. Since many loops in practice are invoked many times, the next
invocation can then speculate that the same value will be used. In this case, so long
as the predicted node was not removed from the list (even if nodes around it were
removed), our speculation will succeed.
Since we have two threads here, only one node value needs to be predicted. Thread
1 is executed nonspeculatively. It keeps executing until the node value is equal to the
predicted node value for thread 2. If that happens, it means there is no misspeculation,
and the result from thread 2 can be combined with the result in thread 1. However, if
the node value in thread 1 is never equal to the predicted value, thread 1 will execute
all of the iterations and a misspeculation will be detected in the end. When misspec-
ulation happens, all the writes to registers and memory made by thread 2 must be
undone and the result of thread 2 is discarded.
There has been a great deal of work done with TLS techniques beyond what has
been demonstrated in this section, beyond what could possibly be covered in this
chapter. Some efforts have investigated hardware support and design for TLS tech-
niques as with Stanford's Hydra Chip Multiprocessor (CMP) [34], among many oth-
ers [35–39]. Other TLS works have investigated software support for these techniques
[14,32]. Bhowmik and Franklin investigate a general compiler framework for TLS
techniques [42]. Other work by Kim and Yeung examines compiler algorithms [43].
Work by Johnson et al. examines program decomposition [44]. Further work by Zilles
and Sohi examines a TLS technique involving master/slave speculation [45]. Readers
who are interested are directed to these publications for more information about the
numerous TLS techniques and support systems.
Pipeline parallelization relies on the loop containing pipelineable stages. That is,
DSWP works only if the loop can be split into several stages that do not form cyclic
dependences in the PDG. The example code shown in Section 9.4.2 will not be a
good candidate for DSWP since the two instructions contained in the loop form a
cyclic data dependence. Therefore DSWP does not have the same universal applicabil-
ity as DOACROSS. However, DSWP is not as sensitive to communication cost as
DOACROSS because the execution of the next iteration overlaps with communication.
Communication latency only affects performance as a one-time cost. Therefore,
pipeline parallelization is more latency tolerant than the DOACROSS technique. In
Figure 9.15, we saw that DOACROSS performs poorly when the communication
latency is high. DSWP allows overlapping of communication with execution of the
same stage in the next iteration to hide the latency, as shown in Figure 9.22.
FIGURE 9.23: DSWP code partitioning. (a) DAGSCC of sample loop. (b) DAGSCC
partitioned.
the graph. If each SCC is contracted to a single vertex, the result is a directed acyclic graph of SCCs, called the DAGSCC, as in Figure 9.23a. After that, the DAGSCC is partitioned into a fixed number of threads while making sure that there are no cyclic inter-thread dependences (see Figure 9.23b). In this example, instructions 1 and 4 form an SCC that cannot be split. They have to be placed in the same pipeline stage; otherwise we would have a cyclic inter-thread dependence. More than one valid partition is usually available, so heuristics are used to guide the choice of the most appropriate partition, generating a pipeline that balances workloads and minimizes communication cost [48].
FIGURE 9.24: DSWP partitioned code. (a) Produce thread. (b) Consume
thread.
This DSWP parallelization decouples the execution of the two code slices and
allows the code execution to overlap with the inter-thread communication. Further-
more, the loop is split into smaller slices so that the program can make better use of
cache locality.
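The code of Figure 9.24 is not reproduced here; the following is an illustrative sketch of such a produce/consume partitioning for a simple list-traversal loop. The node type, calc, cost, and the produce/consume queue primitives are assumptions made in the spirit of the chapter's examples.

#include <stddef.h>

struct node { struct node *next; /* ... payload ... */ };

extern double cost;                /* accumulator, as in the chapter's examples */
double calc(struct node *n);       /* expensive per-node work (assumed) */
void produce(struct node *n);      /* inter-thread queue primitives (assumed) */
struct node *consume(void);

/* Produce thread (first stage): the pointer-chasing recurrence
   node = node->next stays local to this thread. */
void produce_thread(struct node *node) {
   while (node) {
      produce(node);               /* send the node down the pipeline */
      node = node->next;
   }
   produce(NULL);                  /* signal the end of the list */
}

/* Consume thread (second stage): receives the nodes in order and performs the
   per-node work, overlapping with the producer's traversal. */
void consume_thread(void) {
   struct node *node;
   while ((node = consume()) != NULL)
      cost += calc(node);
}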
1: while(node) {
2: ncost = calc(node);
3: cost += ncost;
4: if (cost > T)
5: break;
6: node = node->next;
7: }
∗ Profiling involves instrumenting the original code with code to collect data at run time. The instru-
mented code is then compiled and run on representative sets of data. Information collected from the
instrumented code can be used as a supplement for analysis as it provides information that may not be
determined practically (or at all) by static analysis.
Good candidates for speculation include rarely taken control edges, possible memory aliases that rarely or never manifest, and opera-
tions that frequently return the same result. If profiling information is not available, it
has been found that early loop exit branches are often good candidates for speculation
as loops generally execute many times before exiting in practice.
In the example in Figure 9.25, let us assume we know that the early exit for cost
greater than a threshold (cost > T) is rarely taken and thus a good candidate for
speculation. We will also assume that the exit due to a null node is also rarely taken
(profiling reveals we have long lists) and thus a possible candidate for speculation as
well. We see in Figure 9.27 the dashed lines representing the candidates for speculated
dependences.
The second issue in selecting dependences is actually choosing dependences that
break a recurrence. It may be, and in fact often is, the case that it is necessary to specu-
late more than one dependence in order to break a recurrence. Ideally, one would want
to consider speculating all sets of dependences. Unfortunately, exponentially many
dependence sets exist and so a heuristic solution is necessary. In SpecDSWP, this is
handled in a multistep manner. After determining which edges are easily speculat-
able, all of these edges are prospectively removed. The PDG is then partitioned in the
same manner as in DSWP with these edges removed. Edges that have been prospec-
tively removed and would break an inter-thread recurrence that originates later in
the pipeline become speculated. Other edges that do not fit these criteria are effec-
tively returned to the PDG. That is, the dependences are respected and not actually
speculated.
In the example, as seen in Figure 9.28, the DSWP partitioning algorithm
has elected to place nodes 1 and 6 on a thread, statement 2 on a thread, and state-
ments 3, 4, and 5 on a thread. Using this partitioning, it is clear that speculating the
dependences on the early exit due to cost exceeding a threshold (outbound edges
from 4) is necessary. However, edges speculated due to a null node do not break any
inter-thread recurrences. These dependences, therefore, will not be speculated.
Though the previous paragraphs describe what is necessary to select dependences
to speculate, the question of how to check for misspeculation must still be resolved.
This is partly a problem of code generation and partly a problem of support systems.
Unlike speculation in DOACROSS techniques (i.e., TLS), in SpecDSWP every iteration is specu-
lative and no single thread executes a whole iteration. This means that there must
be some commit unit to guarantee a proper order of commit and handle recovery.
Thread 1:
   while (node) { node = node->next; produce(node); }
Thread 2:
   while (TRUE) { node = consume(); ncost = calc(node); produce(ncost); }
Thread 3:
   while (TRUE) { ncost = consume(); cost += ncost; if (cost > T) FLAG_MISSPECULATION(); }
Furthermore, the memory system must allow multiple threads to execute inside the
same iteration at one time. The system must check for proper speculation across all the
threads in the same iteration before it can be committed. These checks for misspecu-
lation may either be inserted into the code by the compiler or handled in a hardware
memory system. SpecDSWP handles most speculation types by inserting checks to
flag misspeculation. The noteworthy exception is the case of memory alias specula-
tion where the memory system is left to detect cases where a memory alias did exist.
In our example, we will insert code to flag misspeculation if we have, in fact, mis-
speculated. We will place this misspeculation flagging code where we can determine whether misspeculation has occurred; this will be in the final thread. Three threads will be
generated from the partitioned PDG in the same way as DSWP. The code generated
is shown in Figure 9.29.
In all cases, the final requirement is a method by which speculative state can be
buffered prior to commit and a unit to commit nonspeculative state. When misspec-
ulation is signaled, the nonspeculative state must be recovered and the work must
be recomputed. Version memory systems [51] that allow multiple threads to operate
on a given version and commit only when the version has been completed and well
speculated have been proposed in hardware and in software but will not be discussed
here.
The considerations made in this section can be summarized as a step-by-step
procedure. In this case, we can break the overall procedure here into six steps:
1. Build the PDG for the loop to be parallelized.
2. Select the dependence edges to speculate.
1: while(node) {
2: p = node->list; // ’p’ is iteration local
3: while(p) {
4: if (p->val == 0) exit();
5: p->val++;
6: p = p->next;
}
7: node = node->next;
}
FIGURE 9.31: PDG of sample code (a) before partitioning and (b) after
partitioning.
The early exit in the inner loop (line 4) creates a control dependence that can be speculated as in Section 9.5.4. With this control dependence removed via speculation, the compiler
can create a two-stage pipeline as before. Figure 9.31b shows the partitioned PDG.
However, the second stage is likely to be much longer than the first stage—it con-
tains a whole linked-list traversal of its own. This imbalance greatly diminishes the
gains from a SpecDSWP parallelization. The fix for this is related to the concept of
independence we saw before with DOALL in Section 9.3. Note that the invocations of
the inner loop (lines 3–6) are independent of one another, if memory analysis proves
that each invocation of the inner loop accesses different linked lists. This stage could
be replicated and all invocations of this loop could be executed in parallel. This par-
allel stage can be executed in parallel on many cores and can be fed with the nodes
found in the first sequential stage of the pipeline, as shown in Figure 9.32. The paral-
lel stage's threads need not communicate with each other, just as DOALL threads need
not communicate with each other. Their only common link is the first DSWP pipeline
stage. Because the parallel stage can be replicated many times, Spec-PS-DSWP is a
highly scalable technique, achieving better speedup as the number of cores increases.
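A rough sketch of such a parallel stage for the loop of Figure 9.30 follows; the queue primitives, the number of replicas, and the data types are assumptions, and the speculated inner-loop exit (line 4) is omitted for brevity.

#include <stddef.h>

#define NUM_WORKERS 4                           /* hypothetical replica count */

struct item { int val; struct item *next; };
struct node { struct item *list; struct node *next; };

void produce_to(int worker, struct node *n);    /* assumed queue primitives */
struct node *consume_from(int worker);

/* Sequential first stage: traverses the outer list and deals the nodes out
   round-robin to the replicated parallel stage. */
void sequential_stage(struct node *node) {
   int t = 0;
   while (node) {
      produce_to(t, node);
      t = (t + 1) % NUM_WORKERS;
      node = node->next;
   }
   for (t = 0; t < NUM_WORKERS; t++)
      produce_to(t, NULL);                      /* end-of-stream marker for each replica */
}

/* Replicated parallel stage: each replica traverses the inner lists of the
   nodes it receives (line 5 of the sample code); the replicas never
   communicate with one another. */
void parallel_stage(int my_id) {
   struct node *node;
   while ((node = consume_from(my_id)) != NULL)
      for (struct item *p = node->list; p != NULL; p = p->next)
         p->val++;
}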
9.7 Conclusion
Single-threaded programs cannot make use of the additional cores on a multicore
machine. Thread-level parallelism (TLP) must be extracted from programs to achieve
meaningful performance gains. Automatic parallelization techniques that extract
threads from single-threaded programs expressed in the sequential programming
model relieve the programmer of the burden of identifying the parallelism in the
application and orchestrating the parallel execution on different generations of target
architectures. This solution appears even more appealing in the face of immense pro-
grammer productivity loss due to the limited tools to write, debug, and performance-
optimize multi-threaded programs. Automatic parallelization promises to allow the
programmer to focus attention on developing innovative applications and providing
a richer end-user experience, rather than on optimizing the performance of a piece of
software.
Automatic parallelization techniques focus on extracting loop-level parallelism.
Executing each iteration of a loop concurrently with others (DOALL) is a highly
scalable strategy, but few general-purpose loops have dependence patterns amenable
to such execution. In the face of loop-carried dependences, alternative schemes such
as DOACROSS and DSWP have been developed. Like DOALL, DOACROSS sched-
ules independent iterations for concurrent execution while synchronizing dependent
iterations. However, DOACROSS puts the inter-thread communication latency on
the critical path as seen in Figure 9.15. DSWP addresses this problem by partition-
ing the loop body in such a way as to keep dependence recurrences local to a thread
and communicate only off-critical path dependences in a unidirectional pipeline as in
Figure 9.22. However, DSWP’s scalability is restricted by the balance among, and the
number of, pipeline stages. PSDSWP [52] addresses this issue: It combines the appli-
cability of DSWP with the scalability of DOALL by replicating a large pipeline stage
across multiple threads, provided the stage has no loop-carried dependences with
respect to the loop being parallelized. All of these techniques depend on compile-time
dependence analyses that, unfortunately, become very complicated on programs that
exhibit complex memory access patterns. Worse yet, some dependences can be dis-
covered only at runtime. Speculative execution has been proposed to overcome these
limitations. Thread-level speculation (TLS) techniques allow the compiler to opti-
mistically ignore statically ambiguous dependences, by relying on a runtime system
to detect misspeculation and resume execution from a nonspeculative program state.
Historically, automatic parallelization was successful primarily in the numeric and
scientific domains on well-behaved, regular programs written in Fortran-like lan-
guages [53,54]. The techniques discussed in this chapter expand the scope of auto-
matic parallelization by successfully extracting parallelism from general-purpose
programs, such as those from the SPEC benchmark suite, written in Fortran, C, C++,
or Java [4,37,55–58].
In order to speed up a wider range of applications, a modern auto-parallelizing
compiler must possess an arsenal of parallelization techniques and optimizations.
DSWP and its extensions, SpecDSWP and PSDSWP, have proven to be robust and
scalable and will form an integral part of the arsenal. Traditional data-flow analyses to
compute register and control dependences and state-of-the-art memory dependence
analyses will be needed to statically disprove as many dependences as possible. Addi-
tionally, to be able to target loops at any level in the program, from innermost to out-
ermost, it is necessary that all analyses and techniques have interprocedural scope.
Speculation, synchronization placement, and other optimizations require accurate
loop-aware profile information. Branch profilers, silent store value profilers, memory
alias profilers, etc., will all be part of the compilation framework. To support specula-
tion, runtime support either in software or hardware will be necessary. This layer will
be responsible for detecting misspeculation (the checks may be inserted at compile
time) and rolling back program state. In summary, such a tool-chain will allow a soft-
ware developer to continue developing in a sequential programming model using the
robust development environments of today while still obtaining scalable performance
on the multicore and manycore architectures of tomorrow.
Future parallelization systems must be capable of dynamically adapting the com-
piled application to the underlying hardware and the execution environment. Instead
of complicating the static optimization phase and incorporating all possible dynamic
scenarios, a better solution is to assume a generic runtime configuration and perform
aggressive optimizations statically while simultaneously embedding useful metadata
(hints) into the binary. A runtime system can then use this metadata to adapt the code
to the underlying operating environment. Adaptation to resource constraints, execu-
tion phases in the application, and dynamic dependence patterns will be necessary.
As the number of cores on a processor grows, one or more cores may be dedicated
to the tasks of collecting profile information and performing dynamic program trans-
formation. To reduce the overheads of speculation and profiling, significant advances
in memory dependence analysis must be made to inform the compiler that some
dependences do not exist and hence need not be profiled and speculated. Research
in memory shape analysis [14–16] promises to be an interesting avenue to address
this issue.
Automatic thread extraction will not be restricted just to sequential programs; as
many parallel programs have been and will continue to be written, individual threads
of these programs must be further parallelized if they are to truly scale to hundreds
of cores. Furthermore, just as modern optimizing ILP compilers include a suite of
optimizations with many of them interacting in a synergistic fashion, so must the var-
ious parallelization techniques be applied synergistically, perhaps hierarchically, at
multiple levels of loop nesting, to extract maximal parallelism across a wide
spectrum of applications.
References
1. J.C. Corbett. Evaluating deadlock detection methods for concurrent software.
IEEE Transactions on Software Engineering, 22(3):161–180, 1996.
15. B. Guo, N. Vachharajani, and D.I. August. Shape analysis with inductive recur-
sion synthesis. In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN Confer-
ence on Programming Language Design and Implementation, pp. 256–265, New
York, 2007. ACM, New York.
16. C. Calcagno, D. Distefano, P. O’Hearn, and H. Yang. Compositional shape anal-
ysis by means of bi-abduction. In POPL ’09: Proceedings of the 36th Annual
ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages,
pp. 289–300, New York, 2009. ACM, New York.
17. K. Kennedy and J.R. Allen. Optimizing Compilers for Modern Architec-
tures: A Dependence-Based Approach. Morgan Kaufmann Publishers Inc., San
Francisco, CA, 2002.
18. W. Pugh. A practical algorithm for exact array dependence analysis. Communi-
cations of the ACM, 35(8):102–114, 1992.
19. D.A. Padua and Y. Lin. Demand-driven interprocedural array property analysis.
In LCPC ’99, pp. 303–317, London, U.K., 1999.
20. U. Banerjee. Data dependence in ordinary programs. Master’s thesis, University
of Illinois at Urbana-Champaign, Champaign, IL, 1976.
21. V. Sarkar and S.J. Fink. Efficient dependence analysis for Java arrays. In Euro-
Par ’01: Proceedings of the 7th International Euro-Par Conference Manch-
ester on Parallel Processing, pp. 273–277, London, U.K., 2001. Springer-Verlag,
Berlin, Germany.
22. J. Ferrante, K.J. Ottenstein, and J.D. Warren. The program dependence graph
and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3):319–349, 1987.
23. A.J. Bernstein. Analysis of programs for parallel processing. IEEE Transactions
on Electronic Computers, 15(5):757–763, October 1966.
24. R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures.
Morgan Kaufmann, San Francisco, CA, 2002.
25. H. Zhong, M. Mehrara, S. Lieberman, and S. Mahlke. Uncovering hidden loop
level parallelism in sequential applications. In HPCA’08: Proceedings of the 14th
International Symposium on High-Performance Computer Architecture (HPCA),
Ann Arbor, MI, February 2008.
26. J. Allen and K. Kennedy. Automatic loop interchange. In Proceedings of the ACM
SIGPLAN ’84 Symposium on Compiler Construction, pp. 233–246, Montreal,
Quebec, Canada, June 1984.
27. Q. Yi and K. Kennedy. Improving memory hierarchy performance through com-
bined loop interchange and multi-level fusion. International Journal of High Per-
formance Computing Applications, 18(2):237–253, 2004.
28. A. Wakatani and M. Wolfe. A new approach to array redistribution: Strip mining
redistribution. In Proceedings of Parallel Architectures and Languages Europe
(PARLE 94), pp. 323–335, London, U.K., 1994.
29. A.W. Lim and M.S. Lam. Maximizing parallelism and minimizing synchroniza-
tion with affine partitions. Parallel Computing, 24(3–4):445–475, 1998.
30. S.P. Midkiff and D.A. Padua. Compiler algorithms for synchronization. IEEE
Transactions on Computers, 36(12):1485–1495, 1987.
31. P. Tang, P.-C. Yew, and C.-Q. Zhu. Compiler techniques for data synchronization
in nested parallel loops. SIGARCH Computer Architecture News, 18(3b):177–
186, 1990.
32. H.-M. Su and P.-C. Yew. Efficient doacross execution on distributed shared-
memory multiprocessors. In Supercomputing ’91: Proceedings of the 1991
ACM/IEEE Conference on Supercomputing, pp. 842–853, New York, 1991.
ACM, New York.
33. E. Raman, N. Vachharajani, R. Rangan, and D.I. August. Spice: Speculative par-
allel iteration chunk execution. In Proceedings of the 2008 International Sympo-
sium on Code Generation and Optimization, New York, 2008.
34. L. Hammond, B.A. Hubbert, M. Siu, M.K. Prabhu, M. Chen, and K. Olukotun.
The Stanford Hydra CMP. IEEE Micro, 20(2):71–84, January 2000.
35. H. Akkary and M.A. Driscoll. A dynamic multithreading processor. In Proceed-
ings of the 31st Annual ACM/IEEE International Symposium on Microarchi-
tecture, pp. 226–236, Los Alamitos, CA, 1998. IEEE Computer Society Press,
Washington, DC.
36. P. Marcuello and A. González. Clustered speculative multi-threaded proces-
sors. In Proceedings of the 13th International Conference on Supercomputing,
pp. 365–372, New York, 1999. ACM Press, New York.
37. J.G. Steffan, C. Colohan, A. Zhai, and T.C. Mowry. The STAMPede approach to
thread-level speculation. ACM Transactions on Computer Systems, 23(3):253–
300, February 2005.
38. J.-Y. Tsai, J. Huang, C. Amlo, D.J. Lilja, and P.-C. Yew. The superthreaded
processor architecture. IEEE Transactions on Computers, 48(9):881–902,
1999.
39. T.N. Vijaykumar, S. Gopal, J.E. Smith, and G. Sohi. Speculative versioning
cache. IEEE Transactions on Parallel and Distributed Systems, 12(12):1305–
1317, 2001.
40. M. Cintra and D.R. Llanos. Toward efficient and robust software speculative par-
allelization on multiprocessors. In PPoPP ’03: Proceedings of the Ninth ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming,
pp. 13–24, New York, 2003. ACM, New York.
Auto-Tuning Parallel Application Performance
10.1 Introduction
Software needs to be parallelized to exploit the performance potential of multi-
core chips. Unfortunately, parallelization is difficult and often leads to disappoint-
ing results. A myriad of parameters such as choice of algorithms, number of threads,
size of data partitions, number of pipeline stages, load balancing technique, and other
issues influence parallel application performance. It is difficult for programmers to set
these parameters correctly. Moreover, satisfactory choices vary from platform to plat-
form: Programs optimized for a particular platform may have to be retuned for others.
Given the number of parameters and the growing diversity of multicore architectures,
a manual search for satisfactory parameter configurations is impractical. This chapter
presents a set of techniques that find near-optimal parameter settings automatically,
by repeatedly executing applications with different parameter settings and searching
for the best choices. This process is called auto-tuning.
[Figure: tunable architecture of the application, organized in three layers: a pipeline layer (Stages 1–4), a module layer (modules M1–M10, including pre- and postprocessing), and a data layer (input partitioning into bins and result consolidation).]
In the past, most auto-tuning work has focused on the properties of numerical appli-
cations [7,29]. The methods applied are highly specialized and require significant pro-
gramming effort to prepare applications for tuning. We present a broader approach
that tunes general-purpose applications. This approach is highly automated, requiring
merely a short specification of the parallel architecture of the application to be tuned.
All tuning-related code is generated automatically or supplied by libraries. Of course,
the tuning process itself is also fully automated. The rest of the chapter discusses this
approach and how it compares to other work.
10.3 Terminology
This section provides an overview of auto-tuning terminology and a classification
of various auto-tuning approaches.
10.3.1 Auto-Tuning
Tuning parameter: A tuning parameter p represents a program variable with a value set Vp, where v ∈ Vp is called a parameter value of p. Adjusting p by assigning different parameter values v ∈ Vp may have an impact on program performance.
Parameter configuration: A parameter configuration C is a tuple consisting of a value for each tuning parameter within a given program. Let P = {p1, p2, . . . , pn} be the finite set of all tuning parameters in the program. Then a configuration C is an n-tuple (v1, v2, . . . , vn), where vi ∈ Vpi and i ∈ {1, . . . , n}.
Search space: The search space S of an auto-tuner is the Cartesian product of all tuning parameter value sets of a given program. Let P = {p1, p2, . . . , pn} be the set of all tuning parameters of a program and Vpi the value set of pi (with i ∈ {1, 2, . . . , n}). Then the search space is defined as

S = Vp1 × Vp2 × · · · × Vpn     (10.1)
The size |S| of the search space (i.e., the number of parameter configurations in S) is the product of the sizes of the individual value sets, |S| = |Vp1| · |Vp2| · . . . · |Vpn|.
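For example, a program with three tuning parameters whose value sets contain 4, 8, and 2 values, respectively, has a search space of 4 · 8 · 2 = 64 parameter configurations.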
[Figure: classification of auto-tuning approaches along three dimensions: (1) tuning type, off-line vs. online; (2) time of tuning, development time vs. production time; (3) tuning strategy, model-based vs. empirical.]
Tuning strategy 3
repeated over and over again. The program must also run long enough to allow
an auto-tuner to sample enough parameter configurations in a single execution.
If there is not enough time, online and off-line tuning can be combined.
Time of tuning (dimension 2): The time of tuning specifies when the auto-tuner is
active. We distinguish between development time and production time.
• Development time: Tuning runs are performed during the final test phase of the
development process. The auto-tuner becomes a development tool finding the
best parameter configuration for a specific target platform.
• Production time: Tuning runs are performed during initial production runs, or
periodically for re-tuning. Tuning may slow down the application because of
data gathering and because unfavorable configurations must be tried out.
[Figure: overview of the tuning workflow: specify tuning instructions, apply methods for search space partitioning and reduction, and tune the application.]
Changes to a program's architecture are rarely made because they require invasive and complex
code changes. Developers are hesitant to make such changes because they prefer to
avoid the risk of introducing bugs.
Automated software architecture adaptation can be based on using parallel pat-
terns as building blocks. Similar to design patterns introduced in [8], a parallel pattern
describes a solution for a class of parallelization problems. Examples of commonly
used patterns are the pipeline pattern and the fork-join pattern. A pattern’s behavior
and its specific performance bottlenecks can be studied and this knowledge can be
exploited to (1) equip patterns with tuning parameters and (2) use patterns as prede-
fined configurable building blocks for a parallel program. We call parallel patterns
exposing tuning parameters tunable patterns.
We employ a predefined set of tunable patterns to define a program’s architecture.
Compared to low-level parallelization techniques (not normally considered during
the design phase of a program), this approach offers several advantages: a structured
top-down design, modularity, and the exploitation of coarse-grained parallelism.
The following section describes our tunable architectures approach and CO2P3S,
a related approach from the literature.
10.5.1.2 Connectors
Connectors are operators composing atomic components or other connectors. Thus,
a tunable architecture has a tree structure with connectors as nodes and atomic compo-
nents as leaves. In contrast to the definition of connectors in architecture description
languages [16], where a connector handles the communication between two com-
ponents, a TADL connector defines a set of components including the interactions
among those components.
TADL provides five connector types: (1) Sequential composition encloses child
items that must not execute concurrently. (2) Tunable alternative expresses an exclu-
sive alternative of two or more components. (3) Tunable fork-join introduces task
parallelism. That is, all enclosed components run in parallel (fork) followed by an
implicit barrier (join). (4) Tunable producer-consumer provides a technique to pass
data from one component to another. (5) Tunable pipeline introduces pipeline paral-
lelism. Each enclosed component represents a pipeline stage.
A connector encloses a set of child items that can be either atomic components or
other connectors. Every child item enclosed by a connector can have an associated
input and output component for providing and retrieving data. Similar to atomic com-
ponents, we assume that input and output components are implemented as methods
in an object-oriented program.
[Figure 10.4: parallel speedup of each application (bar chart omitted) and the corresponding table:]

                            MID    GrGen   DS     Video   MB     Matrix
(1) Worst configuration     1.3    1.8     1.1    1.2     2.0    2.2
(2) Standard configuration  2.3    6.5     1.2    1.3     4.7    7.9
(3) Best configuration      3.1    7.1     6.5    5.0     7.9    8.3
(4) Optimal configuration   3.1    7.7     6.9    5.6     7.9    8.4
TPG 1                       138%   294%    491%   317%    295%   277%
TPG 2                       35%    9%      442%   285%    68%    5%
MID is the commercial biological data analysis application that has been paral-
lelized as described earlier in Section 10.2. The performance numbers are obtained
for an input file of 1 GB that contains mass spectrogram data. GrGen is a graph rewrit-
ing system. The benchmark simulates gene expression as a graph rewriting task on
E. coli DNA with over nine million graph nodes. DS is a parallel desktop search
application that generates an index for text files. The input consists of over 10,000
ASCII text files with sizes between 6 kB and 613 kB. The video processing applica-
tion applies four filters in parallel on the picture stream of an AVI file, as discussed
earlier. The input consists of 180 frames at a resolution of 800 × 600 pixels. MB
computes the Mandelbrot set and Matrix is the multiplication of two 1024 × 1024
matrices.
The table in Figure 10.4 shows for each application the parallel speedup with the
worst configuration (1), with a default configuration (2), with the best configuration
found by the auto-tuner (3), and the optimum configuration (4) obtained by exhaus-
tive search. The default configuration is what a competent developer might guess.
The tuning performance gain between the worst and the best found configuration
(TPG1) shows that auto-tuning produces significant speedup gains, from 138% to
almost 500%. The performance gain between the default configuration and the best
found configuration (TPG2) is smaller. As can be seen, the developer’s guess is far off
the optimum in two cases, and moderately off in two others. The auto-tuner always
finds a configuration that is close to the optimum.
It is also interesting to note that the auto-tuner may choose non-obvious config-
urations. In the video application, the final stage is the slowest one. In the optimal
configuration, this stage is replicated eight times. Since the other stages are quite
fast, they are fused with the replicates of the final stage. The result is a pipeline that is
replicated eight times, but each replica runs sequentially. Thus, there is no buffering
or synchronization except at the beginning and the end of the pipelines. DS also has a
non-obvious configuration. Apparently, it is precisely in those non-obvious situations
where the auto-tuner works best.
10.5.2 CO2P3S
CO2P3S (correct object-oriented pattern-based parallel programming system) [15]
combines object-oriented programming, design patterns, and frameworks for parallel
programming. The approach only supports the development of new parallel programs.
CO2P3S has a graphical interface. The user chooses a parallel pattern for the entire
program structure. The system generates the appropriate code including instructions
for communication and synchronization. The generated code provides a predefined
structure with placeholders (called hook methods) for sequential code. The user has
to implement these hook methods by inheriting the corresponding framework classes
and overriding the methods. The user can thus choose to work on the intermediate
code layer or the native code layer.
The approach uses parameterized pattern templates to influence the code genera-
tion process. The programmer can configure the selected pattern and adapt it for a
specific use. For example, the parameters for a two-dimensional regular mesh tem-
plate include the topology, the number of neighboring elements, and the synchroniza-
tion level [15]. Although the selection of an appropriate pattern variant can influence
program performance, the template parameters do not necessarily represent tuning
parameters, as they do not primarily address performance bottlenecks relevant for par-
allelism. Moreover, the parameters cannot be directly adjusted after code generation.
The approach aims to simplify the creation of a parallel application and consists
of the following five steps [15]:
1. Identify one or more parallel design patterns that can contribute to a solution
and pick the corresponding design pattern templates.
2. Provide application-specific parameters for the design pattern templates to gen-
erate collaborating framework code for the selected pattern templates.
3. Provide application-specific sequential hook methods and other nonparallel
code to build a parallel application.
4. Monitor the parallel performance of the application and if it is not satisfactory,
modify the generated parallel structure code at the intermediate code layer.
5. Re-monitor the parallel performance of the application and if it is still not
acceptable, modify the actual implementation to fit the specialized target archi-
tecture at the native code layer.
10.5.3 Comparison
Both approaches exploit parallelism at the software architecture level. Tunable
architectures consist of a description language to describe parallel architectures. In
contrast to CO2P3S, the association between program code and architecture
description is part of the design process. This makes tunable architectures applicable
for new programs as well as for existing programs. Architecture adaptation and auto-
matic performance optimization represent an integral part of the tunable architectures
approach. CO2P3S employs the generation of framework code that uses calls to hook
methods to integrate application-specific code. The framework supplies the appli-
cation structure and the programmer writes application-specific utility routines. The
configuration of the program and its patterns happens before code generation, which
results in a hard-coded variant. Manual performance tuning is required to optimize
the patterns after implementation.
10.6.1 Atune-IL
Atune-IL is used in our auto-tuning system. It targets parallel applications in gen-
eral, not just numerical programs, and works with empirical auto-tuners [25].
Atune-IL uses pragma directives (such as #pragma for C++ and C# or /*@ for
Java) that are inserted into regular code. Atune-IL provides constructs for declaring
tuning parameters, permutation regions (to allow an auto-tuner to reorder source code
statements), and measuring points. Additional statements describe software archi-
tecture meta-information that helps an empirical auto-tuner reduce the search space
before employing search algorithms. Therefore, Atune-IL is able to capture the struc-
ture of an application along with characteristics important for tuning, such as parallel
sections and dependencies between tuning parameters.
Atune-IL uses tuning blocks to define scopes and visibility of tuning parameters
and to mark program sections that can be optimized independently. A developer can
mark two code sections to be independent if they cannot be executed concurrently in
any of the application’s execution paths. Tuning blocks can be nested. Global tuning
parameters and measuring points are automatically bound to an implicit root tuning
block that wraps the entire program. Thus, each parameter and each measuring point
belongs to a tuning block (implicitly or explicitly). Each independent block structure
represents a tuning entity.
Tunable parameters can be defined with the setvar statement, which means that
the value of a variable preceded by such a statement can be set by an auto-tuner. The
number of threads to create is an example of such a variable. A measuring point is
defined by inserting the gauge statement. Two such statements with the same iden-
tifier in the same block define two measuring points whose difference will be used by
the auto-tuner. Tuning blocks are defined by startblock and endblock. List-
ing 10.2 shows an example of an instrumented program that searches strings in a
text, stores them in an array, and sorts the array using parallel sorting algorithms.
Finally, the program counts the total number of characters in the array. The tuning
block sortBlock is logically nested within fillBlock, while countBlock
is considered to be independent from the other tuning blocks. Thus, the parameters
sortAlgo and depth in sortBlock have to be tuned together with the permutation parameter fillOrder in fillBlock, as the order of the array elements influences sort-
ing. This results in two separate tuning entities that can be tuned one after the other.
The sizes of the search spaces are |dom(fillOrder)| · |dom(sortAlgo)| · |dom(depth)| and |dom(numThreads)|, respectively. In addition, the dependency of depth on sortAlgo can
be used to prune the search space, because invalid combinations are ignored.
#pragma atune startpermutation fillOrder
#pragma atune nextelem
   words.Add(text.Find("cars"));
#pragma atune nextelem
   words.Add(text.Find("do"));
#pragma atune nextelem
   words.Add(text.Find("Auto-tuning"));
#pragma atune endpermutation

   sortParallel(words);
#pragma atune gauge mySortExecTime
#pragma atune endblock fillBlock

   countWords(words);
}

// Sorts string array
void sortParallel(List<string> words) {
#pragma atune startblock sortBlock inside fillBlock
   IParallelSortingAlgorithm sortAlgo = new ParallelQuickSort();
   int depth = 1;
#pragma atune setvar sortAlgo type generic
   values "new ParallelMergeSort(depth)", "new ParallelQuickSort()"
   scale nominal
#pragma atune setvar depth type int
   values 1-4 scale ordinal
   depends sortAlgo = 'new ParallelMergeSort(depth)'
   sortAlgo.Run(words);
#pragma atune endblock sortBlock
}

// Counts total characters of string array
int countCharacters(List<string> words) {
#pragma atune startblock countBlock
#pragma atune gauge myCountExecTime
   int numThreads = 2;
#pragma atune setvar numThreads type int
   values 2-8 scale ordinal
   int total = countParallel(words, numThreads);
#pragma atune gauge myCountExecTime
   return total;
#pragma atune endblock countBlock
}
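As a rough illustration of this pruning, under the assumptions that the permutation region fillOrder contributes 3! = 6 orderings and that the depends clause discards the values of depth whenever ParallelQuickSort is chosen, the first tuning entity shrinks from 6 · 2 · 4 = 48 configurations to 6 · (4 + 1) = 30, while the second entity contains only the 7 values of numThreads.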
10.6.2 X-Language
The X-Language [4] provides constructs to declare source code transformations.
The software developer specifies transformations on particular parts of the source
code (such as a loop). X-Language uses #pragma directives to annotate C/C++
source code. The X-Language interpreter applies the specified code transformations
before the program is compiled.
The X-Language offers a compact representation of several program variants and
focuses on fine-grained loop optimizations. Listing 10.3 shows a simple example of
how to annotate a loop to automatically unroll it with a factor of 4. Listing 10.4 shows
the code resulting from the transformation.
Listing 10.3: Annotated Source Code with X-Language
#pragma xlang name l1
for (i = 0; i < 256; i++) {
   s = s + a[i];
}
#pragma xlang transform unroll l1 4
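Listing 10.4 is not reproduced here; under the annotation above, unrolling l1 by a factor of 4 would yield code along the following lines (a sketch; since 256 is divisible by 4, no remainder loop is needed):

for (i = 0; i < 256; i += 4) {
   s = s + a[i];
   s = s + a[i+1];
   s = s + a[i+2];
   s = s + a[i+3];
}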
Custom code transformations can be defined with pattern matching rewrite rules.
The X-Language does not include an optimization mechanism.
10.6.3 POET
POET (parameterized optimizations for empirical tuning [30]) is a domain-specific
language for source code transformations. POET uses an XML-like script to specify
transformations, parameters, and source code fragments. A source code generator
produces a compilable program based on such a script.
Compared to the X-Language, a POET script contains the complete code frag-
ments of the original application that are relevant for tuning. POET also allows the
generic definition of transformations that can be reused in other programs. POET
focuses on small numerical applications solving one particular problem, and does
not provide explicit constructs to support parallelism. It works at the detailed state-
ment level rather than the architecture level and is quite difficult to use. Specifying
the loop unrolling in the previous example requires 30 lines of POET code.
10.6.4 Orio
The Orio [9] tuning system consists of an annotation language for program variant
specification and an empirical tuning component to find the best-performing variant.
Orio parses annotated source code (limited to C code), generates new source code
to create program variants, and searches for the variant with the shortest execution
time. Although this process is straightforward, Orio’s annotation language is complex.
For example, instrumenting a single loop in the program’s source (e.g., for automatic
loop unrolling) requires a large external script and multiple annotations. Moreover,
parts of the source code have to be redundantly embedded within the annotation
statements.
Orio uses separate specifications for tuning parameters, while the source code
annotations contain the actual transformation instructions. This technique allows the
reuse of parameter declarations and keeps the source code cleaner. However, the anno-
tations entirely replicate the source code in order to place the transformation instruc-
tions. This results in code duplication that is not suitable for large applications.
Orio supports automatic code transformation at loop level. Its tuning component
uses two heuristics: the Nelder–Mead simplex method and simulated annealing. How-
ever, the experimental setup presented in [9] demonstrates applicability only on small
numerical programs.
10.6.5 Comparison
The tuning instrumentation languages mainly differ in their level of granularity and
in the targeted problem domain. Orio and POET are suitable for numerical applica-
tions and perform typical code transformations on loop level required in that domain.
From a software engineering perspective, they have the drawback of requiring code
duplication. This limits their applicability for larger projects and makes maintenance
and debugging difficult. The X-Language also targets scientific programs and loop-level transformations.
[Figure: the empirical tuning cycle: a parallelized and instrumented program with tuning instructions (0) is optimized by repeatedly (1) calculating parameter values, (2) configuring the parameterized, executable program variant, and (3) executing and monitoring the program, feeding the measured performance values back into step 1.]
We call one cycle of steps 1 through 3 a tuning iteration. With each tuning iteration, the auto-tuner samples the search space and maps a parameter configuration to a performance value.
• Hill climbing: This widely used algorithm randomly selects an initial configuration and follows the steepest ascent in the neighborhood (a minimal code sketch of this local-search style is given at the end of this section). A major drawback is that the algorithm can get stuck in local optima.
• Simulated annealing: The inspiration for this approach comes from the process of annealing in metallurgy. Controlled heating and cooling of a material results in a stable crystal structure with reduced defects. The process must be performed slowly in order to form an optimal uniform structure of the material. By analogy with this physical process, simulated annealing does not always continue with the best configuration found so far, but temporarily chooses a poorer one. The probability of this choice is governed by a global parameter T (the temperature), which is gradually decreased with each tuning iteration. This mechanism keeps the algorithm from getting stuck in local optima. Simulated annealing thus improves on hill climbing: it allows a temporary decline in the tested configuration's quality in exchange for a better chance of finding the global optimum in the long run.
• Tabu search: This approach performs an iterative search as well, but keeps a
tabu list that contains parameter configurations that are temporarily excluded
from testing. This method avoids a circular exploration of the search space. A
commonly used tabu strategy adds the current configuration’s complement to
the list.
Global search: Global search algorithms start with several initial parameter config-
urations. They differ from local searchers mainly in the strategy
defining how new configurations are selected. Well-known approaches are swarm
optimization [13] and genetic algorithms [6].
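As an illustration of the local-search style described above, the following is a minimal hill-climbing sketch over discrete tuning parameters. The measure() function is a toy stand-in for one tuning iteration; a real auto-tuner would execute and time the instrumented program instead.

#include <stdlib.h>

#define NPARAMS 3                           /* hypothetical number of tuning parameters */

/* Toy stand-in: pretend execution time is minimal at the configuration (2, 1, 3). */
static double measure(const int cfg[NPARAMS]) {
   static const int opt[NPARAMS] = {2, 1, 3};
   double t = 1.0;
   for (int i = 0; i < NPARAMS; i++)
      t += (double)(cfg[i] - opt[i]) * (cfg[i] - opt[i]);
   return t;
}

/* nvalues[i] is the number of values of parameter i; cfg[i] indexes one of them. */
void hill_climb(const int nvalues[NPARAMS], int cfg[NPARAMS]) {
   for (int i = 0; i < NPARAMS; i++)        /* random initial configuration */
      cfg[i] = rand() % nvalues[i];
   double best = measure(cfg);
   int improved = 1;
   while (improved) {                       /* stop when no neighbor is better */
      improved = 0;
      int best_next[NPARAMS];
      double best_next_time = best;
      for (int i = 0; i < NPARAMS; i++) {   /* examine all neighboring configurations */
         for (int d = -1; d <= 1; d += 2) {
            int cand[NPARAMS];
            for (int j = 0; j < NPARAMS; j++) cand[j] = cfg[j];
            cand[i] += d;
            if (cand[i] < 0 || cand[i] >= nvalues[i]) continue;
            double t = measure(cand);
            if (t < best_next_time) {       /* steepest descent in execution time */
               best_next_time = t;
               for (int j = 0; j < NPARAMS; j++) best_next[j] = cand[j];
               improved = 1;
            }
         }
      }
      if (improved) {
         for (int j = 0; j < NPARAMS; j++) cfg[j] = best_next[j];
         best = best_next_time;
      }
   }                                        /* cfg now holds a local optimum */
}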
10.7.3.1 Atune
Atune [24,25] is our off-line empirical auto-tuner. It works with programs instru-
mented with Atune-IL (see Section 10.6.1). In combination with TADL and its paral-
lel patterns, Atune supports the automatic optimization of large parallel applications
on an architectural level. The auto-tuner is general-purpose, that is, not tied to a spe-
cific problem domain.
Atune follows a multistage optimization process:
1. Create tuning entities: First, Atune analyzes the program structure using Atune-
IL's tuning blocks and creates tuning entities. The result of this step is one or
more search space partitions that will be handled independently. The partitions
are automatically tuned one after the other.
2. Analyze tuning entities: For each tuning entity, Atune extracts the tuning instruc-
tions, such as parameter declarations, and measuring points. Furthermore, it
classifies the tuning parameters according to their scale (nominal or ordinal).
This differentiation is important for the tuning process, as common search tech-
niques require an ordinal scale. Nominally scaled parameters, for instance for
expressing algorithm alternatives, must be handled differently. Atune first sam-
ples the nominal parameters randomly and then carries the best choice for them forward into the next optimization phase.
Atune also gathers information about the patterns present and applies suitable
heuristics in this case. For example, if the tuning entity consists of the pipeline
pattern, it will use a special search procedure that balances the pipeline stages
by replicating and fusing stages.
3. Search: Each tuning entity that still contains unconfigured parameters will be
auto-tuned using search algorithms like the ones described in the previous
section.
The multistage optimization process reduces the search space, so Atune is able to
handle complex parallel applications consisting of several (nested) parallel sections
exposing a large number of tuning parameters.
10.7.3.2 ATLAS/AEOS
ATLAS (automatically tuned linear algebra software) [29] combines a library with
a tuning tool called AEOS (automated empirical optimization of software) to find the
fastest implementation of a linear algebra operation on a specific hardware platform.
The tool generates high-performance libraries for a particular hardware platform
using domain-specific knowledge about the problem to solve. AEOS performs the
source code optimizations in an iterative off-line tuning process, while ATLAS rep-
resents the optimized library.
The authors of AEOS/ATLAS also developed a model-based tuning approach.
Instead of feeding the performance results back into the optimization process to grad-
ually adjust the parameter values, a manually created analytical model is used to
directly calculate the best parameter configuration.
An enhancement of AEOS/ATLAS is described in [5]. The approach combines
model-based and empirical tuning. The idea is to reduce the initial search space using
an analytical model, before the search starts. The models are created using machine
learning approaches. If limited to the numerical programs targeted in ATLAS, the
approach offers promising results. Furthermore, this work is one of the few approaches
that specifically targets the reduction of the search space before applying any search
algorithm.
Although AEOS/ATLAS and other tuning approaches for numerical applications
(such as FFTW [7], FIBER [12], and SPIRAL [23]) form the background of today’s
auto-tuning concepts, they focus on small application domains and numerical
problems.
10.7.4 Comparison
The aforementioned auto-tuning approaches implement different techniques to find
the best possible parameter configuration of a program. The systems mainly differ in
their assumptions about the application domains and in the targeted software abstrac-
tion levels. While AEOS/ATLAS focuses on optimizing linear algebra operations,
Active Harmony supports optimized libraries. Atune provides tuning techniques that
specifically target large parallel applications to enable optimization on a software
architecture level. Furthermore, Atune is not restricted to a particular problem domain.
An enhanced version of AEOS/ATLAS [5] as well as Atune reduce the search space
prior to the actual search. While AEOS/ATLAS uses domain-specific models to pre-
define parameters of the algebraic library, Atune gathers context information about
the program structure using the tuning instrumentation language Atune-IL.
patterns are used for tuning. A good starting point might be pattern-based perfor-
mance diagnosis [14]. Another promising direction is to improve the interaction
between compilers and auto-tuners and implicitly introduce auto-tuning [11]. Yet
another largely neglected area is simultaneous tuning of several parallel programs
[10]: While most of the current research concentrates on optimizing program perfor-
mance for just one program in isolation, in practice the performance is influenced
by other processes competing for available resources on the same machine. A global
auto-tuner working at the level of the operating system is key to finding optimum
performance in such complex situations.
References
1. E. César, J.G. Mesa, J. Sorribes, and E. Luque. Modeling master/worker appli-
cations in POETRIES. In Proceedings of the ninth International Workshop on
High-Level Parallel Programming Models and Supportive Environments, Santa
Fe, NM, pp. 22–30, April 2004.
2. E. César, A. Moreno, J. Sorribes, and E. Luque. Modeling master/worker applica-
tions for automatic performance tuning. Parallel Computing, 32(7–8):568–589,
2006. Algorithmic Skeletons.
3. E. César, J. Sorribes, and E. Luque. Modeling Pipeline Applications in POET-
RIES. In Proceedings of the Eighth International Euro-Par Conference on Par-
allel Processing, Lisbon, Portugal, pp. 83–92, 2005.
4. S. Donadio, J. Brodman, T. Roeder, K. Yotov, D. Barthou, A. Cohen,
M. Garzarán, D. Padua, and K. Pingali. A language for the compact represen-
tation of multiple program versions. In Proceedings of the 18th International
Workshop on Languages and Compilers for Parallel Computing, vol. 4339 of
LNCS, Montreal, Quebec, Canada, pp. 136–151, 2006.
5. A. Epshteyn, M. Jesús Garzarán, G. DeJong, D. Padua, G. Ren, X. Li, K. Yotov,
and K. Pingali. Analytic models and empirical search: A hybrid approach to code
optimization. In Proceedings of the Workshop on Languages and Compilers for
Parallel Computing, vol. 4339/2006, New Orleans, LA, pp. 259–273, 2006.
6. L.J. Fogel, A.J. Owens, and M.J. Walsh. Artificial Intelligence through Simulated
Evolution. John Wiley & Sons, New York, 1966.
7. M. Frigo and S.G. Johnson. FFTW: An adaptive software architecture for the
FFT. In Proceedings of the International Conference on Acoustics, Speech and
Signal Processing, vol. 3, Seattle, WA, pp. 1381–1384, May 1998.
8. E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements
of Reusable Object-Oriented Software. Addison-Wesley Professional, Boston,
MA, 1995.
22. V. Pankratius, C.A. Schaefer, A. Jannesari, and W.F. Tichy. Software engi-
neering for multicore systems: An experience report. In Proceedings of the
International Workshop on Multicore Software Engineering, Leipzig, Germany,
pp. 53–60, 2008. ACM, New York.
23. M. Püschel, J.M.F. Moura, J.R. Johnson, D. Padua, M.M. Veloso, B.W. Singer,
J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R.W. Johnson, and
N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the
IEEE, 93(2):232–275, February 2005.
24. C.A. Schaefer. Reducing search space of auto-tuners using parallel patterns. In
Proceedings of the 2nd ICSE Workshop on Multicore Software Engineering,
Vancouver, British Columbia, Canada, pp. 17–24, 2009. IEEE Computer Society,
Washington, DC.
25. C.A. Schaefer, V. Pankratius, and W.F. Tichy. Atune-IL: An instrumentation lan-
guage for auto-tuning parallel applications. In Proceedings of the 15th Interna-
tional Euro-Par Conference on Parallel Processing, vol. 5704/2009 of LNCS,
Delft, the Netherlands, pp. 9–20. Springer Berlin/Heidelberg, Germany, January
2009.
26. C.A. Schaefer, V. Pankratius, and W.F. Tichy. Engineering parallel applications
with tunable architectures. In Proceedings of the 32nd ACM/IEEE International
Conference on Software Engineering, ICSE ’10, vol. 1, Cape Town, South Africa,
pp. 405–414, 2010. ACM, New York.
27. V. Tabatabaee, A. Tiwari, and J.K. Hollingsworth. Parallel parameter tuning
for applications with performance variability. In Proceedings of the ACM/IEEE
Supercomputing Conference, Heidelberg, Germany, pp. 57–57, November 2005.
28. C. Tapus, I-H. Chung, and J.K. Hollingsworth. Active Harmony: Towards auto-
mated performance tuning. In Proceedings of the ACM/IEEE Supercomputing
Conference, New York, November 2002.
29. R.C. Whaley, A. Petitet, and J.J. Dongarra. Automated empirical optimizations
of software and the ATLAS project. Parallel Computing, 27:3–35,
January 2001.
30. Q. Yi, K. Seymour, H. You, R. Vuduc, and D. Quinlan. POET: Parameterized
optimizations for empirical tuning. In Proceedings of International Parallel and
Distributed Processing Symposium, Long Beach, CA, pp. 1–8, March 2007.
31. C.A. Schaefer. Automatische Performanzoptimierung Paralleler Architekturen (PhD thesis, in German). Karlsruhe Institute of Technology, 2010.
Chapter 11
Transactional Memory
Tim Harris
11.1 Introduction
Many of the challenges in building shared-memory data structures stem from need-
ing to update several memory locations at once—e.g., updating four pointers to insert
an item into a doubly linked list. Transactional memory (TM) provides a mechanism
for grouping together this kind of series of operations, with the effect that either all
of them appear to execute, or none of them does. As with database transactions, TM
lets the programmer focus on the changes that need to be made to data, rather than
dealing with the low-level details of exactly which locks to acquire, or how to prevent
deadlock.
In this chapter, we look at how TM can be used by a programmer, at the ways it can
be implemented in hardware (HTM) and software (STM), and at how higher-level
constructs can be built over it in a programming language. As a running example, we use a double-ended queue (deque) built over a doubly linked list with sentinel nodes (Figure 11.1).
FIGURE 11.1: A double-ended queue built over a doubly linked list with sentinel nodes.
class Q {
    QElem leftSentinel;
    QElem rightSentinel;
A simple implementation might use one lock for each end of the deque. This works well when the deque holds many elements, but what about the case when the deque is almost empty? It is difficult to avoid deadlock when a thread needs to acquire both locks.
In this particular example, for a deque, several good scalable solutions are known.
However, designing deques is still the subject of research papers, and it is difficult
and time-consuming to produce a good design. TM aims to provide an alternative
abstraction for building this kind of shared-memory data structure, rather than requir-
ing ingenious design work to be repeated for each new case.
Figure 11.3 shows how the pushLeft operation could be implemented using
Herlihy and Moss’ HTM (we look at this HTM in detail in Section 11.3). The code is
essentially the same as the sequential version, except that the related operations are
grouped together in the do...while loop, and the individual reads and writes to
shared memory are made using two new operations: LT (“load transactional”) and
ST (“store transactional”).
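As a hedged sketch (not the book's exact listing of Figure 11.3), pushLeft over these primitives might look as follows; the QElem layout, the allocation helper newQElem, and the C-level spellings of LT, ST, and COMMIT are illustrative assumptions.

typedef struct QElem { struct QElem *left, *right; int val; } QElem;

void pushLeft(QElem *leftSentinel, int val) {
  QElem *e = newQElem(val);                            /* assumed non-transactional allocator */
  do {
    QElem *first = (QElem *)LT(&leftSentinel->right);  /* current leftmost element */
    ST(&e->left,  leftSentinel);                       /* link e between sentinel and first */
    ST(&e->right, first);
    ST(&first->left, e);
    ST(&leftSentinel->right, e);
  } while (!COMMIT());                                 /* on conflict, discard and retry */
}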
This looping structure is typical when using TM; all of the implementations we
describe rely on optimistic concurrency control, meaning that a thread speculatively
attempts a transaction, checking whether or not its speculative work conflicts with
any concurrent transactions. At the end of a transaction, the speculative work can be
made permanent if no conflicts occurred. Conversely, if there was any conflict, then
the speculative work must be discarded. TM works well in cases where speculation
pays off—i.e., where conflicts occur rarely in practice.
In the next section, we describe the general design choices that apply to different
TMs. Then, in Section 11.3, we focus on HTM, examining Herlihy and Moss’ HTM,
and Moore et al.’s (2006) LogTM. Both these HTM systems modify the processor
so that it tracks the speculative work that transactions do, and the designs rely on the
processor to detect conflicts between transactions. In Section 11.4, we turn to STM,
and show how data management and conflict detection can be built without requiring
any changes to current hardware.
Programming directly over HTM or STM can be cumbersome; there is little porta-
bility between different implementations, and there can be a lot of boilerplate (e.g.,
the do...while loop and explicit LT and ST operations in Figure 11.3).
Furthermore—and more seriously from a programming point of view—this kind of
basic TM system provides atomic updates, but does not provide any kind of condi-
tion synchronization. This is a bit like a language providing locks but not providing
condition variables: there is no way for a thread using TM to block while working
on a shared data structure (e.g., for a popLeft operation on a deque to wait until an
element is available to be removed).
Section 11.5 shows how these concerns can be tackled by providing atomic
blocks in a programming language. We show how atomic blocks can be built over
TM, and how the semantics of atomic blocks can be defined in a way that programs
are portable from one implementation to another. Also, we show how atomic blocks
can be coupled with constructs for condition synchronization.
Finally, in Section 11.6, we examine the performance of the Bartok-STM system,
and the ways in which it is exercised by different transactional benchmarks.
Conversely, with lazy detection, the presence of a conflict might only be checked
periodically—or it might only be detected at the end of the transaction when it attempts
to commit.
The performance trade-offs between these approaches are more complicated than
between lazy/eager version management. Under low contention, the performance of
the conflict detection mechanism itself is more important than the question of whether
it detects conflicts eagerly or lazily (in Section 11.4, we shall see examples of how this
can influence the design of STM systems, where decisions have often been motivated
by performance under low contention).
Under high contention, the way that conflicts are handled is of crucial importance,
and most simple strategies will perform poorly for some workloads. Lazy conflict
detection can cause threads to waste time, performing work that will eventually be
aborted due to conflicts. However, eager conflict detection may perform no better: a
transaction running in Thread 1 might detect a conflict with a transaction running in
Thread 2 and, in response to the conflict, Thread 1 might either delay its transaction
or abort it altogether. This kind of problem can be particularly pernicious if a conflict
is signaled between two running transactions (rather than just between one running
transaction and a concurrent transaction that has already committed). For instance,
after forcing Thread 1’s transaction to abort, Thread 2’s transaction might itself be
aborted because of a conflict with a third thread—or even with a new transaction in
Thread 1. Without care, the threads can live-lock, with a set of transactions continu-
ally aborting one another.
TM implementations handle high contention in a number of ways. Some provide
strong progress properties, for instance guaranteeing that the oldest running transac-
tion can never be aborted, or guaranteeing that one transaction can only be aborted
because of a conflict with a committed transaction, rather than because of a conflict
with another transaction’s speculative work (this avoids live-lock because one trans-
action can only be forced to abort because another transaction has succeeded). Other
TMs use dynamic mechanisms to reduce contention, e.g., adding back-off delays, or
reducing the number of threads in the system.
11.2.3 Semantics
Before introducing TM implementation techniques in Sections 11.3 and 11.4, we
should consider exactly what semantics are provided to programs using TM. Current
prototype implementations vary a great deal in how they answer the following questions.
Access granularity: Does the granularity of transactional accesses correspond
exactly to the size of the data in question, or can it spill over if the TM manages data
at a coarser granularity (say, whole cache lines in HTM, or whole objects in STM)?
This is important if adjacent data is being accessed non-transactionally—e.g., with-
out precise access granularity, a TM using eager version management may roll back
updates to adjacent non-transactional data.
Consistent view of memory: Does a running transaction see a consistent view of
memory at all times during its execution? For example, suppose that we run the fol-
lowing transactions, with shared variables v1 and v2 both initially 0.
// Transaction 1:
temp1 = LT(&v1);
temp2 = LT(&v2);
if (temp1 != temp2) while(1) {}
COMMIT();

// Transaction 2:
ST(&v1,1);
ST(&v2,1);
COMMIT();
These two transactions conflict with one another, and so they cannot both be
allowed to commit if they run in parallel. However, what if Transaction 2 runs in its
entirety in between the two LT reads performed by Transaction 1? Some TMs guar-
antee that Transaction 1 will nevertheless see a consistent pair of values in temp1
and temp2 (either both 0, or both 1). Other TMs do not guarantee this and allow
temp1 != temp2, so Transaction 1 may loop. This notion has been formalized
as the idea of “opacity”. When using TMs without opacity, programs must either be
written defensively (so that they do not enter loops, or have other non-transactional
side effects), or explicit validation must be used to check whether or not the reads
have been consistent. Transactions that have experienced a conflict but not yet rolled
back are sometimes called “zombies” or “orphans”.
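As a hedged illustration of the defensive style, Transaction 1 could be rewritten to validate explicitly before acting on the values it has read. The exact return convention of VALIDATE varies between implementations, so this is a sketch rather than a definitive API; the retry after an ABORT would be driven by an enclosing loop as in Figure 11.3.

temp1 = LT(&v1);
temp2 = LT(&v2);
if (!VALIDATE()) {
  ABORT();                           /* reads may be inconsistent: discard and retry */
} else {
  if (temp1 != temp2) while (1) {}   /* only reachable with a consistent pair of values */
  COMMIT();
}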
Strong atomicity: What happens if a mix of transactional and non-transactional
accesses is made to the same location? Some TMs provide “strong atomicity”, mean-
ing that the non-transactional accesses will behave like short single-access
transactions—e.g., a non-transactional write will signal a conflict with a concurrent
transaction, and a non-transactional read will only see updates from committed trans-
actions. Strong atomicity is typically provided by HTM implementations, but not by
STM.
As we show in Section 11.5, atomic blocks can be defined in a way that allows
portable software to be implemented over a wide range of different TMs, making
these questions a concern for the language implementer (building atomic blocks
over TM), rather than the end programmer (writing software using atomic blocks).
cache. The protocol used between caches remains largely unchanged, and committing
or aborting a transaction can be done locally in the cache without extra communica-
tion with other processors or with main memory.
As Figure 11.3 shows, with Herlihy and Moss’ HTM, distinct operations are used
for transactional accesses: LT(p1) performs a transactional load from address p1,
and ST(p2,x2) performs a transactional store of x2 to address p2. As an optimiza-
tion, an additional LTX operation is available; this provides a hint that the transaction
may subsequently update the location, and so the implementation should fetch the
cache line in exclusive mode. A transaction is started implicitly by a LT/LTX/ST,
and it runs until a COMMIT or an ABORT.
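For example, a read-modify-write of a shared counter might use the LTX hint so that the line is fetched in exclusive mode up front; the counter variable and the C-level spellings of the primitives are illustrative assumptions.

static int counter;

void increment_counter(void) {
  int v;
  do {
    v = LTX(&counter);     /* transactional load, hinting that an update will follow */
    ST(&counter, v + 1);   /* transactional store of the new value */
  } while (!COMMIT());     /* on conflict, discard the speculative work and retry */
}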
Within a transaction, any ordinary non-transactional memory accesses proceed
as normal; the programmer is responsible for cleaning up the effects of any non-
transactional operations that they do, and for making sure that non-transactional
operations attempted by zombies are safe. The programmer is also responsible for
invoking COMMIT at the end of the transaction (hence the while loop), and possi-
bly adding an explicit delay to reduce contention.
Figure 11.4 sketches the implementation. The processor core is extended with
two additional status flags. Tx-active indicates whether or not a transaction is in
progress. If there is an active transaction, then Tx-status indicates whether or not
it has experienced a conflict; although conflicts are detected eagerly by the hardware
in Herlihy and Moss’ design, they are not signaled to the thread until a VALIDATE
or COMMIT.
The normal data cache is extended with a separate transactional cache that buffers
the locations that have been accessed (read or written) by the current transaction.
The transactional cache is a small, fully associative structure. This means that the
maximum size of a transaction is only bounded by the size of the cache (rather than
being dependent on whether or not the data being accessed maps to different cache
sets).
Unlike the normal cache, the transactional cache can hold two copies of the same
memory location. These represent the version that would be current if the running
transaction commits, and the version that would be current if the running transaction
aborts. The values held in any lower levels of the cache, and in main memory, all
reflect the old value, and so this is a lazy versioning system. The two views of memory
are distinguished by the Tx-state field in their transactional cache lines—there are
four states: XCOMMIT (discard on commit), XABORT (discard on abort), NORMAL
(data that has been accessed by a previously committed transaction, but is still in the
cache), and EMPTY (unused slots in the transactional cache).
Conflicts can be detected by an extension of the conventional MESI cache protocol:
a remote write to a line in shared mode triggers an abort, as does any remote access
to a line in modified mode. In addition, a transaction is aborted if the transactional
cache overflows, or if an interrupt is delivered.
Committing a transaction simply means changing the XCOMMIT entries to EMPTY,
and changing XABORT entries to NORMAL. Correspondingly, aborting a transaction
means changing the XCOMMIT entries to NORMAL, and changing XABORT entries
to EMPTY. The use of XCOMMIT and XABORT entries can improve performance
when there is temporal locality between the accesses of successive transactions—
data being manipulated can remain in the cache, rather than needing to be flushed
back to memory on commit and fetched again for the next transaction.
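The commit and abort transitions on each transactional-cache line can be summarized as follows; this is an illustrative software rendering of what is, in the real design, a hardware mechanism.

typedef enum { EMPTY, NORMAL, XCOMMIT, XABORT } TxLineState;

TxLineState on_commit(TxLineState s) {
  if (s == XCOMMIT) return EMPTY;    /* pre-transaction copy no longer needed */
  if (s == XABORT)  return NORMAL;   /* tentative copy becomes the current data */
  return s;
}

TxLineState on_abort(TxLineState s) {
  if (s == XCOMMIT) return NORMAL;   /* pre-transaction copy restored as current */
  if (s == XABORT)  return EMPTY;    /* tentative copy discarded */
  return s;
}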
11.3.2 LogTM
LogTM is an example of an HTM system built on eager version management and
eager conflict detection. Figure 11.5 illustrates the hardware structures that it uses.
Unlike Herlihy and Moss’ HTM, it allows speculative writes to propagate all the way
out to main memory before a transaction commits: the cache holds the new values
tentatively written by the transaction, while per-thread in-memory undo-logs record
the values that are overwritten.
The processor core is extended to record the current base of the log and an offset
within it. A single-entry micro-TLB provides virtual-to-physical translation for the
current log pointer. This design allows the undo log entries to be cached in the normal
way, and only evicted from the cache to main memory if necessary.
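The effect of eager version management with an undo log can be sketched in software as follows; in LogTM the logging happens in hardware on the first transactional write to a block, so the function names and the fixed-size log are purely illustrative.

#define LOG_MAX 1024

typedef struct { int *addr; int old; } UndoEntry;
static UndoEntry undo_log[LOG_MAX];
static int log_top;

void tx_store(int *addr, int val) {
  undo_log[log_top].addr = addr;     /* record the value being overwritten... */
  undo_log[log_top].old  = *addr;
  log_top++;
  *addr = val;                       /* ...then update memory in place */
}

void tx_abort(void) {
  while (log_top > 0) {              /* roll back in reverse order */
    log_top--;
    *undo_log[log_top].addr = undo_log[log_top].old;
  }
}

void tx_commit(void) {
  log_top = 0;                       /* committing just discards the log */
}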
LogTM allows a transaction to be aborted immediately upon conflict, rather than
continuing to run as a zombie. This is done by augmenting each thread’s cache with
additional metadata to detect conflicts between transactions, introducing R/W bits on
each cache block. The R bit is set when a transaction reads from data in the block, and
the W bit is set when a transaction writes to data in the block. A conflict is detected
when the cache receives a request from another processor for access to a block in a
mode that is incompatible with the current R/W bits—e.g., if a processor receives a
request for write access to a line that it holds with the W bit set. When such a conflict
is detected, either the requesting processor is stalled by sending a form of NACK mes-
sage (eventually backing off to a software handler to avoid deadlock), or the transac-
tion holding the data can be aborted before granting access to the requesting processor
(in the case of a transaction updating the line, the update must be rolled back so that
FIGURE 11.5: LogTM hardware overview. The data cache tracks read/write bits
for each block, and buffers the new values being written by a transaction. The old
values are held in an in-memory undo log so that they can be restored on abort.
the requester is provided with the correct contents). A transaction commits by flash
clearing all the R/W bits and discarding the undo log.
In a directory-based implementation, LogTM can support transactions that over-
flow a processor’s local cache by setting a special “Sticky-M” bit in the cache direc-
tory entry for a line that has been evicted by a transaction with its W bit set. The
directory records which processor had accessed the line, and the protocol prevents
conflicting access to the line until the transaction has committed or aborted.
LogTM-SE (“Signature Edition”) provides decoupling between the structures used
by the HTM and the L1 cache. Instead of detecting conflicts using per-block R/W
bits, LogTM-SE uses read/write signatures that summarize the locations that have
been read/written by each transaction. This avoids adding complexity to the cache,
and it allows the cache to be shared between hardware threads without needing sep-
arate R/W bits for each thread. Signatures can be saved and restored by the operating
system, allowing transactions to be preempted or rescheduled onto different cores.
Furthermore, the operating system can maintain a summary signature representing
the accesses of all currently descheduled threads, allowing conflict detection to be per-
formed with them.
11.4.1 Bartok-STM
Bartok-STM is an example of an STM that uses eager version management. It
employs a hybrid form of conflict detection, with lazy conflict detection for reads,
but eager conflict detection for writes. This follows from the use of eager version
management: updates are made in-place, and so only one transaction can be granted
write access to a location at any one time.
With Bartok-STM, there is no guarantee that a zombie transaction will have seen a
consistent view of memory. The combination of this with eager version management
means that the STM API must be used with care to ensure that a zombie does not
write to locations that should not be accessed transactionally.
Bartok-STM maintains per-object metadata which is used for concurrency con-
trol between transactions. Each object’s header holds a transactional metadata word
(TMW) which combines a lock and a version number. The version number shows
how many transactions have written to the object. The lock shows if a transaction has
been granted write access to the object. A per-transaction descriptor records the cur-
rent status of the transaction (ACTIVE, COMMITTED, ABORTED). The descrip-
tor also holds an open-for-read log recording the objects that the transaction has read
from, an open-for-update log recording the objects that the transaction has locked for
updating, and an undo log of the updates that need to be rolled back if the transaction
aborts.
The implementation of writes is relatively straightforward in Bartok-STM. The
transaction must record the object and its current TMW value in the open-for-update
log, and then attempt to lock the TMW by replacing it with a pointer to the thread’s
transaction descriptor. Once the transaction has locked the TMW, it has exclusive
write access to the object.
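A hedged sketch of such an open-for-update barrier is shown below. Bartok-STM itself is a C# system, so the C rendering, the Word/Object/TxDescriptor types, and the helpers is_descriptor, log_update, and CAS are illustrative assumptions; on a failed CAS the caller would resolve the contention, for example by retrying or aborting.

typedef unsigned long Word;
typedef struct { volatile Word tmw; } Object;   /* TMW: version number or descriptor pointer */
typedef struct TxDescriptor TxDescriptor;

int open_for_update(Object *obj, TxDescriptor *tx) {
  Word old = obj->tmw;
  if (is_descriptor(old))                 /* already locked... */
    return old == (Word)tx;               /* ...acceptable only if locked by this transaction */
  log_update(tx, obj, old);               /* record object and current TMW in the update log */
  return CAS(&obj->tmw, old, (Word)tx);   /* lock by installing the descriptor pointer */
}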
The treatment of reads requires greater care. This is because Bartok-STM uses
“invisible reads” in which the presence of a transaction reading from a location is
only recorded in the transaction’s descriptor—the reader is invisible to other threads.
Invisible reading helps scalability by allowing the cache lines holding the TMW, and
the lines holding the data being read, to all remain in shared mode in multiple pro-
cessors’ caches. The key to supporting invisible reads is to record information about
the objects a transaction accesses, and then to check, when a transaction tries to com-
mit, that there have been no conflicting updates to those objects. In Bartok-STM, this
means that a read proceeds by (1) mapping the address being accessed to the TMW,
and (2) recording the address of the TMW and its current value in the open-for-read
log, before (3) performing the actual access itself. The decision to keep these steps
simple, rather than actually checking for conflicts on every read, is an attempt to
accelerate transactions in low-contention workloads.
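The corresponding invisible-read barrier only writes to the transaction's private log, so no shared cache line is modified; the names below are the same illustrative ones used in the previous sketch.

void open_for_read(Object *obj, TxDescriptor *tx) {
  log_read(tx, obj, obj->tmw);   /* steps (1)-(2): map to the TMW and record its current value */
  /* step (3): the caller then performs the actual field access directly */
}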
When a transaction tries to commit, the entries in the open-for-read log must be
checked for conflicts. Figure 11.7 illustrates the different cases. Figure 11.7a–c is the
conflict-free case. In Figure 11.7a, the transaction read from an object and, at commit-
time, the version number is unchanged. In the example, the version number was 100
in both cases. In Figure 11.7b, the transaction read from the object, and then the object
was subsequently updated by the same transaction. Consequently, the transaction’s
open-for-update log and open-for-read logs record the same version number (100),
and the TMW refers to the transaction’s descriptor (showing that the object is locked
for writing by the current transaction). In Figure 11.7c, the transaction opened the
object for updating, and then subsequently opened it for reading: the TMW refers to
the transaction descriptor, and the open-for-read log entry also refers to the transaction
descriptor (showing that the object had already been written to by this transaction
before it was first read).
The remaining cases indicate that a conflict has occurred. In Figure 11.7d, the
object was read with version number 100 but, by commit-time, another transaction
FIGURE 11.7: Read validation in Bartok-STM. In each case, the transaction has read from data at address 500. In cases (a)–(c), the transaction has not experienced a conflict. In cases (d)–(g), it has. The object itself is shown on the left, and the transaction's logs are shown on the right.
FIGURE 11.8: The commit sequence in Bartok-STM: the commit starts, the read-set is validated, the objects in the write-set are closed, and the commit finishes; the objects read are confirmed as unchanged for this duration.
has made an update and increased the version number to 101. In Figure 11.7e, the
object was already open for update by a different transaction at the point when it
was opened for reading: the open-for-read log entry recorded the other transaction’s
descriptor. In Figure 11.7f, the object was opened for update by another transac-
tion concurrent with the current one: the object’s TMW refers to the other transac-
tion’s descriptor. Finally, Figure 11.7g is the case where one transaction opened the
object for reading, and later opened the same object for writing, but, in between these
steps, a different transaction updated the object—the version number 100 in the open-
for-read log does not match the version number 101 in the open-for-update log.
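Putting the cases together, commit-time validation can be sketched as follows; the log layout and the helper update_log_version (returning the version recorded when this transaction opened the object for update) are illustrative assumptions, in the same C rendering as the earlier sketches.

int validate_reads(TxDescriptor *tx) {
  for (int i = 0; i < tx->read_log_count; i++) {
    Object *obj  = tx->read_log[i].obj;
    Word    seen = tx->read_log[i].tmw;                  /* recorded at open-for-read time */
    Word    now  = obj->tmw;
    if (is_descriptor(seen)) {
      if (seen == (Word)tx && now == (Word)tx) continue; /* case (c) */
      return 0;                                          /* case (e): locked by another tx */
    }
    if (now == seen) continue;                           /* case (a): version unchanged */
    if (now == (Word)tx && update_log_version(tx, obj) == seen)
      continue;                                          /* case (b): locked by us, same version */
    return 0;                                            /* cases (d), (f), (g): conflict */
  }
  return 1;
}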
Given that the work of validation involves a series of memory accesses, how do we
know that the transaction as a whole still has the appearance of taking place atomi-
cally? The correctness argument is based on identifying a point during the transac-
tion’s execution where (1) all of the locations read must have the values that were
seen in them and (2) the thread running the transaction has exclusive access to all of
the data that it has written. Such a “linearization point” is then an instant, during the
transaction’s execution, where it appears to occur atomically.
Figure 11.8 summarizes this argument: with Bartok-STM, the linearization point
of a successful transaction is the start of the execution of the commit operation. At this
point, we already know that the transaction has exclusive access to the data that it has
written (it has acquired all of the locks needed, and not yet released any). Furthermore,
for each location that it has read, it recorded the TMW’s version number before the
read occurred, and it will confirm that this version number is still up-to-date during
the subsequent validation work.
11.4.2 TL2
The second STM system which we examine is TL2. This system takes a number of
different design decisions to Bartok-STM: TL2 uses lazy version management, rather
than maintaining an undo log, and TL2 uses lazy conflict detection for writes as well
as reads. Furthermore, TL2 enables a transaction to see a consistent view of memory
The accompanying figure shows the TL2 commit sequence: after the TxStart, TxRead, and TxWrite operations, the commit starts by locking the write set, then increments the global clock, validates the read-set, and finally writes back the updates and unlocks the write set; the transaction has exclusive access to its write set for this duration.
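A hedged sketch of that commit sequence, following the published TL2 algorithm (Dice et al. 2006), might look as follows; the data-structure layout and helper names are illustrative, and tx->rv denotes the global-clock value sampled when the transaction started.

int tl2_commit(Tx *tx) {
  if (!lock_write_set(tx))                     /* acquire versioned locks for all writes */
    return 0;                                  /* could not get exclusive access: abort */
  long wv = atomic_increment(&global_clock);   /* obtain the new write version */
  if (!validate_read_set(tx, tx->rv)) {        /* each read: unlocked and version <= rv */
    unlock_write_set_restore(tx);              /* release locks, keeping the old versions */
    return 0;                                  /* abort */
  }
  write_back(tx);                              /* copy buffered (lazy) updates into place */
  unlock_write_set(tx, wv);                    /* release locks, stamping version wv */
  return 1;
}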
to execute directly as a hardware transaction (with reads and writes replaced with
transactional reads and writes, if the particular HTM interface requires this).
An implementation of atomic blocks must take care to handle object allocations.
One approach is to view the actual allocation work to be part of a transaction imple-
menting the atomic block, meaning that the allocation is undone if the transaction
is rolled back. However, this can introduce false contention (e.g., if allocations of
large objects are done from a common pool of memory, rather than per-thread pools).
An alternative approach is to integrate the language’s memory allocator with the TM
implementation, tracking tentative allocations in a separate log, and discarding them
if a transaction rolls back.
In a system using a garbage collector, tentative allocations can simply be discarded
if a transaction rolls back: the GC will reclaim the memory as part of its normal work.
However, the GC implementation must be integrated with the TM—e.g., scanning
the TM’s logs during collection, or aborting transactions that are running when the
collector starts.
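The fragment below shows Thread 2's side of the privatization idiom of Figure 11.11; in the idiom's usual formulation, Thread 1 runs something like the following sketch, in which o1.isShared guards whether o2 may still be accessed transactionally (the guard structure and the literal value are assumptions, not the book's exact listing).

// Thread 1 (sketch)
atomic {
  if (o1.isShared) {
    o2.val = 200;   // illustrative transactional update of the shared data
  }
}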
// Thread 2
atomic {
o1.isShared = false;
}
o2.val = 100;
With Bartok-STM, Thread 1 could update o2.val from a zombie transaction, even
after Thread 2’s transaction has finished. Conversely, idioms that rely on some par-
ticular quirks of a given STM may not work with HTM.
Arguably, a cleaner approach is to define the semantics of atomic blocks indepen-
dently of the notion of transactions. For example, the idea of “single lock atomicity”
(SLA) requires that the behavior of atomic blocks be the same as that of critical
sections that acquire and release a single process-wide lock. With this model, the pri-
vatization idiom is guaranteed to work (because it would work with a single lock).
Many of the examples that only “work” with STM will involve a data race when
implemented with a single global lock, and so their behavior would be undefined in
many programming languages.
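A minimal way to picture SLA is the following equivalence; the lock name is illustrative.

// Under single lock atomicity, an atomic block such as
atomic {
  // ...body...
}
// must be indistinguishable from a critical section on one process-wide lock:
pthread_mutex_lock(&the_single_global_lock);
// ...body...
pthread_mutex_unlock(&the_single_global_lock);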
To support atomic blocks with SLA, it is necessary that granularity problems do
not occur, that the effects of zombie transactions are not visible to non-transacted
code, and that ordering between transactions (say, Tx1 is serialized before Tx2)
ensures ordering of the surrounding code (code that ran before Tx1 in one thread
must run before code after Tx2 in another thread).
An alternative approach to SLA is to define the semantics of atomic blocks inde-
pendently from TM and from existing constructs. This may require additional work
when designing a language (or when learning it), but is more readily extendable to
include additional constructs such as operations for condition synchronization. Fol-
lowing this approach, typical definitions require a “strong semantics” in which an
atomic block in one thread appears to execute without any interleaving of work
from other threads—i.e., no other work at all, neither other atomic blocks, nor nor-
mal code outside atomic blocks. The “appears to” is important, of course, because
the definition is not saying that atomic blocks actually run serially, merely that the
program will behave as if they do so.
If it can be implemented, then such strong semantics would mean that the program-
mer does not need to consider the details of particular TM implementations—or indeed
the question of whether atomic blocks are implemented optimistically using TM,
or via some kind of automated lock inference.
The basic implementation we have sketched clearly does not provide strong seman-
tics to all programs, because it does not implement examples like the privatization
idiom when built over the STM systems from Section 11.4. Furthermore, it is not
even the case that an implementation built over HTM with strong atomicity would run
all programs with strong semantics because of the effect of program transformations
during compilation, or execution on a processor with a relaxed memory model. This
dilemma can be reconciled by saying that only “correctly synchronized” programs
need to execute with strong semantics; this is much the same as when programming
with locks, where only data-race-free programs are typically required to execute with
sequential consistency.
With atomic blocks, there is a trade-off between different notions of “correct”
synchronization, and the flexibility provided to use a wide range of TM implemen-
tations. At one extreme, STM-Haskell enforces a form of static separation in which
transactional data and non-transactional data are kept completely distinct; the type
system checks this statically, and all well-typed STM-Haskell programs are correctly
synchronized. This provides a lot of flexibility to the language implementer, but the
simple type system can make code reuse difficult, and require explicit marshalling
between transactional and normal data structures. For example, the privatization idiom
is not well-typed under static separation.
An alternative notion of correct synchronization is to support transactional data-
race-free (TDRF) programs. Informally, a program is TDRF if, under the strong
semantics, there are no ordinary data races, and there are no conflicts between accesses
from normal code and code inside atomic blocks. This is modeled on the conven-
tional definition of data-race freedom from programming language memory models.
The privatization idiom from Figure 11.11 is TDRF.
Given the need to consider notions of correct synchronization in defining the seman-
tics of atomic blocks, is it actually fair to say that they provide an easier program-
ming model than using explicit locks? That is a question that must ultimately be
tested experimentally, but intuitively, even if the programming model is a form of
SLA, programming with a single lock is simpler than programming with a set of
locks; the question of exactly which lock to hold becomes the question of whether or
not to hold the single lock.
int popLeft() {
atomic {
if (this.leftSentinel.right == this.rightSentinel) retry;
...
}}
Unlike when programming with locks and condition variables, it is not necessary
to identify what condition popLeft is dependent on, or what condition will cause
it to succeed in the future. This avoids lost-wake-up problems where a variable is
updated, but threads waiting on conditions associated with the variable are not sig-
naled. With STM-Haskell, the thread running the atomic block waits until an update
is committed to any of the locations that the atomic block has read.
This form of blocking is composable in the sense that an atomic block may call
a series of operations that might wait internally, and the atomic block as a whole
will only execute when all of these operations can succeed. For example, to take two
items:
atomic {
x1 = q.popLeft();
x2 = q.popLeft();
}
The combined atomic block can only complete when both popLeft calls return
items, and atomicity requires that the two items be consecutive. Of course, the same
structure could be used for other operations—taking more than two items, or taking
elements that meet some other requirement (say, taking a different number of items
dependent on the value of x1). All these alternatives can be built without changing
the underlying deque.
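For instance, the variant that takes a different number of items depending on x1 might look like the following sketch; threshold is an illustrative parameter.

atomic {
  x1 = q.popLeft();
  if (x1 > threshold) {
    x2 = q.popLeft();   // take a second item only for "large" first items
  }
}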
Having said that, retry must be used with care—the programmer must ensure
that it will actually be possible for the operations to execute together; it is no good
enclosing two popLeft calls in an atomic block if the underlying buffer is built
over a single storage cell. Similarly, sets of possibly blocking operations that involve
communication with other threads cannot generally be performed atomically—e.g.,
updating one shared buffer with a request to a server thread, and then waiting for a
response to arrive. The server’s work must happen in between the request and the
response, so the request–response pair cannot happen atomically.
An orelse construct provides a way to try a piece of code and to catch any
attempt it makes to block, e.g., to turn popLeft into an operation that returns a
failure code instead of waiting:
int popLeftNoWait() {
atomic {
return popLeft();
} orelse {
return -1;
}}
The semantics of orelse is that either (1) the first branch executes as normal,
or (2) the first branch reaches retry, in which case the second branch executes
in its place (i.e., any putative updates from the first branch are discarded). If both
branches call retry, then the retry propagates, as with an exception, out to an
enclosing orelse. Alternative constructs could be defined which provide a non-
deterministic choice between their branches, rather than the left-biased form pro-
vided by orelse—however, note that the left bias is essential for the usage in
popLeftNoWait.
11.6 Performance
The design and implementation of TM systems remains an active research topic,
and numerous prototype systems have been described in the literature, or are available
for experimental use.
As with any parallel algorithm, several different metrics are interesting when eval-
uating the performance of STM systems. If a system is to be useful in practice, then
the sequential overhead of using transactions must not be so great that a program-
mer is better off sticking with single-threaded code. For instance, if transactional
code runs 16× slower than single-threaded code, then a program would need to be
able to scale perfectly to 16 cores just to recoup this loss; contention in the memory
hierarchy, conflicts between transactions, uneven work distribution between threads,
and synchronization elsewhere in the language runtime system would make this kind
of scaling difficult to achieve. Sequential overhead is largely affected by the costs
introduced by the TM system; e.g., additional bookkeeping in an STM system, or
additional pressure on cache space in an HTM system. The sequential overhead of
an STM is highly dependent on the performance of the baseline language implemen-
tation, and on the engineering and tuning that has been employed in building the
STM itself; a poor language implementation can mask the high costs of a given STM
system.
In addition to sequential overhead, a programmer should be interested in the way in
which a particular TM implementation affects the scalability of their programs. Scal-
ability is a function of both the program’s workload (e.g., if the transactions involve
conflicts), and the internals of the TM system (e.g., whether or not the implementation
introduces synchronization between unrelated transactions).
In this section, we briefly examine the performance achieved by the Bartok-STM
system. The implementation operates as an ahead-of-time compiler from C# to native
FIGURE 11.13: Normalized execution time versus number of threads (1–8).
x86 code, employing standard program transformations to optimize the C# code, and
using the techniques from Section 11.5.2 to optimize the placement of STM opera-
tions within transactions. Other STM systems can provide lower sequential overhead
(e.g., NOrec), or better scalability (e.g., SkySTM), but we stick with Bartok-STM to
complete the description of a full system.
Our results look at a set of four benchmarks. Three of these were derived from
the STAMP 0.9.9 benchmark suite (Genome, Labyrinth, Vacation). We translated the
original C versions of these programs into C#, making each C struct into a C#
class. The fourth benchmark is a Delaunay triangulation algorithm implemented
following the description by Scott et al. (2007). In all four benchmarks, we added
atomic blocks to the source code, and the compiler automatically added calls to the
STM library for code within the atomic blocks, before automatically optimizing the
placement of the STM operations.
Figure 11.13 shows the results on a machine with two quad-core Intel Xeon
5300-series processors. All results show execution time for a fixed total amount of
work, normalized against execution of a sequential program running on a single
core. This means that the 1-thread STM results show the sequential overhead that
is incurred, and the point at which the STM curves cross 1.0 shows the number of
threads that are needed in order to recoup this overhead. For all of these workloads, the
STM-based implementation out-performs the sequential version when two threads
are used.
These workloads vary a great deal. Delaunay uses short transactions when stitching
together independently triangulated parts of the space: most of the program’s execu-
tion is during non-transactional phases in which threads work independently. As Scott
et al. (2007) observed, one way that TM can affect performance for this workload is
encapsulate the use of transactions for a data structure, using HTM when it is avail-
able, and otherwise falling back to a specialized software implementation of the data
structure. This is similar to the use of other architecture-specific techniques, say SSE4
instructions.
The argument for atomic blocks in a general-purpose language is different. The
aim there is to provide a higher-level abstraction for building concurrent data struc-
tures; in any particular case it is very likely that a specialized design can perform
better—but potentially losing the composability of different data structures built over
transactions, or the ability to control blocking via constructs like orelse. Of course,
for atomic blocks to be useful in practice, the performance must be sufficiently
good that the cost of using a general-purpose technique rather than a specialized one
is acceptable. Currently, the state of the art in STM is perhaps akin to garbage collec-
tion in the early 1990s; a number of broad design choices are known (eager version
management vs. lazy version management on the one hand, copying vs. mark-sweep
vs. reference counting on the other), but there is still work to be done in understanding
how to build a high-performance implementation and, in particular, one that performs
predictably across workloads.
An alternative, more incremental, approach is to use TM within the implementation
of existing language constructs. Forms of speculative lock elision have been studied
in hardware and in software; the idea is to allow multiple threads to execute critical
sections speculatively, allowing them to run in parallel so long as they access dis-
joint sets of memory locations. TM can form the basis of an implementation, and the
question of whether to use TM, or whether to use an actual lock, can be based on the
performance characteristics of the particular TM implementations that are available.
Harris et al. (2006) and Adl-Tabatabai et al. (2006) described the application of static analyses to optimize the placement of calls onto a TM library. Dice et al. (2006) designed the original TL2 algorithm.
Two examples of recent STM systems which illustrate different design choices are
NOrec (Dalessandro et al. 2010) and SkySTM (Lev et al. 2009). The NOrec STM sys-
tem provides an example of design choices taken to reduce the sequential overheads
incurred by an STM. It avoids the need to maintain any per-object or per-word meta-
data for conflict detection. Instead, transactions maintain a value-based log of their
tentative reads and writes, and use commit-time synchronization to check whether
or not these values are still up-to-date. The SkySTM system demonstrates a series
of design choices to provide scalability to multiprocessor CMP systems comprising
256 hardware threads. It aims to avoid synchronization on centralized metadata, and
employs specialized scalable nonzero indicator (SNZI) structures to maintain dis-
tributed implementations of parts of the STM system’s metadata.
Harris et al. (2005, 2006) introduced the retry and orelse constructs in GHC-
Haskell and provided an operational semantics for a core of the language. Moore
et al. (2006) and Abadi et al. (2008) provided formal semantics for languages includ-
ing atomic actions. They showed that the static separation programming discipline
allowed flexibility in TM implementation, and they showed that a simple type sys-
tem could be used to ensure that a program obeys static separation.
Shpeisman et al. (2007) provided a taxonomy of problems that occur when using
STM implementations that do not provide strong atomicity. Menon et al. (2008)
studied the implementation consequences of extending such an implementation to
support SLA.
References
Abadi, M., A. Birrell, T. Harris, and M. Isard, Semantics of transactional memory
and automatic mutual exclusion, Proceedings of the 35th Annual Symposium on
Principles of Programming Languages, POPL 2008, San Francisco, CA.
Adl-Tabatabai, A.-R., B.T. Lewis, V. Menon, B.R. Murphy, B. Saha, and
T. Shpeisman, Compiler and runtime support for efficient software transactional
memory, Proceedings of the 2006 Conference on Programming Language Design
and Implementation, PLDI 2006, Ottawa, Ontario, Canada.
Dalessandro, L., M.F. Spear, and M.L. Scott, NOrec: Streamlining STM by abolishing
ownership records, Proceedings of the 15th Symposium on Principles and Practice
of Parallel Programming, PPoPP 2010, Bangalore, India.
Dice, D., O. Shalev, and N. Shavit, Transactional locking II, Proceedings of the
20th International Symposium on Distributed Computing, DISC 2006, Stockholm,
Sweden.
Harris, T., J. Larus, and R. Rajwar, Transactional Memory, 2nd edn., Morgan &
Claypool Publishers, San Rafael, CA, 2010.
Harris, T., S. Marlow, S. Peyton Jones, and M. Herlihy, Composable memory trans-
actions, Proceedings of the 10th Symposium on Principles and Practice of Parallel
Programming, PPoPP 2005, Chicago, IL.
Harris, T., M. Plesko, A. Shinnar, and D. Tarditi, Optimizing memory transac-
tions, Proceedings of the 2006 Conference on Programming Language Design and
Implementation, PLDI 2006, Ottawa, Ontario, Canada.
Herlihy, M. and J.E.B. Moss, Transactional memory: Architectural support for lock-
free data structures, Proceedings of the 20th International Symposium on Computer
Architecture, ISCA 1993, San Diego, CA.
Herlihy, M., V. Luchangco, M. Moir, and W.N. Scherer III, Software transactional
memory for dynamic-sized data structures, Proceedings of the 22nd Annual Sym-
posium on Principles of Distributed Computing, PODC 2003, Boston, MA.
Lev, Y., V. Luchangco, V. Marathe, M. Moir, D. Nussbaum, and M. Olszewski,
Anatomy of a scalable software transactional memory, TRANSACT 2009,
Raleigh, NC.
Menon, V., S. Balensiefer, T. Shpeisman, A.-R. Adl-Tabatabai, R.L. Hudson, B. Saha,
and A. Welc, Practical weak-atomicity semantics for Java STM, Proceedings of
the 20th Symposium on Parallelism in Algorithms and Architectures, SPAA 2008,
Munich, Germany.
Moore, K.E., J. Bobba, M.J. Moravan, M.D. Hill, and D.A. Wood, LogTM: Log-
based transactional memory, Proceedings of the 12th International Symposium on
High-Performance Computer Architecture, HPCA 2006, Austin, TX.
Moore, K.F. and D. Grossman, High-level small-step operational semantics for
transactions, Proceedings of the 35th Annual Symposium on Principles of Pro-
gramming Languages, POPL 2008, San Francisco, CA.
Scott, M.L., M.F. Spear, L. Dalessandro, and V.J. Marathe, Delaunay triangulation
with transactions and barriers, Proceedings of the 2007 IEEE International Sym-
posium on Workload Characterization, Boston, MA.
Shavit, N. and D. Touitou, Software transactional memory, Proceedings of the 14th
Annual Symposium on Principles of Distributed Computing, PODC 1995, Ottawa,
Ontario, Canada.
Shpeisman, T., V. Menon, A.-R. Adl-Tabatabai, S. Balensiefer, D. Grossman,
R.L. Hudson, K.F. Moore, and B. Saha, Enforcing isolation and ordering in STM,
Proceedings of the 2007 Conference on Programming Language Design and
Implementation, PLDI 2007, San Diego, CA.
Chapter 12
Emerging Applications
Pradeep Dubey
12.1 Introduction
A wave of digitization is all around us. Digital data continues to grow in leaps and
bounds in its various forms, such as unstructured text on the Web, high-definition
images, increasing digital content in health clinics, streams of network access logs or
ecommerce transactions, surveillance camera video streams, as well as massive vir-
tual reality datasets and complex models capable of interactive and real-time render-
ing, approaching photo-realism and real-world animation. While none of us perhaps
has the crystal ball to predict the future killer app, it is our belief that the next round
of killer apps will be about addressing the data explosion problem for end users, a
problem of growing concern and importance for both corporate and home users. The
aim of this chapter is to understand the nature of such applications and to propose a structured framework for their analysis.
The accompanying taxonomy figure groups emerging workloads into recognition (R), mining (M), and synthesis (S). Representative modeling and mining techniques include clustering/classification, Bayesian networks, Markov models, decision trees and forests of trees, neural and probabilistic networks, optimization-based (linear/nonlinear/stochastic) models, and time-series models; representative mining applications include web mining, semantic search, streaming and distributed data mining, content-based image retrieval, query by humming, video mining, and intrusion mining. Synthesis examples include photo-real synthesis, virtual world simulation, behavioral synthesis, physical simulation, strategy simulation, audio synthesis, video/image synthesis, summary synthesis, and machine translation.
FIGURE 12.5: Most RMS apps are about enabling an interactive (real-time) RMS loop (iRMS), in which modeling or recognition (R), mining (M), and synthesis (S) feed a cycle of execution and evaluation.
Further figures illustrate the synthesis pipeline (model, physical simulation, and rendering, each either procedural or analytical) and two nested loops in a virtual-world setting: an outer iRMS visual loop, one per real user, whose performance needs are limited by the input/output limits of human perception, and an inner iRMS analytics loop, one per bot per user (trade bots, chat bots, shop bots, gambler bots, player bots, and reporter bots), whose performance needs can far exceed those limits.
One example is the scene-completion approach of [3], which mines a database of millions of photographs for relevant photos and uses them to fill the image hole; in our example, this would mean finding a photo taken of the building of interest without the car parked in front. Thus, an image pro-
cessing problem is transformed largely into a data-mining problem, an approach
that can only be practical with access to a huge image database.
For another example illustrating algorithmic implications of massive data use, the
reader is referred to [4].
nature and the amount of parallelism. Dependences can be removed exposing more
parallelism, or a faster solution, if one is willing to accept an approximate answer. For
example, end-biased histogram (part of the so-called iceberg query) is a good exam-
ple of an approximate query. It answers how often the most frequent item exceeds a
certain threshold, as opposed to the actual count for the most frequent item. The space complexity of the latter has a provable lower bound of Ω(N), whereas the end-biased histogram query has sublinear space complexity. Such reductions in space and time
complexities [6] are critical for the underlying streaming (often in-memory) usages
for many RMS applications. In general, the statistical nature of most RMS applica-
tions lends them to implementations based on randomized algorithm (such as Monte
Carlo). As we know, randomized algorithms generally have a higher degree of paral-
lelism, and hence a faster solution, compared to their deterministic counterparts [7].
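As a concrete illustration of why approximate queries can get by with sublinear space, the following is a minimal sketch of one standard streaming summary for frequent items (the Misra–Gries algorithm, not taken from this chapter): with K counters it retains every item whose true frequency exceeds N/(K+1), using O(K) space regardless of the stream length N.

#define K 8                          /* number of counters: space is O(K), not O(N) */

typedef struct { int item; int count; } Slot;
static Slot slots[K];

void observe(int item) {
  for (int i = 0; i < K; i++)        /* already tracked: increment its counter */
    if (slots[i].count > 0 && slots[i].item == item) { slots[i].count++; return; }
  for (int i = 0; i < K; i++)        /* otherwise, claim a free counter if one exists */
    if (slots[i].count == 0) { slots[i].item = item; slots[i].count = 1; return; }
  for (int i = 0; i < K; i++)        /* no free counter: decrement every counter */
    slots[i].count--;
}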
Furthermore, a significant subset of RMS applications is targeted at digital con-
tent creation or audiovisual synthesis. Driven by the needs and limitations of human
sensory perceptions, one can make various algorithmic and accuracy trade-offs that
would not be permissible in scientific discovery context. For example, a simulation of
fluid, aimed at visual fidelity alone, can approximate fluid with a collection of inde-
pendent particles; whereas, the same for scientific accuracy would mandate a much
more complex simulation, in line with the venerated Navier–Stokes equation. These
accuracy approximations, when acceptable to specific human perception needs, sim-
ilar to the previous section, offer new opportunities for parallelization. There are lim-
ited and more recent attempts of compile-time optimizations as well for automating
the discovery of such performance opportunities in real-time applications, as in the
loop perforation technique proposed in [8].
kernel, looping over a large stream of data, or a large number of parallel threads of
widely varying sizes, including some very small in execution duration. Furthermore,
since the subtasks, such as lower-order surfaces, are part of a higher-order model,
low overhead sharing of control and data is often crucial for an efficient implementa-
tion. Consider, for example, the three levels of parallelism described for parallelizing
Cholesky factorization in [11]. There is coarse-grain parallelism at the level of elim-
ination tree, a task dependence graph that characterizes the computation and data
flow among the supernodes. However, the parallelism inside each supernode is fine
grain in nature. Modern chip-level multiprocessors (CMP) have significantly reduced
the traditional overhead associated with fine-grain control and data sharing, as the
on-chip processing nodes are only nanoseconds apart from each other, and not mil-
liseconds apart in older HPC systems with comparable computing power. Smelyan-
skiy [11] demonstrates the potential of very high CMP speedup for a very important,
yet traditionally hard-to-parallelize optimization problem of interior point, benefiting
from the architectural support for both coarse- and fine-grain parallelism.
fixed cost (area, power) CMP designs. We capture it with the following reformulation: S = ((1 − P)·K_N + P/N)^(−1). Note the factor K_N applied to the scalar component, implying a slowdown of single-thread performance by a factor that depends on the target N. For example, one may be able to offer an 8× increase in compute density from N = 4 to N = 32, provided we are willing to accept a 4× slowdown on single-thread performance, that is, K_32 = 4. Under such conditions, the performance benefit of higher N is no longer available to all applications as before, but only to applications with high enough P. Thus, the proposed reformulation captures the high-level trade-off implied by throughput computing, and it no longer has a monotonic performance implication for all parallel applications. With a little bit of algebra, one can derive the minimum parallelism, P_min, for positive speedup as P_min = N(K_N − 1)/(N·K_N − 1). For the illustrative set of parameters here, P must be greater than about 0.75 for there to be positive speedup. There have been several recent attempts at revisiting Amdahl's law. The interested reader is referred to [13–15].
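A quick numeric check of this reformulation, using the chapter's illustrative values N = 32 and K_32 = 4 (the program below is only a convenience for reproducing the arithmetic):

#include <stdio.h>

int main(void) {
  double N = 32.0, K = 4.0;
  double Pmin = N * (K - 1.0) / (N * K - 1.0);   /* = 96/127, about 0.756 */
  printf("P_min = %.3f\n", Pmin);
  for (double P = 0.5; P <= 1.0001; P += 0.1) {
    double S = 1.0 / ((1.0 - P) * K + P / N);    /* S < 1 until P exceeds P_min */
    printf("P = %.1f  ->  S = %.2f\n", P, S);
  }
  return 0;
}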
• Quite a few of the applications exhibit a large amount of parallelism, and hence near-linear scalability, and nearly all of them show better than 50% resource utilization up to large core counts.
• The primary scalability challenge with many of the applications (e.g., fluid simulation) has less to do with a lack of parallelism, and more to do with not being able to data-feed the parallel instances in a cost-constrained CMP context. This is often referred to as the feeding-the-beast challenge, as the compute density of multicores is expected to increase faster than the external memory bandwidth, rapidly approaching less than 0.1 B/flop, down from 1 B/flop in the single-core era.
The accompanying figure plots manycore speedup with respect to a single core against the number of cores (1–64) for small, medium, and large inputs, datasets, and grids.
12.4 Conclusion
System-level analysis of important applications leading to improved workload and
benchmark proxies is not new. However, the present multicore era has added a
dimension of urgency to it. This urgency is especially acute for the compute-intensive
subset of applications for the following simple reason: it is much easier for a programmer to end up with a performance-crippling mismatch between the assumed compute model and the host machine model of a modern multicore/manycore, highly threaded processor than it was with a pre-multicore-era, single-core, single-thread processor. In other words, the multicore era potentially implies a much higher degree of performance variability than the preceding single-core generation. Coping with this challenge requires a deeper and more structured approach to understanding the nature of emerging compute-intensive applications. It is hoped that the RMS taxonomy, the proposed structured framework for application analysis, and the introduction to the key attributes of the emerging applications and their system implications have provided the reader with the necessary background for delving deeper into this subject. Real-time
availability of massive data for a vast majority of tomorrow’s computer users, cou-
pled with the rapidly growing compute capabilities of multicore/manycore compute
nodes in modern datacenters, offers an unprecedented opportunity for enabling new
usages of compute, perhaps making it almost as critical and yet implicit as electricity
today.
References
1. Y.-K. Chen, J. Chhugani, P. Dubey, C. J. Hughes, D. Kim, S. Kumar, V. W.
Lee, A. D. Nguyen, M. Smelyanskiy. Convergence of recognition, mining,
and synthesis workloads and its implications. Proceedings of the IEEE, 96(5),
790–807, April 2008.
2. http://www.nature.com/news/2006/061106/full/news061106-6.html
3. J. Hays and A. A. Efros. Scene completion using millions of photographs. In
SIGGRAPH 2007; also in Communications of the ACM, 51(10), 87–94, October 2008.
4. S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome
in a day. In International Conference on Computer Vision, Kyoto, Japan, 2009.
5. S. Mitra, S. K. Pal, and P. Mitra. Data mining in soft computing framework: A
survey. IEEE Transactions on Neural Networks, 13(1), 3–14, 2002.
6. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues
in data stream systems. In ACM PODS 2002, Madison, WI, June 3–6, 2002.
7. R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University
Press, New York, 1995.
8. S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard. Quality of service
profiling. In ICSE’10, Cape Town, South Africa, May 2–8, 2010.
9. M. Smelyanskiy, D. Holmes, J. Chhugani, A. Larson, D. M. Carmean,
D. Hanson, P. Dubey, K. Augustine, D. Kim, A. Kyker, V. W. Lee, A. D. Nguyen,
L. Seiler, and R. Robb. Mapping high-fidelity volume rendering for medical
imaging to CPU, GPU and many-core architectures. IEEE Transactions on
Visualization and Computer Graphics, 15(6), 2009.