Communication
Introduction
Normal point-to-point communication is the process of sending and receiving messages from one
unit of execution (UE) [MSM04] to another UE. Collective communication, on the other hand, is the
process of exchanging information among multiple UEs: one-to-all or all-to-all communications.
While arbitrarily complex communication patterns can emerge depending on the parallel algorithms
used and the topology of the UEs, most communication between UEs actually follows very
regular patterns.
Our pattern language, shown in Table 1.1, provides a catalog of the regular collective communication
patterns that are used in scientific computing (for this submission, we will only focus on the first
two patterns: Broadcast and Reduction). While there are many collective communication patterns,
we have distilled the ones that we have found to be commonly used across algorithms.
Following such patterns whenever possible not only makes programs more succinct but also better
exposes their intent to other programmers. These collective communication patterns are so
common in scientific computing that most parallel programming environments, such as OpenMP,
MPI and Java, provide built-in constructs to support some, if not all, of them. Programmers
are encouraged to browse through the list of APIs on their specific platforms to see whether such
patterns are already provided before implementing their own. Using the built-in constructs
not only prevents reinventing the wheel but also offers better performance, since the constructs have
been tuned for their specific platforms.
In this paper, we examine the following parallel programming environments and discuss the
facilities that they provide for our patterns:
1. MPI [MPI]
2. OpenMP [Ope]
3. Charm++ [Cha]
4. CUDA [CUD]
5. Cilk [Cil]
6. FJ Framework [JSR]
7. Intel TBB [TBB]
8. Java [Jav]
Table 1.1 The collective communication patterns in our catalog

Broadcast (one-to-all): How does one efficiently share the same data from one UE to other UEs?
Reduction (all-to-one): How does one efficiently combine a collection of values, one on each UE, into a single object?
Scatter (one-to-all): How does one efficiently divide data from one UE into chunks and distribute those chunks to other UEs?
Gather (all-to-one): How does one efficiently gather chunks of data from different UEs and combine them on one UE?
These patterns often work together in an algorithm. Some algorithms will use the Reduction
pattern to combine the values, followed immediately by a Broadcast to send the combined value to every
other UE. Because this reduce-broadcast combination is used so often, some environments provide
constructs to support it directly; in MPI, this is provided as the MPI_Allreduce construct.
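As an illustration, the following minimal MPI sketch (with hypothetical variable names) sums one partial value per UE with MPI_Allreduce so that every UE ends up with the combined result:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int my_id;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    /* Each UE contributes one partial value (here, simply its rank). */
    int my_partial = my_id;
    int global_sum = 0;

    /* Reduce and broadcast in one call: every UE receives the sum. */
    MPI_Allreduce(&my_partial, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("UE %d sees the combined value %d\n", my_id, global_sum);

    MPI_Finalize();
    return 0;
}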
Our pattern language does not present such combinations of patterns in its catalog. Instead,
only the basic patterns are presented, and programmers can determine which combinations to
use depending on their algorithms and the constructs that are provided in their environments.
CHAPTER 2
Broadcast
2.1 Problem
How does one efficiently share the same data from one UE to other UEs?
2.2 Context
In parallel computing it is common for one UE, say UE#1 to have the data that other UEs require
for their computation. This arises naturally from using the Task Decomposition and Data
Sharing patterns[MSM04] where each UE is assigned a different responsibility. For instance,
UE#1 might be responsible for reading data from disk and it needs to share the data that it
has read with the other UEs. Or UE#1 might be responsible for computing the result of a long
running computation that it needs to share with the other UEs once it is done; the other UEs can
still proceed with their computations while waiting for UE#1.
Consider the case of matrix-vector multiplication used in solving circuit equations [Boy97]. One
simple way to parallelize this algorithm would be to decompose the original matrix A into rows and
assign a row to each UE. Another UE would be responsible for obtaining the vector b and sharing
it with all the other UEs. We call this process of sharing the same data with other UEs
a Broadcast.
Once the data has been successfully shared, each UE can then calculate the result for a particular
row of the resulting vector c. If necessary, the partial results for each row of the vector c can then
be combined using the Gather pattern. The parallelized matrix-vector multiplication is illustrated
in Figure 2.1; same-colored elements reside on the same UE.
Figure 2.1 Matrix-vector Multiplication
2.4 Forces
One-to-all or One-to-many In a broadcast operation, it is possible to send the data to every
UE in the topology (one-to-all) or to a particular group of UEs (one-to-many). Whenever
possible, restricting the broadcast to only the UEs that require the data is better for performance
(a minimal sketch of a one-to-many broadcast appears after these forces). In the case of
message passing, doing so reduces the number of messages that need to be sent on the
network. And in the case of shared-memory environments, doing so reduces the overhead of
maintaining cache coherence when a shared variable is modified.
Push vs. Pull A broadcast operation is essentially a push operation: every UE that is involved
in the broadcast operation must be prepared at some point to receive the data. Some parallel
algorithms fit this push model very well. Others might fit better with a pull model, where
the UEs that are interested in obtaining the data query the UE holding that data when they
need it.
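Relating to the One-to-all or One-to-many force above, the sketch below shows one way to restrict an MPI broadcast to a subgroup by splitting the UEs into a sub-communicator; the NEEDS_DATA predicate and BUFFER_SIZE are hypothetical names introduced for illustration.

#include <mpi.h>

/* Hypothetical predicate: which UEs actually need the data. */
#define NEEDS_DATA(rank) ((rank) % 2 == 0)
#define BUFFER_SIZE 1024

void broadcast_to_group(int *buffer, int my_id) {
    /* UEs that need the data (plus the source, rank 0) join the sub-communicator. */
    int color = (NEEDS_DATA(my_id) || my_id == 0) ? 1 : MPI_UNDEFINED;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, my_id, &group_comm);

    if (group_comm != MPI_COMM_NULL) {
        /* Rank 0 of MPI_COMM_WORLD is also rank 0 of group_comm because
           the split is keyed on my_id. */
        MPI_Bcast(buffer, BUFFER_SIZE, MPI_INT, 0, group_comm);
        MPI_Comm_free(&group_comm);
    }
}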
2.5 Solution
Use the broadcast construct if one is provided by the parallel programming environment. Broadcast
is such a common pattern in scientific computing that it is included in most parallel programming
environments (see Table 2.1). The broadcast constructs provided by these environments have been
tuned to satisfy the needs of most programmers.
2.5.1 Sequential Broadcast
A sequential broadcast is the simplest way to implement a broadcast operation. In Figure 2.2,
UE#1 is in charge of sending the same data to the other UEs. While this is a simple approach, it
is also inefficient and unscalable because UE#1 becomes a bottleneck; the communication channels
between pairs of the other UEs are not being utilized at all.
Nonetheless this approach is simple and convenient when there are not a lot of UEs to broadcast
the data to.
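A minimal sketch of such a sequential broadcast using MPI point-to-point messages is shown below; the buffer size and message tag are hypothetical, and in practice the built-in MPI_Bcast should be preferred.

#include <mpi.h>

#define BUFFER_SIZE 1024
#define BCAST_TAG 0

/* UE#1 (rank 0 here) sends the same buffer to every other UE, one at a time. */
void sequential_broadcast(int *buffer, int my_id, int number_of_processors) {
    if (my_id == 0) {
        int dest;
        for (dest = 1; dest < number_of_processors; dest++)
            MPI_Send(buffer, BUFFER_SIZE, MPI_INT, dest, BCAST_TAG, MPI_COMM_WORLD);
    } else {
        MPI_Recv(buffer, BUFFER_SIZE, MPI_INT, 0, BCAST_TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}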
In some cases, this might be the only way to do a broadcast. In an MPI environment,
all the UEs are aware of one another and can communicate by using their rank IDs. This works
because there is a static number of UEs that are initialized once and remain available throughout
the entire computation. However, in a more dynamic environment such as an Actor system [Agh86],
UEs might be created and destroyed as the computation proceeds. Thus, not every UE will be aware
of all the newly created or recently destroyed UEs; each UE knows only a subset of the available
UEs in the topology. In fact, in such a dynamic environment, making each UE aware of all the other
UEs is prohibitively expensive, since it would require an update to be sent on every creation and
destruction of a UE.
So in Figure 2.2, it might be the case that UE#1 is the only UE that knows about UE#2,
UE#3, UE#4, UE#5 and UE#6. If so, only UE#1 can broadcast the data, and we cannot use the
more efficient recursive doubling technique described in Section 2.5.2.
Figure 2.3 Recursive Doubling
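As a companion to Figure 2.3, the following is a minimal sketch of the recursive doubling idea: in each round, every UE that already holds the data forwards it to one UE that does not, so the number of UEs holding the data doubles each round. The buffer size and tag are hypothetical, and the data is assumed to start on rank 0.

#include <mpi.h>

#define BUFFER_SIZE 1024
#define BCAST_TAG 0

/* Recursive doubling broadcast from rank 0: in the round with distance `step`,
   every rank r < step that already has the data sends it to rank r + step. */
void recursive_doubling_broadcast(int *buffer, int my_id, int number_of_processors) {
    int step;
    for (step = 1; step < number_of_processors; step *= 2) {
        if (my_id < step) {
            int partner = my_id + step;
            if (partner < number_of_processors)
                MPI_Send(buffer, BUFFER_SIZE, MPI_INT, partner, BCAST_TAG,
                         MPI_COMM_WORLD);
        } else if (my_id < 2 * step) {
            MPI_Recv(buffer, BUFFER_SIZE, MPI_INT, my_id - step, BCAST_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
}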
2.6 Invariants
Precondition A value on one UE that needs to be distributed to other UEs.
Invariant The initial value on the source UE.
Postcondition The value from the source UE is now distributed on all the destination UEs.
2.7 Examples
2.7.1 Matrix-vector Multiplication in MPI
Listing 2.1 shows an example of using the built-in MPI_Bcast construct in MPI to solve the matrix-vector
multiplication problem discussed in Section 2.2 and illustrated in Figure 2.1.
In our simple implementation, we assume that 5 UEs are available. The UE with the
VECTOR_PROVIDER_NODE ID is responsible for obtaining the value of the vector. Each of the other
UEs is in charge of a particular row of the matrix. Once the VECTOR_PROVIDER_NODE UE has
obtained the value of the vector, every UE (including itself) calls the MPI_Bcast function. The
following code snippet shows the arguments to the function.
67  if (my_id == VECTOR_PROVIDER_NODE)
68      obtain_vector();
69
70  MPI_Bcast(vector, VECTOR_SIZE, MPI_INT, VECTOR_PROVIDER_NODE, MPI_COMM_WORLD);
The vector argument is the data to be broadcast to all the UEs; in this case, it is an array
of integers. The VECTOR_SIZE argument is the number of elements in the array to broadcast.
The VECTOR_PROVIDER_NODE argument tells the implementation that the value of the vector argument
resides on the node with VECTOR_PROVIDER_NODE as its ID. And finally, the MPI_COMM_WORLD
argument tells the implementation to broadcast the value to every UE in the topology.
Listing 2.1 Matrix-vector Multiplication Example Using MPI_Bcast

39  int calculate_result_for_row(int row) {
40      int result = 0;
41
42      int *matrix_row = get_matrix_row(row);
43
44      int i;
45      for (i = 0; i < VECTOR_SIZE; i++) {
46          result += matrix_row[i] * vector[i];
47      }
48
49      return result;
50  }
51
52  int main(int argc, char *argv[]) {
53
54      int my_id;
55      int number_of_processors;
56
57      //
58      // Initialize MPI and set up SPMD programs
59      //
60      MPI_Init(&argc, &argv);
61      MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
62      MPI_Comm_size(MPI_COMM_WORLD, &number_of_processors);
63
64      //
65      // One of the nodes obtains the vector and broadcasts it
66      //
67      if (my_id == VECTOR_PROVIDER_NODE)
68          obtain_vector();
69
70      MPI_Bcast(vector, VECTOR_SIZE, MPI_INT, VECTOR_PROVIDER_NODE, MPI_COMM_WORLD);
71
72      //
73      // All other UEs should calculate their partial values
74      //
75      if (my_id != VECTOR_PROVIDER_NODE) {
76          int my_partial_result = calculate_result_for_row(my_id);
77          printf("The partial result for row %d is %d\n", my_id, my_partial_result);
78      }
79
80      MPI_Finalize();
81
82      return 0;
83  }
After the vector has been successfully broadcast, each UE calculates the partial value for its
row of the resulting vector and prints it out to the console.
Listing 2.2 Matrix-vector Multiplication Example Using AtomicIntegerArray in Java

1   import java.util.concurrent.*;
2   import java.util.concurrent.atomic.AtomicIntegerArray;
3
4   public class MatrixVectorMultiplication {
5       private static AtomicIntegerArray vector;
6       ....
7
8       static class RowMultiplier implements Runnable {
9
10          ....
11
12          public void run() {
13              waitForVectorBroadcast();
14
15              int sum = 0;
16              for (int i = 0; i < row.length; i++) {
17                  sum += row[i] * vector.get(i);
18              }
19              System.out.println("The partial result for row " + rowNumber + " is " +
20                  sum);
21          }
22
23      }
24
25      public static void createMatrixVectorMultiplicationUEs() {
26
27          ExecutorService executor = Executors.newFixedThreadPool(VECTOR_SIZE);
28
29          for (int row = 0; row < VECTOR_SIZE; row++) {
30              RowMultipliers[row] = new RowMultiplier(row);
31              executor.execute(RowMultipliers[row]);
32          }
33
34      }
35
36      public static void main(String[] argv) {
37
38          createMatrixVectorMultiplicationUEs();
39
40          obtainVector();
41
42      }
43
44  }
In Listing 2.2, the RowMultiplier UEs are spawned and run in their own threads while waiting for
the main thread to broadcast the value of the vector to them. Once the main thread obtains the
value of the vector, the cached value of the vector in each of the threads is invalidated and they
all see the new value from the main thread.
As mentioned previously, a programmer can rely on the volatile keyword in Java to automatically
broadcast the modified value of a shared variable to all UEs. However, when applied to an array,
the volatile keyword does not extend these broadcast semantics to the individual elements of the
array. Thus, updates to the elements of the array might not be broadcast to the other UEs, which
might still have cached copies of the old values.
The AtomicIntegerArray class solves that problem by extending the semantics of the volatile
keyword to the array elements. We use it in Listing 2.2 to store the values of the vector variable
used for the matrix-vector multiplication. Any update to the vector variable from the main UE
will be broadcast to the other UEs.
Scatter Broadcast sends the same data to other UEs; Scatter sends different chunks of data
to other UEs.
CHAPTER 3
Reduction
3.1 Problem
How does one efficiently combine a collection of values, one on each UE, into a single object?
3.2 Context
Most parallel algorithms divide the problem that needs to be solved into tasks that are assigned
to different UEs. Doing so allows the different tasks to be performed in parallel on different UEs.
Once those tasks are done, it is usually necessary to combine their partial results into a single
object that represents the answer to the original problem.
Consider the problem of numerical integration of some function. One way to parallelize this
algorithm would be to divide the domain of the function into different parts and assign a UE to
integrate each part. Once all the UEs are done with the integration of their sub-domains, we
combine their results using a sum operator to obtain the total value. This pattern of combining
the results is called a reduction.
The example in Figure 3.1 divides the numerical integration problem into tasks and maps those
tasks onto four different UEs.
A simple approach for combining those values involves waiting for the results from all four
tasks to finish before computing the sum. While this works, it also fails to exploit any inherent
parallelism between the tasks. In the example, it is actually unnecessary to wait for the results of
the four tasks before calculating the sum.
We could do better and partially sum the results as they arrive. For instance, we could compute
the partial sum from UE#1 and UE#2 and the partial sum of UE#3 and UE#4 in parallel and
then combine both partial sums to find the total sum. We perform the summation operation on
each piece of data as it arrives.
Figure 3.1 Numerical Integration

The inherent parallelism arises from the properties of the operator that is used to combine the
partial results. In our example, the summation operation is both associative
((A + B) + C = A + (B + C)) and commutative (A + B = B + A).
Those properties allow the operation to be performed as each partial result arrives. Whenever pos-
sible, a properly implemented reduction allows the programmer to efficiently combine a collection
of values into a single value by exploiting any inherent parallelism in the process.
Nonetheless, combining a collection of values does not always require using Reduction. Con-
sider an application that stores its results in a bit array. Each UE computes the value (0 or 1) for
a range of positions with no overlaps in the bit array. Using Reduction to combine the values
would require merging the values from each UE and building up partial bit arrays. Because the
value from each UE corresponds to a range of positions in the final bit array, care must be taken
while merging and building up the partial bit arrays. Moreover, in message-passing environments,
all those partial bit arrays incur the overhead of being sent to different UEs in order to compute
the result.
Instead, for this application, it might be better to rely on the Gather pattern. One UE,
say UE#1, is responsible for merging all the results. The other UEs compute the values for non-
overlapping ranges in the bit array and send their results as a tuple to UE#1. After receiving all
the values, UE#1 produces the complete bit array in the desired order.
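A minimal MPI sketch of this Gather-based alternative is shown below: UE#1 (rank 0 here) collects the non-overlapping chunks with MPI_Gatherv. The equal-sized, rank-ordered chunk layout is a simplifying assumption for illustration.

#include <mpi.h>
#include <stdlib.h>

/* Each UE owns chunk_size consecutive positions of the bit array,
   stored here as one int per position for simplicity. */
void gather_bit_array(int *my_chunk, int chunk_size, int my_id,
                      int number_of_processors) {
    int *counts = NULL, *displs = NULL, *full_array = NULL;
    int i;

    if (my_id == 0) {
        counts = malloc(number_of_processors * sizeof(int));
        displs = malloc(number_of_processors * sizeof(int));
        full_array = malloc(number_of_processors * chunk_size * sizeof(int));
        for (i = 0; i < number_of_processors; i++) {
            counts[i] = chunk_size;      /* equal-sized chunks assumed */
            displs[i] = i * chunk_size;  /* placed in rank order       */
        }
    }

    /* Every UE sends its chunk; rank 0 receives them at the right offsets. */
    MPI_Gatherv(my_chunk, chunk_size, MPI_INT,
                full_array, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (my_id == 0) {
        /* ... use full_array ... */
        free(counts); free(displs); free(full_array);
    }
}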
3.3 Forces
Common operator Reduction requires that the same operator is used to combine the values
from all the different UEs. This is a common pattern in scientific computing and will fit
the needs of most algorithms. However, if an algorithm requires combining the results from
different UEs using different operations, then it is better to orchestrate the communication
using other patterns.
Load-balancing Reduction operations with associative and/or commutative operators offer
potential for load-balancing because the task of partially combining the values can be executed
on different UEs as each partial value is computed. This is especially useful for operations
that are computationally intensive.
Floating-point numbers Care has to be taken when using reduction with floating-point numbers.
Operations such as addition on floating-point numbers are not strictly associative, so
round-off errors can accumulate depending on the order in which the operator is applied to
the partial values. In cases where the partial values have roughly the same magnitude, the
loss of precision is usually acceptable to the programmer. However, if the partial values have
vastly different magnitudes, the error could be unacceptable; programmers should then not
rely on reduction to combine the partial values but should instead orchestrate the communication
using other patterns.
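A small C example of this order dependence, with values chosen only for illustration:

#include <stdio.h>

int main(void) {
    double a = 1.0e16, b = -1.0e16, c = 1.0;

    /* The same three values reduced in two different orders give different
       results because double addition is not associative; with IEEE 754
       doubles this typically prints 1.0 and then 0.0. */
    printf("(a + b) + c = %.1f\n", (a + b) + c);
    printf("a + (b + c) = %.1f\n", a + (b + c));
    return 0;
}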
3.4 Solution
Use the reduction construct if one is provided by the parallel programming environment. Reduction
is such a common pattern in scientific computing that it is included in most parallel programming
environments (see Table 3.1). The reduction constructs provided by these environments have been
tuned to satisfy the needs of most programmers and are designed to exploit the inherent concurrency
that is possible with associative and commutative binary operators.
Deciding what operator to use in combining the collection of values is also part of the
process of applying the Reduction pattern. Most environments already provide built-in
operators for common tasks such as determining the maximum, minimum, sum, product, etc.,
which programmers can use directly in their reduction operations.
Moreover, most environments also provide support for user-defined operators that programmers
can write. When writing their own operators, programmers have to consider whether their operators
are associative and/or commutative. Programmers also have to decide on the data structure
(single value, tuple, etc.) that is used to store the partial result.
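For instance, MPI lets a programmer register a user-defined operator with MPI_Op_create and pass it to MPI_Reduce. The sketch below combines hypothetical (sum, count) pairs, a data structure one might use when computing a global average:

#include <mpi.h>
#include <stdio.h>

/* User-defined operator: combine (sum, count) pairs element-wise.
   invec and inoutvec each hold *len pairs of ints laid out as {sum, count}. */
static void sum_count_op(void *invec, void *inoutvec, int *len, MPI_Datatype *type) {
    int *in = (int *)invec, *inout = (int *)inoutvec;
    for (int i = 0; i < *len; i++) {
        inout[2 * i]     += in[2 * i];      /* combine the sums   */
        inout[2 * i + 1] += in[2 * i + 1];  /* combine the counts */
    }
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int my_id;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    /* Hypothetical partial result on each UE: a sum and a sample count. */
    int my_pair[2] = { my_id * 10, 1 };
    int total_pair[2];

    MPI_Datatype pair_type;
    MPI_Type_contiguous(2, MPI_INT, &pair_type);
    MPI_Type_commit(&pair_type);

    MPI_Op op;
    MPI_Op_create(sum_count_op, 1 /* commutative */, &op);

    MPI_Reduce(my_pair, total_pair, 1, pair_type, op, 0, MPI_COMM_WORLD);
    if (my_id == 0)
        printf("total sum = %d over %d samples\n", total_pair[0], total_pair[1]);

    MPI_Op_free(&op);
    MPI_Type_free(&pair_type);
    MPI_Finalize();
    return 0;
}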
On the other hand, if the parallel programming environment does not provide a reduction
construct, programmers might need to write one of their own. Note also that the reduction construct
provided by the environment might make the implicit assumption that the operator is
commutative and/or associative; if the operator does not have those properties, programmers
need to write their own reduction construct.
We summarize the two common methods — serial computation and tree-based reduction —
for implementing reduction operations presented in [MSM04]. More complex implementation tech-
niques are presented in [GGKK03].
If the reduction operator is not associative, then the programmer needs to "serialize" the
computation. One way of doing so is shown in Figure 3.2. In this method, all UEs send their values to
UE#1 in a predetermined order, and UE#1 is in charge of combining those values using the OP
operator. At the end of the computation, only UE#1 contains the combined value.
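A minimal MPI sketch of this serialized reduction, with rank 0 playing the role of UE#1 and a hypothetical non-associative combine() standing in for OP, might look like this:

#include <mpi.h>

#define REDUCE_TAG 1

/* Hypothetical non-associative operator (subtraction as a stand-in for OP). */
static double combine(double accumulated, double next) {
    return accumulated - next;
}

/* All UEs send their value to rank 0, which combines them in rank order. */
double serial_reduction(double my_value, int my_id, int number_of_processors) {
    if (my_id != 0) {
        MPI_Send(&my_value, 1, MPI_DOUBLE, 0, REDUCE_TAG, MPI_COMM_WORLD);
        return 0.0;  /* only rank 0 ends up with the combined value */
    }

    double accumulated = my_value;
    int source;
    for (source = 1; source < number_of_processors; source++) {
        double incoming;
        MPI_Recv(&incoming, 1, MPI_DOUBLE, source, REDUCE_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        accumulated = combine(accumulated, incoming);
    }
    return accumulated;
}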
However, if the reduction operator is associative, we can take advantage of the inherent paral-
lelism by performing the reduction operation in parallel using a tree-based reduction as shown in
Figure 3.3.
In this specific example, we assume that we have only 4 UEs. UE#1 applies the OP operator
to its own value and the value that it receives from UE#2, and stores that result temporarily.
Similarly, UE#3 applies the OP operator to its own value and the value it receives from UE#4,
and stores that result temporarily. Finally, UE#1 combines its stored value and the value it
receives from UE#3 using the OP operator and stores the combined result.
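A minimal MPI sketch of this tree-based reduction, again with a hypothetical combine() standing in for OP, is shown below; at the end, rank 0 (UE#1) holds the combined value.

#include <mpi.h>

#define REDUCE_TAG 2

/* Hypothetical associative operator (addition as a stand-in for OP). */
static double combine(double a, double b) {
    return a + b;
}

/* Tree-based reduction: in each round, half of the remaining UEs send their
   partial value to a partner, which combines it with its own. */
double tree_reduction(double my_value, int my_id, int number_of_processors) {
    double accumulated = my_value;
    int step;
    for (step = 1; step < number_of_processors; step *= 2) {
        if (my_id % (2 * step) == 0) {
            int partner = my_id + step;
            if (partner < number_of_processors) {
                double incoming;
                MPI_Recv(&incoming, 1, MPI_DOUBLE, partner, REDUCE_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                accumulated = combine(accumulated, incoming);
            }
        } else if (my_id % (2 * step) == step) {
            MPI_Send(&accumulated, 1, MPI_DOUBLE, my_id - step, REDUCE_TAG,
                     MPI_COMM_WORLD);
            break;  /* this UE has handed off its partial value */
        }
    }
    return accumulated;  /* meaningful only on rank 0 */
}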
3.5 Invariants
Precondition A collection of values on different UEs that need to be combined using the same
operator.
Postcondition The values from different UEs are combined using the operator and are contained
in an object on one of the UEs.
3.6 Examples
3.6.1 Numerical Integration with MPI
Listing 3.1 shows an example of using the built-in MPI_Reduce construct to solve the
numerical integration problem discussed in Section 3.2 and illustrated in Figure 3.1.
Our strategy involves dividing the domain of the function into several non-overlapping
sub-domains. The numerical integration over each sub-domain is calculated in parallel, and the partial
values are reduced to obtain the value of integrating the function over the entire domain.
The domain (UPPER_LIMIT - LOWER_LIMIT) of the function to integrate is divided evenly among
the available UEs, as determined by the MPI_Comm_size function. Each UE handles
a particular sub-domain and none of the sub-domains overlap. In our example, each UE is allotted
the same number of intervals, i.e., dx slices, to calculate (a better scheme could use the rate of
change of the function to determine how many intervals to allot to each sub-domain; a more
slowly changing function would require fewer intervals).
Each UE starts integrating from its own lower bound. The partial result is stored
in the my_partial_integration variable. After the partial result is available, each UE calls the
MPI_Reduce function. The following code snippet shows the arguments to the function.
53  double total_integration;
54  MPI_Reduce(&my_partial_integration, &total_integration, 1, MPI_DOUBLE, MPI_SUM,
55             MASTER_NODE, MPI_COMM_WORLD);
The MPI_SUM argument tells the reduction operation to sum the my_partial_integration
variables from each UE and to store the result in the total_integration variable. The result will
be stored on the MASTER_NODE UE.
Internally, the MPI implementation can take advantage of the associative and commutative
properties of the MPI_SUM operation and perform the summation in parallel, yielding better
performance.
3.6.2 Estimating π with MPI

Estimating π using this Monte Carlo method involves randomly "throwing" points into the first
quadrant of the square shown in Figure 3.4. Some of those points will land inside the circumference
of the unit circle and some will land outside it. The ratio of the points inside the circle to the
total points thrown gives an estimate for π based on the following formula:
Area of quarter unit circle / Area of quarter of square = (π · (1 unit)² / 4) / (1 unit)² ≈ Points in circle / Total points in first quadrant

⇒ π ≈ 4 × (Points in circle / Total points in first quadrant)
The function count_points_in_circle(int *) on lines 17 - 27 of Listing 3.2 shows how the
points are randomly generated and tested to see whether they fall inside the circumference.
Listing 3.2 Estimating π Example Using MPI_Reduce

17  int count_points_in_circle(int *stream_id) {
18      int i, my_count = 0;
19      double x, y, distance_squared;
20      for (i = 0; i < ITERATIONS; i++) {
21          x = (double) sprng(stream_id);
22          y = (double) sprng(stream_id);
23          distance_squared = x*x + y*y;
24          if (distance_squared <= RADIUS) my_count++;
25      }
26      return my_count;
27  }
28
29  int main(int argc, char *argv[]) {
30
31      int my_id;
32      int number_of_processors;
33
34      //
35      // Initialize MPI and set up SPMD programs
36      //
37      MPI_Init(&argc, &argv);
38      MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
39      MPI_Comm_size(MPI_COMM_WORLD, &number_of_processors);
40
41      //
42      // Initialize pseudo parallel random number generator
43      //
44      generate_seed();
45      int *stream_id = init_sprng(SPRNG_LFG, my_id, number_of_processors, rand(),
46                                  SPRNG_DEFAULT);
47
48      int my_count = count_points_in_circle(stream_id);
49
50      //
51      // Combine the results from different UEs
52      //
53      int total_count;
54      MPI_Reduce(&my_count, &total_count, 1, MPI_INT, MPI_SUM, MASTER_NODE,
55                 MPI_COMM_WORLD);
56
57      if (my_id == MASTER_NODE) {
58          double estimated_pi = (double) total_count /
59                                (ITERATIONS * number_of_processors) * 4;
60          printf("The estimate of pi is %g\n", estimated_pi);
61      }
62
63      MPI_Finalize();
64
65      return 0;
66  }
Each UE is initialized to perform the same algorithm in parallel, albeit with different random
points. To ensure that each UE receives a set of random points with minimal chance of repetition,
we rely on the SPRNG [SPR] parallel pseudo-random number generator library.
After each UE has successfully completed the algorithm, the MPI_Reduce function is called to
combine the partial counts from each UE; the MASTER_NODE UE then computes the estimate of π
from the total count.
The CUDA example computes a histogram in two reduction phases: each thread block first
accumulates a per-block histogram, and the per-block histograms are then reduced into the final
histogram. Listing 3.3 is the routine for the histogram accumulation. Listing 3.4 is the routine
for the first-phase reduction. Listing 3.5 is the routine for the second-phase reduction. Listing 3.6
is the routine for the tree-based reduction.
Listing 3.3 Routine for Histogram Accumulation

117  void CalcHistoGram(int n, int *devFeature, int *devHistoResult, int reductionLevel)
118  {
119      int blockNum = (n + BLOCKSIZE - 1) / BLOCKSIZE;
120      dim3 blockDim(BLOCKSIZE, 1);
121      dim3 gridDim(blockNum, 1);
122      int *devHisto = 0;
123      cudaMalloc((void **)&devHisto, blockNum * HISTOLAYER * sizeof(int));
124
125      // First Phase Reduction
126      CalcHistoPerBlock<<<gridDim, blockDim>>>(n, devFeature, devHisto, reductionLevel);
127      dim3 oneGrid(1, 1);
128
129      // Second Phase Reduction
130      CalcHistoPerGrid<<<oneGrid, blockDim>>>(blockNum, devHisto, devHistoResult, reductionLevel);
131      cudaFree(devHisto);
132  }
Listing 3.5 Routine for Second Phase Reduction

76  __global__ void CalcHistoPerGrid(int blockNumber, int *blockHistogram, int *
77                                   finalHistogram, int reductionLevel)
78  {
79      __shared__ int devResult[BLOCKSIZE * HISTOLAYER];
80
81      // Initialization
82      if (threadIdx.x < blockNumber)
83      {
84          for (int i = 0; i < HISTOLAYER; i++)
85          {
86              devResult[i * BLOCKSIZE + threadIdx.x] = 0;
87          }
88      }
89
90      // Apply serial reduction for taskPerTh elements
91      int taskPerTh = (blockNumber + BLOCKSIZE - 1) / BLOCKSIZE;
92
93      for (int i = 0; i < taskPerTh; i++)
94      {
95          int index = threadIdx.x + i * BLOCKSIZE;
96          if (index < blockNumber)
97          {
98              for (int j = 0; j < HISTOLAYER; j++)
99              {
100                 devResult[j * BLOCKSIZE + threadIdx.x] += blockHistogram[index + j *
101                     blockNumber];
102             }
103         }
104     }
105     __syncthreads();
106     TreeReduction(devResult, blockNumber, reductionLevel);
107
108     if (threadIdx.x == 0)
109     {
110         for (int i = 0; i < HISTOLAYER; i++)
111         {
112             finalHistogram[i] = devResult[i * BLOCKSIZE];
113         }
114     }
115 }
Listing 3.6 Tree-Based Reduction

22  __device__ void TreeReduction(int *devArray, int n, int reductionLevel)
23  {
24      // Tree-Based Reduction
25      int mask = 1;
26      for (int level = 0; level < reductionLevel; level++)
27      {
28          if ((threadIdx.x & mask) == 0)
29          {
30              int index1 = threadIdx.x;
31              int index2 = (1 << level) + threadIdx.x;
32              if (IMUL(blockDim.x, blockIdx.x) + index2 < n)
33              {
34                  for (int i = 0; i < HISTOLAYER; i++)
35                  {
36                      devArray[BLOCKSIZE * i + index1] += devArray[BLOCKSIZE * i + index2];
37                  }
38              }
39
40          }
41          mask = (mask << 1) | 1;
42          __syncthreads();
43      }
44  }
As the reduction operation proceeds, all UEs block while waiting for each other to reduce their
values. Once all UEs have completed the reduction, they can proceed with their computation
again. A Reduction with a null operator therefore acts as a barrier that synchronizes all UEs.
This technique is used in Charm++, which does not provide a Barrier construct, only a Reduction
construct.
Bibliography
[Agh86] Gul Agha. Actors: a model of concurrent computation in distributed systems. MIT
Press, Cambridge, MA, USA, 1986.
[Boy97] Robert L. Boylestad. Introductory Circuit Analysis. Prentice Hall, 8th edition, 1997.
[Cha] Charm++ Parallel Programming Model. http://charm.cs.uiuc.edu/.
[Cil] The Cilk Project. http://supertech.csail.mit.edu/cilk/.
[CUD] NVIDIA CUDA. http://www.nvidia.com/object/cuda_what_is.html.
[GGKK03] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to
Parallel Computing. Addison Wesley, 2003.
[Jav] Java. http://www.java.com/en/.
[JSR] JSR-166. http://gee.cs.oswego.edu/dl/concurrency-interest/.
[L2N] L2-Norm. http://mathworld.wolfram.com/L2-Norm.html.
[MPI] Message Passing Interface Forum. http://www.mpi-forum.org/.
[MSM04] Timothy Mattson, Beverly Sanders, and Berna Massingill. Patterns for Parallel Pro-
gramming. Addison Wesley, 2004.
[Ope] The OpenMP API Specification for Parallel Programming. http://openmp.org/wp/.
[OPL] Berkeley Pattern Language for Parallel Programming. http://parlab.eecs.berkeley.edu/wiki/patterns/patterns.
[SPR] Scalable Parallel Pseudo Random Number Generators Library. http://sprng.cs.fsu.edu/.
[TBB] Intel Threading Building Blocks. http://www.threadingbuildingblocks.org/.