Language-Based Vectorization and Parallelization Using Intrinsics, OpenMP, TBB and Cilk Plus
https://doi.org/10.1007/s11227-017-2231-3
Przemysław Stpiczyński
Abstract The aim of this paper is to evaluate OpenMP, TBB and Cilk Plus as basic
language-based tools for simple and efficient parallelization of recursively defined
computational problems and other problems that need both task and data parallelization
techniques. We show how to use these models of parallel programming to transform
the source code of Adaptive Simpson's Integration into programs that can utilize multiple
cores of modern processors. Using the example of the Bellman–Ford algorithm for solving
single-source shortest path problems, we show how to improve the performance of data
parallel algorithms by tuning data structures for better utilization of vector extensions
of modern processors. Manual vectorization techniques based on Cilk array notation
and intrinsics are presented. We also show how to simplify such optimization using
Intel SIMD Data Layout Template containers.
1 Introduction
Recently, multicore and manycore computer architectures have become very attractive
for achieving high-performance execution of scientific applications at relatively low
costs [5,13,17]. Modern CPUs and accelerators achieve performance that until recently
was attainable only on supercomputers. Unfortunately, the process of adapting existing software to
such new architectures can be difficult if we expect to achieve reasonable performance
without putting much effort into software development. However, the use of
high-level language-based programming interfaces devoted to parallel programming
can sometimes yield satisfactory results with rather little effort [15].
The software development process for modern Intel multicore CPUs and manycore
coprocessors such as Xeon Phi [5,13] requires special optimization techniques to
obtain code that utilizes the power of the underlying hardware. Usually it is not
sufficient merely to parallelize applications: for such computer architectures, efficient
vectorization is crucial for achieving satisfactory performance [5,17]. Unfortunately, very
often compiler-based automatic vectorization is not possible because of some non-
obvious data dependencies inside loops [1]. Moreover, the performance of vectorized
programs can be improved by the use of proper memory layout. On the other hand,
people expect parallel programming to be easy. The use of simple and powerful pro-
gramming constructs that can utilize underlying hardware is highly desired.
Intel C/C++ compilers and development tools offer many language-based exten-
sions that can be used to simplify the process of developing high-performance parallel
programs [6,17]. OpenMP [3,18] is the most popular, but one can consider using
Threading Building Blocks (TBB for short) [6,12] or Cilk Plus [5,13]. More sophis-
ticated language-based optimization can be done using intrinsics, which allow
programmers to utilize Intel Advanced Vector Extensions (i.e., SIMD extensions)
explicitly [6]. The SDLT template library can be applied to introduce a SIMD-friendly
memory layout transparently [6].
In this paper, we evaluate OpenMP, Intel TBB and Cilk Plus as language-based tools
for simple and efficient parallelization of recursively defined computational problems
and other problems that need both task and data parallelization techniques. We also
advise how to improve the performance of such algorithms by tuning data structures
manually and automatically using SDLT. We further explain how to explicitly utilize
vector units of modern multicore and manycore processors using intrinsics. We show
how to parallelize recursively defined Adaptive Simpson’s Integration Rule [7] using
OpenMP, Intel TBB and Cilk Plus [9], and we consider various implementations of
the Bellman–Ford algorithm for solving the single-source shortest path problem [4] and
examine their performance. These two computational problems have been chosen to
demonstrate the most important features of the considered language-based tools.
TBB (Threading Building Blocks) is a C++ template library supporting task paral-
lelism on Intel multicore platforms [5,12]. It provides a rich collection of components
for parallel programming, and a scheduler which manages and schedules threads to
execute parallel tasks. TBB also provides low-level services for memory allocation and
atomic operations. TBB programs can be combined with OpenMP pragmas specifying
compiler-supported vectorization; thus, they can exploit both task and data parallelism.
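As an illustration, consider a minimal sketch (our assumption, not code taken from the paper) of a tbb::parallel_for loop whose body is vectorized with an OpenMP simd pragma:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Scale a vector in parallel: TBB partitions the index range into tasks,
// and the compiler vectorizes the inner loop of each task.
void scale(float *y, const float *x, int n, float a) {
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [=](const tbb::blocked_range<int> &r) {
            #pragma omp simd
            for (int i = r.begin(); i < r.end(); i++)
                y[i] = a * x[i];
        });
}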
Cilk Plus offers several powerful extensions to C/C++ that allow programmers to express both task
and data parallelism [5,8,13,14]. The most important constructs specify
and handle possible parallel execution of tasks. _Cilk_for followed by the body of
a for loop indicates that iterations of the loop can be executed in parallel. The runtime applies
a divide-and-conquer approach to schedule tasks among active workers to ensure
a balanced workload across the available cores. _Cilk_spawn permits a given function to be
executed asynchronously with the rest of the calling function. _Cilk_sync specifies that
all tasks spawned in a function must complete before execution continues. Another
important feature of Cilk Plus is the array notation which introduces vectorized oper-
ations on arrays. Expression A[start:len:stride] represents an array section
of length len starting from A[start] with the given stride. An omitted stride
defaults to 1. The operator [:] can be used on both static and dynamic arrays. There
are also several built-in functions to perform basic computations among elements in
an array such as sum, min and max. It should be noticed that the array notation can
also be used for array indices. For example, A[x[0:len]] denotes elements of the
array A given by indices from x[0:len]. Intel Cilk Plus also supports Shared Virtual
Memory, which allows data to be shared between the CPU and the coprocessor. It is well
suited to exchanging irregular data of limited size, when explicit synchronization is not
needed frequently [16].
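For instance, a minimal sketch (our illustration, not code from the paper) combining these task constructs with an array-notation reduction:

#include <cilk/cilk.h>

// Sum an array: spawn the left half, compute the right half in the
// current worker, then synchronize; small cases use array notation.
long tree_sum(const long *a, int n) {
    if (n < 1000)
        return __sec_reduce_add(a[0:n]);     // built-in reduction over a section
    long left = _Cilk_spawn tree_sum(a, n / 2);
    long right = tree_sum(a + n / 2, n - n / 2);
    _Cilk_sync;                              // wait for the spawned task
    return left + right;
}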
Intrinsics for SIMD instructions make it possible to take full advantage of Intel Advanced Vector
Extensions (AVX, AVX2, AVX-512), which cannot always be easily achieved due to
limitations of programming languages and compilers [6]. They allow programmers
to write constructs that look like C/C++ function calls corresponding to actual SIMD
instructions. Such calls are replaced with assembly code inlined directly into programs.
The disadvantage of this solution is the lack of code portability across different
versions of the vector extensions.
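For example, the following sketch (assuming 32-byte-aligned arrays and n divisible by 8) adds two vectors of floats with AVX intrinsics; each call corresponds to a single SIMD instruction:

#include <immintrin.h>

void vadd8(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);             // aligned 8-float load
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(c + i, _mm256_add_ps(va, vb)); // vaddps + aligned store
    }
}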
SDLT (SIMD Data Layout Template) is a C++11 template library which provides
containers with SIMD-friendly data layouts [6]. The use of such containers allows
for a transparent transition of data structures of the Array of Structures (AOS) type to
Structure of Arrays (SOA) or Arrays of Structure of Arrays (ASA) forms. Such con-
versions can improve vectorization and increase the efficiency of programs executed
on modern processors.
Now we will present two exemplary problems which can be easily parallelized and
optimized using the considered language-based tools. All implementations have been
tested on a server with two Intel Xeon E5-2670 v3 processors (24 cores in total with
hyperthreading, 2.3 GHz) and 128 GB RAM, with an Intel Xeon Phi 7120P coprocessor (61
cores with multithreading, 1.238 GHz, 16 GB RAM), running under CentOS 6.5 with
Intel Parallel Studio 2017, whose C/C++ compiler supports Cilk Plus, TBB and
SDLT. Experiments on Xeon Phi have been carried out using its native mode.
Let us consider the following recursive method for numerical integration called Adaptive
Simpson's Rule [7]. We want to find the approximation of $I(f)=\int_a^b f(x)\,dx$
with a user-specified tolerance $\varepsilon$. Let $S(a,b)=\frac{h}{6}(f(a)+4f(c)+f(b))$, where
$h=b-a$ and $c$ is the midpoint of the interval $[a,b]$. The method applies Simpson's
rule to the halves of the interval in a recursive manner until the stopping criterion
$\frac{1}{15}|S(a,c)+S(c,b)-S(a,b)|<\varepsilon$ is reached [10].
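The paper's listings of the three parallel implementations are not reproduced here; the following is a minimal sketch of how a Cilk Plus version along the lines of cilkAS() might look (the exact structure is our assumption):

#include <cmath>
#include <cilk/cilk.h>

// S(a,b) of Simpson's rule
static double simpson(double (*f)(double), double a, double b) {
    double c = 0.5 * (a + b);
    return (b - a) / 6.0 * (f(a) + 4.0 * f(c) + f(b));
}

// Recursive adaptive step: whole = S(a,b); depth bounds the recursion.
static double cilkAS(double (*f)(double), double a, double b,
                     double eps, double whole, int depth) {
    double c = 0.5 * (a + b);
    double left = simpson(f, a, c), right = simpson(f, c, b);
    // Stopping criterion: |S(a,c)+S(c,b)-S(a,b)|/15 < eps
    if (depth <= 0 || std::fabs(left + right - whole) < 15.0 * eps)
        return left + right + (left + right - whole) / 15.0;
    // The left half runs asynchronously; the right half in the current worker.
    double l = _Cilk_spawn cilkAS(f, a, c, 0.5 * eps, left, depth - 1);
    double r = cilkAS(f, c, b, 0.5 * eps, right, depth - 1);
    _Cilk_sync;
    return l + r;
}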
Table 1 shows the execution time of our three parallel implementations applied for
finding the approximation of $\int_{-4.4}^{4.4}\exp(x^2)\,dx$ with $\varepsilon=10^{-7}$ and depth $=40$. We
can observe that cilkAS() outperforms ompAS() significantly (about four times
faster for CPU and three times for Xeon Phi). cilkAS() is also about 2.0× faster than
tbbAS(). It should be noticed that the execution time (s) of the sequential version of
the method is 62.8 for CPU and 638.04 for Xeon Phi. Thus, the speedup achieved by
our Cilk implementation is 14.35 (CPU) and 70.66 (Xeon Phi), respectively. The Cilk
version scales very well when the number of Cilk workers increases up to 24 for CPU
and 60 for Xeon Phi, respectively, i.e., to the number of physical cores. The further
increase in the number of workers results in smaller and rather marginal gains.
Table 1 Execution time (s) of cilkAS(), ompAS() and tbbAS() for $\int_{-4.4}^{4.4}\exp(x^2)\,dx$

2× E5-2670
Number of threads/workers        2        4        6       12       24       48
cilkAS()                     61.99    31.06    20.64    10.57     5.39     4.32
ompAS()                     202.71   101.43    68.03    34.36    17.43    15.45
tbbAS()                     141.98    71.64    48.29    25.46    12.21     9.37

Xeon Phi 7120P (native mode)
Number of threads/workers        2       30       60      120      180      240
cilkAS()                    478.11    32.44    16.71    10.51     9.33     9.03
ompAS()                    1355.67    92.60    45.57    31.13    29.22    28.52
tbbAS()                     839.71    55.32    27.74    18.94    18.26    17.46
The most common basic implementations of the Bellman–Ford algorithm assume AOS (i.e., Array
of Structures) representations of graphs. This means that a graph is represented as an array
that describes its vertices. Each vertex is described by an array containing information
about incoming arcs. Each arc is represented by its initial vertex and the arc's weight. It is
also necessary to store the length of the arrays describing vertices. In order to parallelize
such a basic implementation using OpenMP (see Listing 4), we should notice that
the entire algorithm should be placed within the parallel construct. The loops 5–10 and
15–23 can be parallelized using for constructs. The assignment in line 12 needs to be
a single task (i.e., defined by the single construct). Moreover, we need two copies
of the array D for storing the current and previous updates within each iteration of the
loop 13–23. The innermost loop 18–19 is automatically vectorized by the compiler. For
the sake of simplicity, we also assume that the vertex labeled 0 is the source.
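Listing 4 itself is not reproduced here; the following sketch (with assumed data structures and names, and without matching the line numbering referenced above) illustrates this parallelization strategy:

#include <algorithm>
#include <vector>

struct Arc { int src; float w; };        // incoming arc: initial vertex and weight
struct Vertex { std::vector<Arc> in; };  // AOS: each vertex owns its arcs

void bf1(const std::vector<Vertex> &g, std::vector<float> &dist) {
    int n = (int)g.size();
    std::vector<float> prev(n);          // second copy of D for previous updates
    #pragma omp parallel                 // the entire algorithm in one region
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            dist[i] = 1.0e20f;           // "infinity"
        #pragma omp single
        dist[0] = 0.0f;                  // vertex 0 is the source
        for (int k = 1; k < n; k++) {    // n-1 relaxation rounds
            #pragma omp single
            prev = dist;                 // executed by a single thread
            #pragma omp for
            for (int i = 0; i < n; i++) {
                float d = prev[i];
                for (const Arc &a : g[i].in)   // innermost, vectorizable loop
                    d = std::min(d, prev[a.src] + a.w);
                dist[i] = d;
            }
        }
    }
}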
In order to introduce SIMD-friendly memory layout for better utilization of Intel
Advanced Vector Extensions, we assume that each vertex of a given graph is repre-
sented by two arrays of the same size. The first one (i.e., inv), sorted in increasing
order, contains the labels of the initial vertices of incoming arcs. The second one (i.e., inw) stores
the weights of the corresponding arcs. Thus, such a representation of graphs is of the SOA (i.e.,
Structure of Arrays) type. These arrays should be allocated using _mm_malloc()
to ensure proper memory alignment [17]. Listing 5 shows the innermost loop of
another of our implementations of the algorithm. Note that the loop can be replaced with the
corresponding array expression (Listing 6). Listing 7 presents another possible modification:
the innermost loop is replaced with a simple call to the function vecmin(),
which uses AVX2 intrinsics explicitly. Note that the first loop works on 8-element
vectors, while the second one processes the remainder of the input data sequentially.
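Since Listings 5 and 6 are not reproduced here, the following sketch (with the assumed names inv, inw, d and len) shows the scalar SOA loop and its array-notation counterpart:

#include <algorithm>

// Listing 5 style: a scalar reduction loop the compiler can vectorize.
float rowmin(int len, const float *d, const int *inv, const float *inw) {
    float dmin = 1.0e20f;
    for (int j = 0; j < len; j++)
        dmin = std::min(dmin, d[inv[j]] + inw[j]);
    return dmin;
}

// Listing 6 style: the same reduction in Cilk Plus array notation;
// d[inv[0:len]] gathers d at the indices held in inv[0:len].
float rowmin_an(int len, const float *d, const int *inv, const float *inw) {
    return __sec_reduce_min(d[inv[0:len]] + inw[0:len]);
}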
Listing 8 presents the most interesting improvements. It uses the SDLT template
library to move from AOS to SOA transparently. Each arc is represented as in
our first implementation. By the use of SDLT_PRIMITIVE, the data structure
Arc is declared as a primitive and its data members are identified. Then we can
define a data structure for arcs: an array of containers of the type
sdlt::soa1d_container<Arc>. Internally, such containers are represented as
SOA with elements aligned properly to improve the efficiency of AVX instructions.
Listing 7 The innermost loop of Algorithm 1: SOA, intrinsics for AVX2

#include <algorithm>
#include <immintrin.h>

float vecmin(int n, float *d, int *inv, float *inw)
{
    float minv, t[8];
    float *pinw = inw;
    int *pinv = inv;
    __m256 xmin = _mm256_set1_ps(1.0e+20f);               /* running minima */
    for (int i = 0; i < (n / 8) * 8; i += 8) {
        __m256i idx = _mm256_load_si256((__m256i *)pinv); /* 8 vertex labels */
        __m256 x1 = _mm256_i32gather_ps(d, idx, 4);       /* gather d[inv[...]] */
        __m256 xw = _mm256_load_ps(pinw);                 /* 8 arc weights */
        __m256 x2;
        x1 = _mm256_add_ps(xw, x1);                       /* d[inv[.]] + inw[.] */
        x2 = _mm256_permute_ps(x1, 0x39);                 /* horizontal minimum */
        x1 = _mm256_min_ps(x1, x2);                       /* of the 8 lanes     */
        x2 = _mm256_permute_ps(x1, 0x4E);
        x1 = _mm256_min_ps(x1, x2);
        x2 = _mm256_permute2f128_ps(x1, x1, 0x1);
        x1 = _mm256_min_ps(x1, x2);
        xmin = _mm256_min_ps(xmin, x1);                   /* merge into xmin */
        pinw += 8;
        pinv += 8;
    }
    _mm256_store_ps(t, xmin);
    minv = t[0];                           /* every lane now holds the minimum */
    for (int i = (n / 8) * 8; i < n; i++)  /* sequential remainder */
        minv = std::min(minv, d[inv[i]] + inw[i]);
    return minv;
}
Vectorization of the loop is enforced by the use of "#pragma omp simd
reduction." To ensure that the compiler inlines the methods x() and weight(), we
use "#pragma forceinline."
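Listing 8 is not reproduced here; a sketch of its key ingredients (with plain data members instead of the accessor methods x() and weight(), and an assumed function name) might look as follows:

#include <algorithm>
#include <sdlt/sdlt.h>

struct Arc { int x; float weight; };
SDLT_PRIMITIVE(Arc, x, weight)           // declare Arc as an SDLT primitive

float vertexmin(const sdlt::soa1d_container<Arc> &arcs, const float *d) {
    float dmin = 1.0e20f;
    auto a = arcs.const_access();        // SIMD-friendly SOA view of the arcs
    int len = arcs.size();
    #pragma omp simd reduction(min:dmin)
    for (int j = 0; j < len; j++) {
        Arc arc = a[j];                  // element loaded from the SOA layout
        dmin = std::min(dmin, d[arc.x] + arc.weight);
    }
    return dmin;
}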
Now let us consider the results of experiments performed to compare the five con-
sidered implementations of the Bellman–Ford algorithm: BF1 (basic using AOS), BF2
(using SOA with the most inner loop vectorized by compiler), BF3 (using SOA
and the array notation), BF4 (using SOA and intrinsics) and BF5 (using SDLT).
All results have been obtained for graphs generated randomly for a given number of
vertices and a maximum degree (i.e., the maximum number of incoming arcs). Fig-
ure 1 shows the speedup of five parallel implementations with outer loops parallelized
using schedule(static) against the sequential version of BF1. We can observe
that the parallel implementations are much faster than the basic sequential implemen-
tation. For sufficiently large graphs, all parallel implementations utilize multiple cores
achieving reasonable speedup. This happens when the vectorized loops are sufficiently long.
Indeed, the speedup grows as the maximum degree (i.e., the length of the arrays)
grows. Usually BF5 is much faster than the other parallel versions. BF2, BF3 and BF4
outperform BF1 for larger and wider graphs, and their performance is comparable;
however, BF4 is slightly faster on Xeon Phi. Unexpectedly, for Xeon E5-2670, the
performance of BF5 drops for larger graphs. Such behavior has not been observed
for Xeon Phi. For this platform, the implementation using SDLT always achieves the
best performance.
Figure 2 presents the execution time of the SDLT version of the algorithm paral-
lelized using OpenMP (with the "static" and "dynamic,ChS" values of the schedule
clause, respectively, where ChS denotes the chunk size), TBB and Cilk Plus. In the case
of our OpenMP implementations, Fig. 2 shows the best results chosen after several tests
for various values of ChS; thus, our OpenMP "dynamic" version has been manually tuned.
We have observed that the best performance is achieved when the value of ChS is about
40 for Xeon E5-2670 and 20 for Xeon Phi. In the case of TBB and Cilk Plus, the runtime
system has been responsible for load balancing. The parallel loops in the Cilk version
have been parallelized using the _Cilk_for construct. In the case of TBB, we have
used the tbb::parallel_for template. It should be noticed that TBB requires more
changes in the source code than OpenMP and Cilk Plus. The use of Cilk seems to be the easiest.
We can observe that in almost all cases the OpenMP "static" version achieves
the best performance. In the case of Xeon Phi, the situation is slightly different: for
smaller graphs the OpenMP "dynamic" version is clearly better. Both OpenMP versions
outperform TBB and Cilk Plus significantly. Usually, the Cilk version achieves the
worst performance. Generally, the use of technologies utilizing dynamic allocation
of tasks (TBB and Cilk Plus) does not pay off for this data-parallel problem.
Fig. 1 Speedup of the considered implementations against the sequential version of BF1 (static schedule
of for loops)
4 Conclusions
We have compared OpenMP, TBB and Cilk Plus as basic language-based tools for
simple and efficient parallelization of computational problems. We have shown that
Cilk Plus can be very easily applied to parallelize the recursively defined Adaptive Simp-
son's Integration Rule, and such an implementation can also utilize coprocessors such
as Intel Xeon Phi. Unexpectedly, the OpenMP implementation using tasks achieves
much worse performance. The efficiency of the TBB implementation is also worse
than that of Cilk Plus, but TBB is still better than OpenMP.
Fig. 2 Execution time of the SDLT version of the algorithm parallelized using OpenMP (static and
dynamic), TBB and Cilk Plus
Acknowledgements The use of computer resources installed at Institute of Mathematics, Maria Curie-
Skłodowska University, Lublin, is kindly acknowledged.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 Interna-
tional License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution,
and reproduction in any medium, provided you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Allen R, Kennedy K (2001) Optimizing compilers for modern architectures: a dependence-based
approach. Morgan Kaufmann, Burlington
2. Cameron M (2010) Adaptive integration. http://www2.math.umd.edu/~mariakc/teaching/adaptive.pdf
3. Chandra R, Dagum L, Kohr D, Maydan D, McDonald J, Menon R (2001) Parallel programming in
OpenMP. Morgan Kaufmann Publishers, San Francisco
4. Cormen T, Leiserson C, Rivest R (1994) Introduction to algorithms. MIT Press, Cambridge
5. Jeffers J, Reinders J (2013) Intel Xeon Phi coprocessor high-performance programming. Morgan
Kaufmann, Waltham
6. Jeffers J, Reinders J, Sodani A (2016) Intel Xeon Phi processor high-performance programming.
Knights Landing edition. Morgan Kaufmann, Cambridge
7. Kuncir GF (1962) Algorithm 103: Simpson’s rule integrator. Commun ACM 5(6):347. https://doi.org/
10.1145/367766.368179
8. Leiserson CE (2011) Cilk. In: Padua DA (ed) Encyclopedia of parallel computing. Springer, Berlin,
pp 273–288. https://doi.org/10.1007/978-0-387-09766-4_2339
9. Leist A, Gilman A (2014) A comparative analysis of parallel programming models for C++. In:
Proceedings of ICCGI 2014: The Ninth International Multi-Conference on Computing in the Global
Information Technology, IARIA, pp 121–127
10. Lyness JN (1969) Notes on the adaptive Simpson quadrature routine. J ACM 16(3):483–495. https://
doi.org/10.1145/321526.321537
11. Marowka A (2007) Parallel computing on any desktop. Commun ACM 50(9):74–78. https://doi.org/
10.1145/1284621.1284622
12. Marowka A (2012) TBBench: a micro-benchmark suite for Intel Threading Building Blocks. J Inf Process
Syst 8(2):331–346. https://doi.org/10.3745/JIPS.2012.8.2.331
13. Rahman R (2013) Intel Xeon Phi coprocessor architecture and tools: the guide for application devel-
opers. Apress, Berkeley
14. Robison AD (2013) Composable parallel patterns with Intel Cilk Plus. Comput Sci Eng 15(2):66–71.
https://doi.org/10.1109/MCSE.2013.21
15. Stpiczyński P (2016) Semiautomatic acceleration of sparse matrix-vector product using OpenACC. In:
Parallel Processing and Applied Mathematics, 11th International Conference, PPAM 2015, Krakow,
Poland, September 6–9, 2015, Revised Selected Papers, Part II, Springer, Lecture Notes in Computer
Science, vol 9574, pp 143–152. http://dx.doi.org/10.1007/978-3-319-32152-3_14
16. Stpiczyński P (2018) Efficient language-based parallelization of computational problems using Cilk
Plus. In: Parallel Processing and Applied Mathematics, 12th International Conference, PPAM 2017,
Lublin, Poland, September 10–13, 2017 (accepted)
17. Supalov A, Semin A, Klemm M, Dahnken C (2014) Optimizing HPC applications with Intel cluster
tools. Apress, Berkeley
18. van der Pas R, Stotzer E, Terboven C (2017) Using OpenMP—the next step. Affinity, accelerators,
tasking, and SIMD. MIT Press, Cambridge