Language-Based Vectorization and Parallelization Using Intrinsics, OpenMP, TBB and Cilk Plus
https://doi.org/10.1007/s11227-017-2231-3
Przemysław Stpiczyński
Abstract The aim of this paper is to evaluate OpenMP, TBB and Cilk Plus as basic
language-based tools for simple and efficient parallelization of recursively defined
computational problems and other problems that need both task and data parallelization
techniques. We show how to use these models of parallel programming to transform
the source code of Adaptive Simpson's Integration into programs that can utilize multiple
cores of modern processors. Using the example of the Bellman–Ford algorithm for solving
single-source shortest path problems, we show how to improve the performance of data
parallel algorithms by tuning data structures for better utilization of vector extensions
of modern processors. Manual vectorization techniques based on Cilk array notation
and intrinsics are presented. We also show how to simplify such optimization using
Intel SIMD Data Layout Template containers.
1 Introduction
Recently, multicore and manycore computer architectures have become very attractive
for achieving high-performance execution of scientific applications at relatively low
costs [5,13,17]. Modern CPUs and accelerators achieve performance that until recently
was attainable only on supercomputers. Unfortunately, the process of adapting existing software to
such new architectures can be difficult if we expect to achieve reasonable performance
without putting much effort into software development. However, the use of
high-level language-based programming interfaces devoted to parallel programming
can sometimes yield satisfactory results with rather little effort [15].
The software development process for modern Intel multicore CPUs and manycore
coprocessors such as Xeon Phi [5,13] requires special optimization techniques to
obtain code that utilizes the power of the underlying hardware. Usually it is not
sufficient merely to parallelize applications: for such computer architectures, efficient
vectorization is crucial for achieving satisfactory performance [5,17]. Unfortunately, very
often compiler-based automatic vectorization is not possible because of some non-
obvious data dependencies inside loops [1]. Moreover, the performance of vectorized
programs can be improved by the use of proper memory layout. On the other hand,
people expect parallel programming to be easy. The use of simple and powerful pro-
gramming constructs that can utilize underlying hardware is highly desired.
Intel C/C++ compilers and development tools offer many language-based exten-
sions that can be used to simplify the process of developing high-performance parallel
programs [6,17]. OpenMP [3,18] is the most popular, but one can consider using
Threading Building Blocks (TBB for short) [6,12] or Cilk Plus [5,13]. More sophis-
ticated language-based optimization can be done using intrinsics, which allow
programmers to utilize Intel Advanced Vector Extensions (i.e., SIMD extensions)
explicitly [6]. The SDLT template library can be applied to introduce a SIMD-friendly
memory layout transparently [6].
In this paper, we evaluate OpenMP, Intel TBB and Cilk Plus as language-based tools
for simple and efficient parallelization of recursively defined computational problems
and other problems that need both task and data parallelization techniques. We also
advise how to improve the performance of such algorithms by tuning data structures
manually and automatically using SDLT. We further explain how to explicitly utilize
vector units of modern multicore and manycore processors using intrinsics. We show
how to parallelize recursively defined Adaptive Simpson’s Integration Rule [7] using
OpenMP, Intel TBB and Cilk Plus [9], and we consider various implementations of
the Bellman–Ford algorithm for solving the single-source shortest path problem [4] and
examine their performance. These two computational problems have been chosen to
demonstrate the most important features of the considered language-based tools.
TBB (Threading Building Blocks) is a C++ template library supporting task paral-
lelism on Intel multicore platforms [5,12]. It provides a rich collection of components
for parallel programming, and a scheduler which manages and schedules threads to
execute parallel tasks. TBB also provides low-level services for memory allocation and
atomic operations. TBB programs can be combined with OpenMP pragmas specifying
compiler-supported vectorization; thus, they can exploit both task and data parallelism.
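As an illustration, consider a minimal sketch (our assumption, not code taken from the paper) of a tbb::parallel_for loop whose body is vectorized with an OpenMP simd pragma:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Scale a vector in parallel: TBB partitions the index range into tasks,
// and the compiler vectorizes the inner loop of each task.
void scale(float *y, const float *x, int n, float a) {
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [=](const tbb::blocked_range<int> &r) {
            #pragma omp simd
            for (int i = r.begin(); i < r.end(); i++)
                y[i] = a * x[i];
        });
}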
Cilk Plus offers several powerful extensions to C/C++ that allow programmers to express both task
and data parallelism [5,8,13,14]. The most important constructs specify
and handle possible parallel execution of tasks. _Cilk_for followed by the body of
a for loop indicates that iterations of the loop can be executed in parallel. The runtime applies
a divide-and-conquer approach to schedule tasks among active workers to ensure
a balanced workload across the available cores. _Cilk_spawn permits a given function to be
executed asynchronously with the rest of the calling function. _Cilk_sync specifies that
all tasks spawned in a function must complete before execution continues. Another
important feature of Cilk Plus is the array notation which introduces vectorized oper-
ations on arrays. Expression A[start:len:stride] represents an array section
of length len starting from A[start] with the given stride. An omitted stride
defaults to 1. The operator [:] can be used on both static and dynamic arrays. There
are also several built-in functions to perform basic computations among elements in
an array such as sum, min and max. It should be noticed that the array notation can
also be used for array indices. For example, A[x[0:len]] denotes elements of the
array A given by indices from x[0:len]. Intel Cilk Plus also supports Shared Virtual
Memory, which allows data to be shared between the CPU and the coprocessor. It is well
suited to exchanging irregular data of limited size, when explicit synchronization is not
needed frequently [16].
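For instance, a minimal sketch (our illustration, not code from the paper) combining these task constructs with an array-notation reduction:

#include <cilk/cilk.h>

// Sum an array: spawn the left half, compute the right half in the
// current worker, then synchronize; small cases use array notation.
long tree_sum(const long *a, int n) {
    if (n < 1000)
        return __sec_reduce_add(a[0:n]);     // built-in reduction over a section
    long left = _Cilk_spawn tree_sum(a, n / 2);
    long right = tree_sum(a + n / 2, n - n / 2);
    _Cilk_sync;                              // wait for the spawned task
    return left + right;
}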
Intrinsics for SIMD instructions make it possible to take full advantage of Intel Advanced Vector
Extensions (AVX, AVX2, AVX-512), which cannot always be easily achieved due to
limitations of programming languages and compilers [6]. They allow programmers
to write constructs that look like C/C++ function calls corresponding to actual SIMD
instructions. Such calls are replaced with assembly code inlined directly into programs.
The disadvantage of this solution is the lack of code portability across different
versions of the vector extensions.
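For example, the following sketch (assuming 32-byte-aligned arrays and n divisible by 8) adds two vectors of floats with AVX intrinsics; each call corresponds to a single SIMD instruction:

#include <immintrin.h>

void vadd8(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);             // aligned 8-float load
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(c + i, _mm256_add_ps(va, vb)); // vaddps + aligned store
    }
}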
SDLT (SIMD Data Layout Template) is a C++11 template library which provides
containers with SIMD-friendly data layouts [6]. The use of such containers allows
for a transparent transition of data structures of the Array of Structures (AOS) type to
Structure of Arrays (SOA) or Arrays of Structure of Arrays (ASA) forms. Such con-
versions can improve vectorization and increase the efficiency of programs executed
on modern processors.
Now we will present two exemplary problems which can be easily parallelized and
optimized using the considered language-based tools. All implementations have been
tested on a server with two Intel Xeon E5-2670 v3 processors (24 cores in total with
hyperthreading, 2.3 GHz) and 128 GB RAM, with an Intel Xeon Phi 7120P coprocessor (61
cores with multithreading, 1.238 GHz, 16 GB RAM), running under CentOS 6.5 with
Intel Parallel Studio 2017, whose C/C++ compiler supports Cilk Plus, TBB and
SDLT. Experiments on Xeon Phi have been carried out using its native mode.
Let us consider the following recursive method for numerical integration called Adaptive
Simpson's Rule [7]. We want to find the approximation of $I(f)=\int_a^b f(x)\,dx$
with a user-specified tolerance $\varepsilon$. Let $S(a,b)=\frac{h}{6}(f(a)+4f(c)+f(b))$, where
$h=b-a$ and $c$ is the midpoint of the interval $[a,b]$. The method applies Simpson's
rule to the halves of the interval in a recursive manner until the stopping criterion
$\frac{1}{15}|S(a,c)+S(c,b)-S(a,b)|<\varepsilon$ is reached [10].
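The paper's listings of the three parallel implementations are not reproduced here; the following is a minimal sketch of how a Cilk Plus version along the lines of cilkAS() might look (the exact structure is our assumption):

#include <cmath>
#include <cilk/cilk.h>

// S(a,b) of Simpson's rule
static double simpson(double (*f)(double), double a, double b) {
    double c = 0.5 * (a + b);
    return (b - a) / 6.0 * (f(a) + 4.0 * f(c) + f(b));
}

// Recursive adaptive step: whole = S(a,b); depth bounds the recursion.
static double cilkAS(double (*f)(double), double a, double b,
                     double eps, double whole, int depth) {
    double c = 0.5 * (a + b);
    double left = simpson(f, a, c), right = simpson(f, c, b);
    // Stopping criterion: |S(a,c)+S(c,b)-S(a,b)|/15 < eps
    if (depth <= 0 || std::fabs(left + right - whole) < 15.0 * eps)
        return left + right + (left + right - whole) / 15.0;
    // The left half runs asynchronously; the right half in the current worker.
    double l = _Cilk_spawn cilkAS(f, a, c, 0.5 * eps, left, depth - 1);
    double r = cilkAS(f, c, b, 0.5 * eps, right, depth - 1);
    _Cilk_sync;
    return l + r;
}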
Table 1 shows the execution time of our three parallel implementations applied for
finding the approximation of $\int_{-4.4}^{4.4}\exp(x^2)\,dx$ with $\varepsilon=10^{-7}$ and depth $=40$. We
can observe that cilkAS() outperforms ompAS() significantly (about four times
faster for CPU and three times for Xeon Phi). cilkAS() is also about 2.0× faster than
tbbAS(). It should be noticed that the execution time (s) of the sequential version of
the method is 62.8 for CPU and 638.04 for Xeon Phi. Thus, the speedup achieved by
our Cilk implementation is 14.35 (CPU) and 70.66 (Xeon Phi), respectively. The Cilk
version scales very well when the number of Cilk workers increases up to 24 for CPU
and 60 for Xeon Phi, respectively, i.e., to the number of physical cores. The further
increase in the number of workers results in smaller and rather marginal gains.
Table 1 Execution time (s) of cilkAS(), ompAS() and tbbAS() for $\int_{-4.4}^{4.4}\exp(x^2)\,dx$

2× E5-2670
Number of threads/workers        2        4        6       12       24       48
cilkAS()                     61.99    31.06    20.64    10.57     5.39     4.32
ompAS()                     202.71   101.43    68.03    34.36    17.43    15.45
tbbAS()                     141.98    71.64    48.29    25.46    12.21     9.37

Xeon Phi 7120P (native mode)
Number of threads/workers        2       30       60      120      180      240
cilkAS()                    478.11    32.44    16.71    10.51     9.33     9.03
ompAS()                    1355.67    92.60    45.57    31.13    29.22    28.52
tbbAS()                     839.71    55.32    27.74    18.94    18.26    17.46
The most common basic implementations of the Bellman–Ford algorithm assume AOS (i.e., Array
of Structures) representations of graphs. This means that a graph is represented as an array
that describes its vertices. Each vertex is described by an array containing information
about incoming arcs. Each arc is represented by its initial vertex and the arc's weight. It is
also necessary to store the length of the arrays describing vertices. In order to parallelize
such a basic implementation using OpenMP (see Listing 4), we should notice that
the entire algorithm should be placed within the parallel construct. The loops 5–10 and
15–23 can be parallelized using for constructs. The assignment in line 12 needs to be
a single task (i.e., defined by the single construct). Moreover, we need two copies
of the array D for storing the current and previous updates within each iteration of the
loop 13–23. The innermost loop 18–19 is automatically vectorized by the compiler. For
the sake of simplicity, we also assume that the vertex labeled 0 is the source.
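Listing 4 itself is not reproduced here; the following sketch (with assumed data structures and names, and without matching the line numbering referenced above) illustrates this parallelization strategy:

#include <algorithm>
#include <vector>

struct Arc { int src; float w; };        // incoming arc: initial vertex and weight
struct Vertex { std::vector<Arc> in; };  // AOS: each vertex owns its arcs

void bf1(const std::vector<Vertex> &g, std::vector<float> &dist) {
    int n = (int)g.size();
    std::vector<float> prev(n);          // second copy of D for previous updates
    #pragma omp parallel                 // the entire algorithm in one region
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            dist[i] = 1.0e20f;           // "infinity"
        #pragma omp single
        dist[0] = 0.0f;                  // vertex 0 is the source
        for (int k = 1; k < n; k++) {    // n-1 relaxation rounds
            #pragma omp single
            prev = dist;                 // executed by a single thread
            #pragma omp for
            for (int i = 0; i < n; i++) {
                float d = prev[i];
                for (const Arc &a : g[i].in)   // innermost, vectorizable loop
                    d = std::min(d, prev[a.src] + a.w);
                dist[i] = d;
            }
        }
    }
}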
In order to introduce SIMD-friendly memory layout for better utilization of Intel
Advanced Vector Extensions, we assume that each vertex of a given graph is repre-
sented by two arrays of the same size. The first one (i.e., inv), sorted in increasing
order, contains the labels of the initial vertices of incoming arcs. The second one (i.e., inw) stores
the weights of the corresponding arcs. Thus, such a representation of graphs is of the SOA (i.e.,
Structure of Arrays) type. These arrays should be allocated using _mm_malloc()
to ensure proper memory alignment [17]. Listing 5 shows the innermost loop of
another of our implementations of the algorithm. Note that the loop can be replaced with the
corresponding array expression (Listing 6). Listing 7 presents another possible modification:
the innermost loop is replaced with a simple call to the function vecmin(),
which uses AVX2 intrinsics explicitly. Note that the first loop works on 8-element
vectors, while the second one processes the remainder of the input data sequentially.
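Since Listings 5 and 6 are not reproduced here, the following sketch (with the assumed names inv, inw, d and len) shows the scalar SOA loop and its array-notation counterpart:

#include <algorithm>

// Listing 5 style: a scalar reduction loop the compiler can vectorize.
float rowmin(int len, const float *d, const int *inv, const float *inw) {
    float dmin = 1.0e20f;
    for (int j = 0; j < len; j++)
        dmin = std::min(dmin, d[inv[j]] + inw[j]);
    return dmin;
}

// Listing 6 style: the same reduction in Cilk Plus array notation;
// d[inv[0:len]] gathers d at the indices held in inv[0:len].
float rowmin_an(int len, const float *d, const int *inv, const float *inw) {
    return __sec_reduce_min(d[inv[0:len]] + inw[0:len]);
}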
Listing 8 presents the most interesting improvements. It uses the SDLT template
library to move from AOS to SOA transparently. Each arc is represented as in
our first implementation. By the use of SDLT_PRIMITIVE, the data structure
Arc is declared as a primitive and its data members are identified. Then we can
define a data structure for arcs: an array of containers of the type
sdlt::soa1d_container<Arc>. Internally, such containers are represented as
SOA with elements aligned properly to improve the efficiency of AVX instructions.
Listing 7 The innermost loop of Algorithm 1: SOA, intrinsics for AVX2

#include <algorithm>
#include <immintrin.h>

float vecmin(int n, float *d, int *inv, float *inw)
{
    float minv, t[8];
    float *pinw = inw;
    int *pinv = inv;
    __m256 xmin = _mm256_set1_ps(1.0e+20f);               /* running minima */
    for (int i = 0; i < (n / 8) * 8; i += 8) {
        __m256i idx = _mm256_load_si256((__m256i *)pinv); /* 8 vertex labels */
        __m256 x1 = _mm256_i32gather_ps(d, idx, 4);       /* gather d[inv[...]] */
        __m256 xw = _mm256_load_ps(pinw);                 /* 8 arc weights */
        __m256 x2;
        x1 = _mm256_add_ps(xw, x1);                       /* d[inv[.]] + inw[.] */
        x2 = _mm256_permute_ps(x1, 0x39);                 /* horizontal minimum */
        x1 = _mm256_min_ps(x1, x2);                       /* of the 8 lanes     */
        x2 = _mm256_permute_ps(x1, 0x4E);
        x1 = _mm256_min_ps(x1, x2);
        x2 = _mm256_permute2f128_ps(x1, x1, 0x1);
        x1 = _mm256_min_ps(x1, x2);
        xmin = _mm256_min_ps(xmin, x1);                   /* merge into xmin */
        pinw += 8;
        pinv += 8;
    }
    _mm256_store_ps(t, xmin);
    minv = t[0];                           /* every lane now holds the minimum */
    for (int i = (n / 8) * 8; i < n; i++)  /* sequential remainder */
        minv = std::min(minv, d[inv[i]] + inw[i]);
    return minv;
}
Vectorization of the loop is enforced by the use of "#pragma omp simd
reduction." To ensure that the compiler inlines the methods x() and weight(), we
use "#pragma forceinline."
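Listing 8 is not reproduced here; a sketch of its key ingredients (with plain data members instead of the accessor methods x() and weight(), and an assumed function name) might look as follows:

#include <algorithm>
#include <sdlt/sdlt.h>

struct Arc { int x; float weight; };
SDLT_PRIMITIVE(Arc, x, weight)           // declare Arc as an SDLT primitive

float vertexmin(const sdlt::soa1d_container<Arc> &arcs, const float *d) {
    float dmin = 1.0e20f;
    auto a = arcs.const_access();        // SIMD-friendly SOA view of the arcs
    int len = arcs.size();
    #pragma omp simd reduction(min:dmin)
    for (int j = 0; j < len; j++) {
        Arc arc = a[j];                  // element loaded from the SOA layout
        dmin = std::min(dmin, d[arc.x] + arc.weight);
    }
    return dmin;
}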
Now let us consider the results of experiments performed to compare the five con-
sidered implementations of the Bellman–Ford algorithm: BF1 (basic using AOS), BF2
(using SOA with the most inner loop vectorized by compiler), BF3 (using SOA
and the array notation), BF4 (using SOA and intrinsics) and BF5 (using SDLT).
All results have been obtained for graphs generated randomly for a given number of
vertices and a maximum degree (i.e., the maximum number of incoming arcs). Fig-
ure 1 shows the speedup of five parallel implementations with outer loops parallelized
using schedule(static) against the sequential version of BF1. We can observe
that the parallel implementations are much faster than the basic sequential implemen-
tation. For sufficiently large graphs, all parallel implementations utilize multiple cores
achieving reasonable speedup. This happens when the vectorized loops are sufficiently long.
Indeed, the speedup grows as the maximum degree (i.e., the length of the arrays)
grows. Usually BF5 is much faster than the other parallel versions. BF2, BF3 and BF4
outperform BF1 for larger and wider graphs, and their performance is comparable;
however, BF4 is slightly faster on Xeon Phi. Unexpectedly, for Xeon E5-2670, the
performance of BF5 drops for larger graphs. Such behavior has not been observed
for Xeon Phi. For this platform, the implementation using SDLT always achieves the
best performance.
Figure 2 presents the execution time of the SDLT version of the algorithm paral-
lelized using OpenMP (with the "static" and "dynamic,ChS" values of the schedule
clause, respectively, where ChS denotes the chunk size), TBB and Cilk Plus. In the case
of our OpenMP implementations, Fig. 2 shows the best results chosen after several tests
for various values of ChS; thus, our OpenMP "dynamic" version has been manually tuned.
We have observed that the best performance is achieved when the value of ChS is about
40 for Xeon E5-2670 and 20 for Xeon Phi. In the case of TBB and Cilk Plus, the runtime
system has been responsible for load balancing. The parallel loops in the Cilk version
have been parallelized using the _Cilk_for construct. In the case of TBB, we have
used the tbb::parallel_for template. It should be noticed that TBB requires more
changes in the source code than OpenMP and Cilk Plus. The use of Cilk seems to be the easiest.
We can observe that in almost all cases the OpenMP "static" version achieves
the best performance. In the case of Xeon Phi, the situation is slightly different: for
smaller graphs the OpenMP "dynamic" version is clearly better. Both OpenMP versions
outperform TBB and Cilk Plus significantly. Usually, the Cilk version achieves the
worst performance. Generally, the use of technologies utilizing dynamic allocation
of tasks (TBB and Cilk Plus) does not pay off for this data-parallel problem.
Fig. 1 Speedup of the considered implementations against the sequential version of BF1 (static schedule
of for loops)
4 Conclusions
We have compared OpenMP, TBB and Cilk Plus as basic language-based tools for
simple and efficient parallelization of computational problems. We have shown that
Cilk Plus can be very easily applied to parallelize the recursively defined Adaptive Simp-
son's Integration Rule, and such an implementation can also utilize coprocessors such
as Intel Xeon Phi. Unexpectedly, the OpenMP implementation using tasks achieves
much worse performance. The efficiency of the TBB implementation is also worse
than that of Cilk Plus, but TBB is still better than OpenMP.
Fig. 2 Execution time of the SDLT version of the algorithm parallelized using OpenMP (static and
dynamic), TBB and Cilk Plus
Acknowledgements The use of computer resources installed at Institute of Mathematics, Maria Curie-
Skłodowska University, Lublin, is kindly acknowledged.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 Interna-
tional License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution,
and reproduction in any medium, provided you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Allen R, Kennedy K (2001) Optimizing compilers for modern architectures: a dependence-based
approach. Morgan Kaufmann, Burlington
2. Cameron M (2010) Adaptive integration. http://www2.math.umd.edu/~mariakc/teaching/adaptive.pdf
3. Chandra R, Dagum L, Kohr D, Maydan D, McDonald J, Menon R (2001) Parallel programming in
OpenMP. Morgan Kaufmann Publishers, San Francisco
4. Cormen T, Leiserson C, Rivest R (1994) Introduction to algorithms. MIT Press, Cambridge
5. Jeffers J, Reinders J (2013) Intel Xeon Phi coprocessor high-performance programming. Morgan
Kaufmann, Waltham
6. Jeffers J, Reinders J, Sodani A (2016) Intel Xeon Phi processor high-performance programming.
Knights Landing edition. Morgan Kaufmann, Cambridge
7. Kuncir GF (1962) Algorithm 103: Simpson’s rule integrator. Commun ACM 5(6):347. https://doi.org/
10.1145/367766.368179
8. Leiserson CE (2011) Cilk. In: Padua DA (ed) Encyclopedia of parallel computing. Springer, Berlin,
pp 273–288. https://doi.org/10.1007/978-0-387-09766-4_2339
9. Leist A, Gilman A (2014) A comparative analysis of parallel programming models for C++. In:
Proceedings of ICCGI 2014: The Ninth International Multi-Conference on Computing in the Global
Information Technology, IARIA, pp 121–127
10. Lyness JN (1969) Notes on the adaptive Simpson quadrature routine. J ACM 16(3):483–495. https://
doi.org/10.1145/321526.321537
11. Marowka A (2007) Parallel computing on any desktop. Commun ACM 50(9):74–78. https://doi.org/
10.1145/1284621.1284622
12. Marowka A (2012) TBBench: a micro-benchmark suite for Intel Threading Building Blocks. J Inf Process
Syst 8(2):331–346. https://doi.org/10.3745/JIPS.2012.8.2.331
13. Rahman R (2013) Intel Xeon Phi coprocessor architecture and tools: the guide for application devel-
opers. Apress, Berkeley
14. Robison AD (2013) Composable parallel patterns with Intel Cilk Plus. Comput Sci Eng 15(2):66–71.
https://doi.org/10.1109/MCSE.2013.21
15. Stpiczyński P (2016) Semiautomatic acceleration of sparse matrix-vector product using OpenACC. In:
Parallel Processing and Applied Mathematics, 11th International Conference, PPAM 2015, Krakow,
Poland, September 6–9, 2015, Revised Selected Papers, Part II, Springer, Lecture Notes in Computer
Science, vol 9574, pp 143–152. http://dx.doi.org/10.1007/978-3-319-32152-3_14
16. Stpiczyński P (2018) Efficient language-based parallelization of computational problems using Cilk
Plus. In: Parallel Processing and Applied Mathematics, 12th International Conference, PPAM 2017,
Lublin, Poland, September 10–13, 2017 (accepted)
17. Supalov A, Semin A, Klemm M, Dahnken C (2014) Optimizing HPC applications with Intel cluster
tools. Apress, Berkeley
18. van der Pas R, Stotzer E, Terboven C (2017) Using OpenMP—the next step. Affinity, accelerators,
tasking, and SIMD. MIT Press, Cambridge