A Novel Hybrid Quicksort Algorithm Vectorized Using AVX-512 On Intel Skylake
Berenger Bramas
Max Planck Computing and Data Facility (MPCDF)
Gießenbachstraße 2
85748 Garching, Germany
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 10, 2017

Abstract—The modern CPU's design, which is composed of hierarchical memory and SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of the AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort, based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two libraries of reference: the GNU C++ sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4.

Keywords—Quicksort; Bitonic; sort; vectorization; SIMD; AVX-512; Skylake

I. INTRODUCTION

Sorting is a fundamental problem in computer science that has always had the attention of the research community, because it is widely used to reduce the complexity of some algorithms. Moreover, sorting is a central operation in specific applications such as, but not limited to, database servers [1] and image rendering engines [2]. Therefore, having efficient sorting libraries on new architectures could potentially leverage the performance of a wide range of applications.

Vectorization, that is, the CPU's capability to apply a single instruction on multiple data (SIMD), improves continuously, one CPU generation after the other. While the difference between a scalar code and its vectorized equivalent was "only" a factor of 4 in the year 2000 (SSE), the difference is now up to a factor of 16 (AVX-512). Therefore, it is indispensable to vectorize a code to achieve high performance on modern CPUs, by using dedicated instructions and registers. The conversion of a scalar code into a vectorized equivalent is straightforward for many classes of algorithms and computational kernels, and it can even be done with auto-vectorization for some of them. However, the opportunity for vectorization is tied to the memory/data access patterns, such that data-processing algorithms (like sorting) usually require an important effort to be transformed. In addition, creating a fully vectorized implementation, without any scalar sections, is only possible and efficient if the instruction set provides the needed operations. Consequently, new instruction sets, such as the AVX-512, allow for the use of approaches that were not feasible previously.

The Intel Xeon Skylake (SKL) processor is the second CPU that supports AVX-512, after the Intel Knights Landing. The SKL supports the AVX-512 instruction set [13]: it supports Intel AVX-512 foundational instructions (AVX-512F), Intel AVX-512 conflict detection instructions (AVX-512CD), Intel AVX-512 byte and word instructions (AVX-512BW), Intel AVX-512 doubleword and quadword instructions (AVX-512DQ), and Intel AVX-512 vector length extensions instructions (AVX-512VL). The AVX-512 not only allows work on SIMD-vectors of double the size, compared to the previous AVX(2) set, it also provides various new operations.

Therefore, in the current paper, we focus on the development of new sorting strategies and their efficient implementation for the Intel Skylake using AVX-512. The contributions of this study are the following:

• proposing a new partitioning algorithm using AVX-512,
• defining a new Bitonic-sort variant for small arrays using AVX-512,
• implementing a new Quicksort variant using AVX-512.

All in all, we show how we can obtain a fast and vectorized sorting algorithm¹.

¹ The functions described in the current study are available at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort. This repository includes a clean header-only library (branch master) and a test file that generates the performance study of the current manuscript (branch paper). The code is under MIT license.

The rest of the paper is organized as follows: Section II gives background information related to vectorization and sorting. We then describe our approach in Section III, introducing our strategy for sorting small arrays, and the vectorized partitioning function, which are combined in our Quicksort variant. Finally, we provide performance details in Section IV and the conclusion in Section V.

II. BACKGROUND

A. Sorting Algorithms

1) Quicksort (QS) Overview: QS was originally proposed in [3]. It uses a divide-and-conquer strategy, by recursively
partitioning the input array, until it ends with partitions of one value. The partitioning puts values lower than a pivot at the beginning of the array, and greater values at the end, with a linear complexity. QS has a worst-case complexity of O(n²), but an average complexity of O(n log n) in practice.

Algorithm 1: SIMD Bitonic sort for one vector of double floating-point values.
Input: vec: a double floating-point AVX-512 vector to sort.
Output: vec: the vector sorted.
1 function simd_bitonic_sort_1v(vec)
2     compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
3     compare_and_exchange(vec, [4, 5, 6, 7, 0, 1, 2, 3])
4     compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
5     compare_and_exchange(vec, [0, 1, 2, 3, 4, 5, 6, 7])
6     compare_and_exchange(vec, [5, 4, 7, 6, 1, 0, 3, 2])
7     compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])

2) Sorting more than one SIMD-vector: The principle of using static permutation vectors to sort a single SIMD-vector can be applied to sort several SIMD-vectors. In addition, we can take advantage of the repetitive pattern of the Bitonic sorting network to re-use existing functions. More precisely, to sort V vectors, we re-use the function to sort V/2 vectors, and so on. We provide an example to sort two SIMD-vectors in Algorithm 2, where we start by sorting each SIMD-vector individually using the simd_bitonic_sort_1v function. Then, we compare and exchange values between both vectors (line 5), and finally apply the same operations on each vector individually (lines 6 to 11). In our sorting implementation, we provide the functions to sort up to 16 SIMD-vectors, which correspond to 256 integer values or 128 double floating-point values.

Algorithm 2: SIMD Bitonic sort for two vectors of double floating-point values.
Input: vec1 and vec2: two double floating-point AVX-512 vectors to sort.
Output: vec1 and vec2: the two vectors sorted, with vec1 lower or equal than vec2.
1 function simd_bitonic_sort_2v(vec1, vec2)
2     // Sort each vector using simd_bitonic_sort_1v
3     simd_bitonic_sort_1v(vec1)
4     simd_bitonic_sort_1v(vec2)
5     compare_and_exchange_2v(vec1, vec2, [0, 1, 2, 3, 4, 5, 6, 7])
6     compare_and_exchange(vec1, [3, 2, 1, 0, 7, 6, 5, 4])
7     compare_and_exchange(vec2, [3, 2, 1, 0, 7, 6, 5, 4])
8     compare_and_exchange(vec1, [5, 4, 7, 6, 1, 0, 3, 2])
9     compare_and_exchange(vec2, [5, 4, 7, 6, 1, 0, 3, 2])
10    compare_and_exchange(vec1, [6, 7, 4, 5, 2, 3, 0, 1])
11    compare_and_exchange(vec2, [6, 7, 4, 5, 2, 3, 0, 1])

3) Sorting small arrays: Each of our SIMD-Bitonic-sort functions is designed for a specific number of SIMD-vectors. However, we intend to sort arrays that do not have a size multiple of the SIMD-vector's length, because they are obtained from the partitioning stage of the QS. Consequently, when we have to sort a small array, we first load it into SIMD-vectors, and then, we pad the last vector with the greatest possible value. This guarantees that the padding values have no impact on the sorting results, by staying at the end of the last vector. The selection of the appropriate SIMD-Bitonic-sort function, the one that matches the size of the array to sort, can be done efficiently with a switch statement. In the following, we refer to this interface as the simd_bitonic_sort_wrapper function.

B. Partitioning with AVX-512

Algorithm 3 shows our strategy to develop a vectorized partitioning method. This algorithm is similar to a scalar partitioning function: there are iterators that start from both extremities of the array to keep track of where to load/store the values, and the process stops when some of these iterators meet. In its steady state, the algorithm loads an SIMD-vector using the left or right indexes (at lines 19 and 24), and partitions it using the partition_vec function (at line 27). The partition_vec function compares the input vector to the pivot vector (at line 47), and stores the values, lower or greater, directly in the array using a store-some instruction (at lines 51 and 55). The store-some is an AVX-512 instruction that we described in Section II-B1. The initialization of our algorithm starts by loading one vector from each of the array's extremities, to ensure that no values will be overwritten during the steady state (lines 12 and 16). This way, our implementation works in-place and only needs three SIMD-vectors. Algorithm 3 also includes, as side comments, possible optimizations in case the array is more likely to be already partitioned (A), or to reduce the data displacement of the values (B). The AVX-512 implementation of this algorithm is given in Appendix B (Code 2). One should note that we use a scalar partition function if there are less than 2 × VEC_SIZE values in the given array (line 3).

C. Quicksort Variant

Our QS is given in Algorithm 4, where we partition the data using the simd_partition function from Section III-B, and then sort the small partitions using the simd_bitonic_sort_wrapper function from Section III-A. The obtained algorithm is very similar to the scalar QS given in Appendix A.

D. Sorting Key/Value Pairs

The previous sorting methods are designed to sort an array of numbers. However, some applications need to sort key/value pairs. More precisely, the sort is applied to the keys, and the values contain extra information and could be pointers to arbitrary data structures, for example. Storing each key/value pair contiguously in memory is not adequate for vectorization, because it requires transforming the data. Therefore, in our approach, we store the keys and the values in two distinct arrays. To extend the SIMD-Bitonic-sort and SIMD-partition functions, we must ensure that the same permutations/moves are applied to the keys and the values. For the partition function, this is trivial: the same mask is used in combination with the store-some instruction for both arrays. For the Bitonic-based sort, we manually apply the permutations that were done on the vector of keys to the vector of values. To do so, we first save the vector of keys k before it is permuted, by a compare and exchange using the Bitonic permutation vector of indexes p, into k'. We compare k and k' to obtain a mask m that expresses what moves have been done. Then, we permute our vector of values v using p into v', and we select the correct values between v and v' using m. Consequently, we perform this operation at the end of the compare and exchange in all the Bitonic-based sorts.

IV. PERFORMANCE STUDY

A. Configuration

We assess our method on an Intel(R) Xeon(R) Platinum 8170 Skylake CPU at 2.10 GHz, with caches of sizes 32 KBytes, 1024 KBytes and 36608 KBytes, at levels L1, L2 and L3, respectively. The process and allocations are bound with numactl --physcpubind=0 --localalloc. We use the Intel
Algorithm 3: SIMD partitioning. VEC_SIZE is the number of values inside a SIMD-vector of the array's element type.
Input: array: an array to partition. length: the size of array. pivot: the reference value.
Output: array: the array partitioned. left_w: the index between the values lower and larger than the pivot.
1 function simd_partition(array, length, pivot)
2     // If too small, use scalar partitioning
3     if length ≤ 2 × VEC_SIZE then
4         scalar_partition(array, length)
5         return
6     end
7     // Set: fill a vector with all values equal to pivot
8     pivotvec = simd_set_from_one(pivot)
9     // Init iterators and save one vector on each extremity
10    left = 0
11    left_w = 0
12    left_vec = simd_load(array, left)
13    left = left + VEC_SIZE
14    right = length - VEC_SIZE
15    right_w = length
16    right_vec = simd_load(array, right)
17    while left + VEC_SIZE ≤ right do
18        if (left - left_w) ≤ (right_w - right) then
19            val = simd_load(array, left)
20            left = left + VEC_SIZE
21            // (B) Possible optimization: swap val and left_vec
22        else
23            right = right - VEC_SIZE
24            val = simd_load(array, right)
25            // (B) Possible optimization: swap val and right_vec
26        end
27        [left_w, right_w] = partition_vec(array, val, pivotvec, left_w, right_w)
28    end
29    // Process the remaining values (less than VEC_SIZE values)
30    nb_remaining = right - left
31    val = simd_load(array, left)
32    left = right
33    mask = get_mask_less_equal(val, pivotvec)
34    mask_low = cut_mask(mask, nb_remaining)
35    mask_high = cut_mask(reverse_mask(mask), nb_remaining)
36    // (A) Possible optimization: do only if mask_low is not 0
37    simd_store_some(array, left_w, mask_low, val)
38    left_w = left_w + mask_nb_true(mask_low)
39    // (A) Possible optimization: do only if mask_high is not 0
40    right_w = right_w - mask_nb_true(mask_high)
41    simd_store_some(array, right_w, mask_high, val)
42    // Process the two vectors saved at initialization
43    [left_w, right_w] = partition_vec(array, left_vec, pivotvec, left_w, right_w)
44    [left_w, right_w] = partition_vec(array, right_vec, pivotvec, left_w, right_w)
45    return left_w
46 function partition_vec(array, val, pivotvec, left_w, right_w)
47    mask = get_mask_less_equal(val, pivotvec)
48    nb_low = mask_nb_true(mask)
49    nb_high = VEC_SIZE - nb_low
50    // (A) Possible optimization: do only if mask is not 0
51    simd_store_some(array, left_w, mask, val)
52    left_w = left_w + nb_low
53    // (A) Possible optimization: do only if mask is not all true
54    right_w = right_w - nb_high
55    simd_store_some(array, right_w, reverse_mask(mask), val)
56    return [left_w, right_w]

Algorithm 4: SIMD Quicksort. select_pivot_pos returns a pivot.
Input: array: an array to sort. length: the size of array.
Output: array: the array sorted.
1 function simd_QS(array, length)
2     simd_QS_core(array, 0, length - 1)
3 function simd_QS_core(array, left, right)
4     // Test if we must partition again or if we can sort
5     if left + SORT_BOUND < right then
6         pivot_idx = select_pivot_pos(array, left, right)
7         swap(array[pivot_idx], array[right])
8         partition_bound = simd_partition(array, left, right, array[right])
9         swap(array[partition_bound], array[right])
10        simd_QS_core(array, left, partition_bound - 1)
11        simd_QS_core(array, partition_bound + 1, right)
12    else
13        simd_bitonic_sort_wrapper(sub_array(array, left), right - left + 1)
14    end

compiler 17.0.2 (20170213) with the aggressive optimization flag -O3.

We compare our implementation against two references. The first one is the GNU STL 3.4.21, from which we use the std::sort and std::partition functions. The second one is the Intel Integrated Performance Primitives (IPP) 2017, a library optimized for Intel processors. We use the IPP radix-based sort (function ippsSortRadixAscend_[type]_I). This function requires additional space, but it is known as one of the fastest existing sorting implementations.

The test file used for the following benchmark is available online³; it includes the different sorts presented in this study, plus some additional strategies and tests. Our SIMD-QS uses a 3-values median pivot selection (similar to the STL sort function). The arrays to sort are populated with randomly generated values.

³ The test file that generates the performance study is available at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort (branch paper) under MIT license.

B. Performance to Sort Small Arrays

Fig. 4 shows the execution times to sort arrays of sizes from 1 to 16 × VEC_SIZE, which corresponds to 128 double floating-point values, or 256 integer values. We also test arrays whose size is not a multiple of the SIMD-vector's length. The AVX-512-bitonic always delivers better performance than the Intel IPP for any size, and better performance than the STL when sorting more than 5 values. The speedup is significant, around 8 on average. The execution time per item increases every VEC_SIZE values, because the cost of sorting is not tied to the number of values but to the number of SIMD-vectors to sort, as explained in Section III-A3. For example, in Fig. 4a, the execution time to sort 31 or 32 values is the same, because we sort one SIMD-vector of 32 values in both cases. Our method to sort key/value pairs seems efficient (see Fig. 4c), because the speedup is even better against the STL compared to the sorting of integers.

C. Partitioning Performance

Fig. 5 shows the execution times to partition using our AVX-512-partition or the STL's partition function. Our method again provides a speedup of an average factor of 4. For the three configurations, an overhead impacts our implementation and the STL when partitioning arrays larger than 10⁷ items. Our AVX-512-partition remains faster, but its speedup decreases from 4 to 3. This phenomenon is related to cache effects, since 10⁷ integer values occupy 40 MBytes, which is more than the L3 cache size. In addition, we see that this effect starts from 10⁵ when partitioning key/value pairs.
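To make the partitioning scheme of Algorithm 3 concrete without AVX-512 hardware, the following scalar C++ sketch is our own illustration, not the paper's code: a SIMD-vector is emulated by a block of VEC_SIZE doubles, the masked compress-stores of partition_vec become explicit loops, and std::partition stands in for the scalar fallback. The leftover block is processed before the two saved extremity vectors, following the order of the Appendix B code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar sketch of Algorithm 3 (illustration only). A SIMD-vector is a
// block of VEC_SIZE doubles; compress-stores are loops over the block.
constexpr std::size_t VEC_SIZE = 8;
using Block = std::array<double, VEC_SIZE>;

std::size_t simd_partition(std::vector<double>& a, double pivot) {
    if (a.size() < 2 * VEC_SIZE)  // "if too small use scalar partitioning"
        return std::partition(a.begin(), a.end(),
                   [&](double v) { return v <= pivot; }) - a.begin();

    std::size_t left = 0, left_w = 0;
    std::size_t right = a.size() - VEC_SIZE, right_w = a.size();

    // Save one block on each extremity so the in-place stores below
    // cannot overwrite values that have not been read yet.
    Block left_vec, right_vec, val;
    std::copy_n(a.begin(), VEC_SIZE, left_vec.begin());
    left += VEC_SIZE;
    std::copy_n(a.begin() + right, VEC_SIZE, right_vec.begin());

    // partition_vec: compare a block to the pivot, compress-store the low
    // values at left_w and the high values just below right_w.
    auto partition_vec = [&](const Block& b, std::size_t n) {
        std::size_t nb_low = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (b[i] <= pivot) ++nb_low;
        std::size_t lo = left_w, hi = right_w - (n - nb_low);
        for (std::size_t i = 0; i < n; ++i)
            if (b[i] <= pivot) a[lo++] = b[i]; else a[hi++] = b[i];
        left_w += nb_low;
        right_w -= n - nb_low;
    };

    while (left + VEC_SIZE <= right) {
        if (left - left_w <= right_w - right) {   // load from the side
            std::copy_n(a.begin() + left, VEC_SIZE, val.begin());
            left += VEC_SIZE;                     // with less free room
        } else {
            right -= VEC_SIZE;
            std::copy_n(a.begin() + right, VEC_SIZE, val.begin());
        }
        partition_vec(val, VEC_SIZE);
    }
    // Less than VEC_SIZE unread values remain; read them, then flush the
    // two saved blocks. After this point every value has been read.
    const std::size_t rem_n = right - left;
    Block rem;
    std::copy_n(a.begin() + left, rem_n, rem.begin());
    partition_vec(rem, rem_n);
    partition_vec(left_vec, VEC_SIZE);
    partition_vec(right_vec, VEC_SIZE);
    return left_w;  // boundary between values <= pivot and values > pivot
}
```

The load-from-the-fuller-side test (line 18 of Algorithm 3) keeps at least VEC_SIZE free slots on each side, which is what makes the in-place stores safe in the steady state.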
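The key/value mechanism of Section III-D can also be written down in scalar form. The sketch below is our own illustration, not the paper's code: perm plays the role of the Bitonic permutation p, mask marks the lanes that keep the maximum (like the AVX-512 lane masks), and the mask of moves m is recovered by comparing the keys before and after the exchange, so that the values follow their keys.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Scalar sketch of the key/value compare-and-exchange of Section III-D
// (illustration only). k' and v' are the permuted copies; lanes whose key
// changed took the permuted key, so they must take the permuted value too.
constexpr std::size_t N = 8;
using Vec = std::array<int, N>;

void compare_and_exchange_kv(Vec& k, Vec& v,
                             const std::array<std::size_t, N>& perm,
                             unsigned mask) {
    Vec kp, vp;  // k' and v': both vectors permuted by p
    for (std::size_t i = 0; i < N; ++i) {
        kp[i] = k[perm[i]];
        vp[i] = v[perm[i]];
    }
    Vec knew;
    for (std::size_t i = 0; i < N; ++i)
        knew[i] = (mask >> i & 1u) ? std::max(k[i], kp[i])
                                   : std::min(k[i], kp[i]);
    // m: select the permuted value wherever the exchanged key differs
    // from the original key.
    for (std::size_t i = 0; i < N; ++i)
        if (knew[i] != k[i]) v[i] = vp[i];
    k = knew;
}
```

A usage example with a neighbor exchange (pairs 0-1, 2-3, ...; odd lanes keep the maximum, mask 0xAA) shows the values staying attached to their keys.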
[Fig. 4 and Fig. 5: execution time divided by n log(n) (sorting) or by n (partitioning) against the number of values n, for integers, doubles, and key/value pairs; only axis labels and speedup annotations survived the extraction, so the plots are omitted here.]
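The overall recursion of Algorithm 4 can be sketched in scalar C++. In the sketch below (our own illustration, not the paper's code), std::partition and std::sort stand in for simd_partition and simd_bitonic_sort_wrapper, and SORT_BOUND is assumed to be 16 SIMD-vectors of doubles (128 values), following Section III-A; the paper does not state its exact value.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Scalar skeleton of Algorithm 4 (illustration only). std::partition and
// std::sort stand in for simd_partition and simd_bitonic_sort_wrapper;
// SORT_BOUND = 128 is an assumption (16 SIMD-vectors of doubles).
constexpr long SORT_BOUND = 128;

void simd_QS_core(std::vector<double>& a, long left, long right) {
    if (left + SORT_BOUND < right) {
        // select_pivot_pos: 3-values median, as the paper's SIMD-QS uses
        long mid = left + (right - left) / 2;
        if (a[mid] < a[left]) std::swap(a[mid], a[left]);
        if (a[right] < a[mid]) std::swap(a[right], a[mid]);
        if (a[mid] < a[left]) std::swap(a[mid], a[left]);
        std::swap(a[mid], a[right]);   // move the median to the end
        const double pivot = a[right];
        long bound = std::partition(a.begin() + left, a.begin() + right,
                         [&](double v) { return v <= pivot; }) - a.begin();
        std::swap(a[bound], a[right]); // put the pivot at its final place
        simd_QS_core(a, left, bound - 1);
        simd_QS_core(a, bound + 1, right);
    } else {
        // small partition: the Bitonic-based sort in the paper
        std::sort(a.begin() + left, a.begin() + right + 1);
    }
}

void simd_QS(std::vector<double>& a) {
    if (!a.empty()) simd_QS_core(a, 0, (long)a.size() - 1);
}
```

Swapping the pivot to the end and back (lines 7 and 9 of Algorithm 4) guarantees progress even when many values equal the pivot, since the pivot's slot is excluded from both recursive calls.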
Fig. 6 (curves: std::sort, Intel IPP, AVX-512-QS; panels (a) Integer (int) and (c) Key/value integer pair (int[2])): Execution time divided by n log(n) to sort arrays filled with random values with sizes from 2¹ to 2³⁰ (≈ 10⁹). The execution time is obtained from the average of 5 executions. The speedup of the AVX-512-QS against the fastest between STL and IPP is shown above the AVX-512-QS line.

V. CONCLUSIONS

In this paper, we introduced a new Bitonic sort and a new [...] and ready to be used and compared. In the future, we intend to design a parallel implementation of our AVX-512-QS, and we expect the recursive partitioning to be naturally parallelized with a task-based scheme on top of OpenMP.

APPENDIX

A. Scalar Quicksort Algorithm

Algorithm 5: Quicksort
Input: array: an array to sort. length: the size of array.
Output: array: the array sorted.
1 function QS(array, length)
2     QS_core(array, 0, length - 1)
3 function QS_core(array, left, right)
4     if left < right then

B. Source Code Extracts

Code 1: AVX-512 Bitonic sort for one SIMD-vector of double floating-point values.

    inline __m512d AVX512_bitonic_sort_1v(__m512d input){
        {
            __m512i idxNoNeigh = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
            __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
            __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
            __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
            input = _mm512_mask_mov_pd(permNeighMin, 0xAA, permNeighMax);
        }
        {
            __m512i idxNoNeigh = _mm512_set_epi64(4, 5, 6, 7, 0, 1, 2, 3);
            __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
            __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
            __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
            input = _mm512_mask_mov_pd(permNeighMin, 0xCC, permNeighMax);
        }
        {
            __m512i idxNoNeigh = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
            __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
            __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
            __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
            input = _mm512_mask_mov_pd(permNeighMin, 0xAA, permNeighMax);
        }
        {
            __m512i idxNoNeigh = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
            __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
            __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
            __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
            input = _mm512_mask_mov_pd(permNeighMin, 0xF0, permNeighMax);
        }
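Code 1's permute/min/max/blend pattern can be checked without AVX-512 hardware. The following scalar transcription is our own sketch: idx[i] lists, lane 0 first (i.e., reversed relative to the _mm512_set_epi64 argument order), the lane each value is compared with, and mask marks the lanes that keep the maximum, mirroring the 0xAA/0xCC/0xF0 masks of the intrinsics. The six stages are the compare-and-exchanges of Algorithm 1.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Scalar transcription of the Bitonic network of Algorithm 1 / Code 1
// (illustration only). idx[i] is the lane compared with lane i; lanes
// whose mask bit is set keep the maximum, the others keep the minimum.
using Vec = std::array<double, 8>;

void compare_and_exchange(Vec& v, const std::array<int, 8>& idx,
                          unsigned mask) {
    Vec perm;  // the permuted copy (permutexvar)
    for (int i = 0; i < 8; ++i) perm[i] = v[idx[i]];
    for (int i = 0; i < 8; ++i)  // min/max + masked blend (mask_mov)
        v[i] = (mask >> i & 1u) ? std::max(v[i], perm[i])
                                : std::min(v[i], perm[i]);
}

void bitonic_sort_1v(Vec& v) {
    compare_and_exchange(v, {1, 0, 3, 2, 5, 4, 7, 6}, 0xAA);
    compare_and_exchange(v, {3, 2, 1, 0, 7, 6, 5, 4}, 0xCC);
    compare_and_exchange(v, {1, 0, 3, 2, 5, 4, 7, 6}, 0xAA);
    compare_and_exchange(v, {7, 6, 5, 4, 3, 2, 1, 0}, 0xF0);
    compare_and_exchange(v, {2, 3, 0, 1, 6, 7, 4, 5}, 0xCC);
    compare_and_exchange(v, {1, 0, 3, 2, 5, 4, 7, 6}, 0xAA);
}
```

The masks of the last two stages (0xCC, 0xAA) are our completion by analogy with the earlier stages, since the Code 1 extract is truncated after the fourth exchange; they follow the usual Bitonic merge pattern.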
Code 2: AVX-512 partitioning of a double floating-point array.

    template <class IndexType>
    static inline IndexType AVX512_partition(double array[], IndexType left,
                                             IndexType right, const double pivot){
        const IndexType S = 8; // (512/8)/sizeof(double);

        if(right-left+1 < 2*S){
            return CoreScalarPartition<double, IndexType>(array, left, right, pivot);
        }

        __m512d pivotvec = _mm512_set1_pd(pivot);

        __m512d left_val = _mm512_loadu_pd(&array[left]);
        IndexType left_w = left;
        left += S;

        IndexType right_w = right+1;
        right -= S-1;
        __m512d right_val = _mm512_loadu_pd(&array[right]);

        while(left + S <= right){
            const IndexType free_left = left - left_w;
            const IndexType free_right = right_w - right;

            __m512d val;
            if(free_left <= free_right){
                val = _mm512_loadu_pd(&array[left]);
                left += S;
            }
            else{
                right -= S;
                val = _mm512_loadu_pd(&array[right]);
            }

            __mmask8 mask = _mm512_cmp_pd_mask(val, pivotvec, _CMP_LE_OQ);

            const IndexType nb_low = popcount(mask);
            const IndexType nb_high = S-nb_low;

            _mm512_mask_compressstoreu_pd(&array[left_w], mask, val);
            left_w += nb_low;

            right_w -= nb_high;
            _mm512_mask_compressstoreu_pd(&array[right_w], ~mask, val);
        }

        {
            const IndexType remaining = right - left;
            __m512d val = _mm512_loadu_pd(&array[left]);
            left = right;

            __mmask8 mask = _mm512_cmp_pd_mask(val, pivotvec, _CMP_LE_OQ);

            __mmask8 mask_low = mask & ~(0xFF << remaining);
            __mmask8 mask_high = (~mask) & ~(0xFF << remaining);

            const IndexType nb_low = popcount(mask_low);
            const IndexType nb_high = popcount(mask_high);

            _mm512_mask_compressstoreu_pd(&array[left_w], mask_low, val);
            left_w += nb_low;

            right_w -= nb_high;
            _mm512_mask_compressstoreu_pd(&array[right_w], mask_high, val);
        }
        {
            __mmask8 mask = _mm512_cmp_pd_mask(left_val, pivotvec, _CMP_LE_OQ);

            const IndexType nb_low = popcount(mask);
            const IndexType nb_high = S-nb_low;

            _mm512_mask_compressstoreu_pd(&array[left_w], mask, left_val);
            left_w += nb_low;

            right_w -= nb_high;
            _mm512_mask_compressstoreu_pd(&array[right_w], ~mask, left_val);
        }
        {
            __mmask8 mask = _mm512_cmp_pd_mask(right_val, pivotvec, _CMP_LE_OQ);

            const IndexType nb_low = popcount(mask);
            const IndexType nb_high = S-nb_low;

            _mm512_mask_compressstoreu_pd(&array[left_w], mask, right_val);
            left_w += nb_low;

            right_w -= nb_high;
            _mm512_mask_compressstoreu_pd(&array[right_w], ~mask, right_val);
        }
        return left_w;
    }

REFERENCES

[1] Graefe, G.: Implementing sorting in database systems. ACM Computing Surveys (CSUR), 38(3):10, 2006.
[2] Bishop, L., Eberly, D., Whitted, T., Finch, M., Shantz, M.: Designing a PC game engine. IEEE Computer Graphics and Applications, 18(1):46–53, 1998.
[3] Hoare, C. A. R.: Quicksort. The Computer Journal, 5(1):10–16, 1962.
[4] ISO/IEC 14882:2003(E): Programming Languages - C++, 2003. 25.3.1.1 sort [lib.sort] para. 2.
[5] ISO/IEC 14882:2014(E): Programming Languages - C++, 2014. 25.4.1.1 sort (p. 911).
[6] Musser, D. R.: Introspective sorting and selection algorithms. Softw., Pract. Exper., 27(8):983–993, 1997.
[7] Batcher, K. E.: Sorting networks and their applications. In Proceedings of the April 30–May 2, 1968, spring joint computer conference, pages 307–314. ACM, 1968.
[8] Nassimi, D., Sahni, S.: Bitonic sort on a mesh-connected parallel computer. IEEE Trans. Computers, 28(1):2–7, 1979.
[9] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., Phillips, J. C.: GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.
[10] Kogge, P. M.: The architecture of pipelined computers. CRC Press, 1981.
[11] Intel: Intel 64 and IA-32 architectures software developer's manual: Instruction set reference (2A, 2B, 2C, and 2D). Available on: https://software.intel.com/en-us/articles/intel-sdm.
[12] Intel: Introduction to Intel advanced vector extensions. Available on: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions.
[13] Intel: Intel architecture instruction set extensions programming reference. Available on: https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf.
[14] Sanders, P., Winkel, S.: Super scalar sample sort. In European Symposium on Algorithms, pages 784–796. Springer, 2004.
[15] Inoue, H., Moriyama, T., Komatsu, H., Nakatani, T.: AA-sort: A new parallel sorting algorithm for multi-core SIMD processors. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 189–198. IEEE Computer Society, 2007.
[16] Furtak, T., Amaral, J. N., Niewiadomski, R.: Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pages 348–357. ACM, 2007.
[17] Chhugani, J., Nguyen, A. D., Lee, V. W., Macy, W., Hagog, M., Chen, Y.-K., Baransi, A., Kumar, S., Dubey, P.: Efficient implementation of sorting on multi-core SIMD CPU architecture. Proceedings of the VLDB Endowment, 1(2):1313–1324, 2008.
[18] Gueron, S., Krasnov, V.: Fast quicksort implementation using AVX instructions. The Computer Journal, 59(1):83–90, 2016.