
A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake

Berenger Bramas

To cite this version:
Berenger Bramas. A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake. International Journal of Advanced Computer Science and Applications (IJACSA), 2017, <10.14569/IJACSA.2017.081044>. <hal-01512970v2>

HAL Id: hal-01512970
https://hal.inria.fr/hal-01512970v2
Submitted on 2 Nov 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake

Berenger Bramas
Max Planck Computing and Data Facility (MPCDF)
Gießenbachstraße 2
85748 Garching, Germany

Abstract—The modern CPU's design, which is composed of hierarchical memory and SIMD/vectorization capability, governs the potential for algorithms to be transformed into efficient implementations. The release of the AVX-512 changed things radically, and motivated us to search for an efficient sorting algorithm that can take advantage of it. In this paper, we describe the best strategy we have found, which is a novel two-part hybrid sort, based on the well-known Quicksort algorithm. The central partitioning operation is performed by a new algorithm, and small partitions/arrays are sorted using a branch-free Bitonic-based sort. This study is also an illustration of how classical algorithms can be adapted and enhanced by the AVX-512 extension. We evaluate the performance of our approach on a modern Intel Xeon Skylake and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results demonstrate that our approach is faster than two libraries of reference: the GNU C++ sort algorithm by a speedup factor of 4, and the Intel IPP library by a speedup factor of 1.4.

Keywords—Quicksort; Bitonic; sort; vectorization; SIMD; AVX-512; Skylake

I. INTRODUCTION

Sorting is a fundamental problem in computer science that has always had the attention of the research community, because it is widely used to reduce the complexity of some algorithms. Moreover, sorting is a central operation in specific applications such as, but not limited to, database servers [1] and image rendering engines [2]. Therefore, having efficient sorting libraries on new architectures could potentially leverage the performance of a wide range of applications.

The vectorization — that is, the CPU's capability to apply a single instruction on multiple data (SIMD) — improves continuously, one CPU generation after the other. While the difference between a scalar code and its vectorized equivalent was "only" a factor of 4 in the year 2000 (SSE), the difference is now up to a factor of 16 (AVX-512). Therefore, it is indispensable to vectorize a code to achieve high performance on modern CPUs, by using dedicated instructions and registers. The conversion of a scalar code into a vectorized equivalent is straightforward for many classes of algorithms and computational kernels, and it can even be done with auto-vectorization for some of them. However, the opportunity for vectorization is tied to the memory/data access patterns, such that data-processing algorithms (like sorting) usually require an important effort to be transformed. In addition, creating a fully vectorized implementation, without any scalar sections, is only possible and efficient if the instruction set provides the needed operations. Consequently, new instruction sets, such as AVX-512, allow for the use of approaches that were not feasible previously.

The Intel Xeon Skylake (SKL) processor is the second CPU that supports AVX-512, after the Intel Knights Landing. The SKL supports the AVX-512 instruction set [13]: it supports Intel AVX-512 foundational instructions (AVX-512F), Intel AVX-512 conflict detection instructions (AVX-512CD), Intel AVX-512 byte and word instructions (AVX-512BW), Intel AVX-512 doubleword and quadword instructions (AVX-512DQ), and Intel AVX-512 vector length extensions instructions (AVX-512VL). AVX-512 not only allows work on SIMD-vectors of double the size, compared to the previous AVX(2) set, it also provides various new operations.

Therefore, in the current paper, we focus on the development of new sorting strategies and their efficient implementation for the Intel Skylake using AVX-512. The contributions of this study are the following:

• proposing a new partitioning algorithm using AVX-512,
• defining a new Bitonic-sort variant for small arrays using AVX-512,
• implementing a new Quicksort variant using AVX-512.

All in all, we show how we can obtain a fast and vectorized sorting algorithm¹.

¹The functions described in the current study are available at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort. This repository includes a clean header-only library (branch master) and a test file that generates the performance study of the current manuscript (branch paper). The code is under the MIT license.

The rest of the paper is organized as follows: Section II gives background information related to vectorization and sorting. We then describe our approach in Section III, introducing our strategy for sorting small arrays, and the vectorized partitioning function, which are combined in our Quicksort variant. Finally, we provide performance details in Section IV and the conclusion in Section V.

II. BACKGROUND

A. Sorting Algorithms

1) Quicksort (QS) Overview: QS was originally proposed in [3]. It uses a divide-and-conquer strategy, by recursively


partitioning the input array, until it ends with partitions of one value. The partitioning puts values lower than a pivot at the beginning of the array, and greater values at the end, with a linear complexity. QS has a worst-case complexity of O(n²), but an average complexity of O(n log n) in practice. The complexity is tied to the choice of the partitioning pivot, which must be close to the median to ensure a low complexity. However, its simplicity in terms of implementation, and its speed in practice, have turned it into a very popular sorting algorithm. Fig. 1 shows an example of a QS execution.

[Figure omitted: recursion tree of the example.]
Fig. 1: Quicksort example to sort [3, 1, 2, 0, 5] to [0, 1, 2, 3, 5]. The pivot is equal to the value in the middle: the first pivot is 2, then at the second recursion level it is 1 and 5.

We provide in Appendix A the scalar QS algorithm. Here, the term scalar refers to a single value, as opposed to a SIMD vector. In this implementation, the choice of the pivot is naively made by selecting the value in the middle before partitioning, and this can result in very unbalanced partitions. This is why more advanced heuristics have been proposed in the past, like selecting the median from several values, for example.
2) GNU std::sort Implementation (STL): The worst-case complexity of QS makes it no longer suitable to be used as a standard C++ sort. In fact, a complexity of O(n log n) on average was required until year 2003 [4], but it is now a worst-case limit [5] that a pure QS implementation cannot guarantee. Consequently, the current implementation is a 3-part hybrid sorting algorithm, i.e. it relies on 3 different algorithms². The algorithm uses an Introsort [6] to a maximum depth of 2 × log₂ n to obtain small partitions that are then sorted using an insertion sort. Introsort is itself a 2-part hybrid of Quicksort and heap sort.

²See the libstdc++ documentation on the sorting algorithm available at https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.4/a01347.html#l05207

3) Bitonic Sorting Network: In computer science, a sorting network is an abstract description of how to sort a fixed number of values, i.e. how the values are compared and exchanged. This can be represented graphically, by having each input value as a horizontal line, and each compare-and-exchange unit as a vertical connection between those lines. There are various examples of sorting networks in the literature, but we concentrate our description on the Bitonic sort from [7]. This network is easy to implement and has an algorithmic complexity of O(n log(n)²). It has demonstrated good performance on parallel computers [8] and GPUs [9]. Fig. 2a shows a Bitonic sorting network to process 16 values. A sorting network can be seen as a time line, where input values are transferred from left to right, and exchanged if needed at each vertical bar. We illustrate an execution in Fig. 2b, where we print the intermediate steps while sorting an array of 8 values. The Bitonic sort is not stable because it does not maintain the original order of the values.

[Figure omitted. Panel (a): Bitonic sorting network for input of size 16; all vertical bars/switches exchange values in the same direction. Panel (b): example of 8 values sorted by a Bitonic sorting network.]
Fig. 2: Bitonic sorting network examples. In red boxes, the exchanges are done from the extremities to the center, whereas in orange boxes, the exchanges are done with a linear progression.

If the size of the array to sort is known, it is possible to implement a sorting network by hard-coding the connections between the lines. This can be seen as a direct mapping of the picture. However, when the array size is unknown, the implementation can be made more flexible by using a formula/rule to decide when to compare/exchange values.
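To make the hard-coded approach concrete, here is a minimal scalar sketch (our illustration, not code from the accompanying library) of a sorting network for exactly 4 values; each compare_exchange call corresponds to one vertical bar of the network drawing.

#include <algorithm>

// Exchange a and b so that a receives the smaller value.
static inline void compare_exchange(double& a, double& b) {
    if (b < a) { std::swap(a, b); }
}

// Hard-coded sorting network for exactly 4 values:
// the connections mirror the picture of the network.
static inline void sort4_network(double v[4]) {
    compare_exchange(v[0], v[1]);  // stage 1
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);  // stage 2
    compare_exchange(v[1], v[3]);
    compare_exchange(v[1], v[2]);  // stage 3
}

Because the sequence of comparisons is fixed, the function executes the same comparisons regardless of the input values, which is also what makes the vectorized variants of Section III branch-free.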
B. Vectorization

The term vectorization refers to a CPU's feature to apply a single operation/instruction to a vector of values instead of only a single value [10]. It is common to refer to this concept by Flynn's taxonomy term, SIMD, for single instruction on multiple data. By adding SIMD instructions/registers to CPUs, it has been possible to increase the peak performance of single cores, despite the stagnation of the clock frequency. The same strategy is used on new hardware, where the length of the SIMD registers has continued to increase. In the rest of the paper, we use the term vector for the data type managed by the CPU in this sense. It has no relation to an expandable vector data structure, such as std::vector. The size of the vectors is variable and depends on both the instruction set and the type of vector element, and corresponds to the size of the registers in the chip. Vector extensions to the x86 instruction set, for example, are SSE [11], AVX [12], and AVX-512 [13], which support vectors of size 128, 256 and 512 bits, respectively. This means that an SSE vector is able to store four single-precision floating-point numbers or two double-precision values. Fig. 3 illustrates the difference between a scalar summation and a vector summation for SSE or AVX, respectively. An AVX-512 SIMD-vector is able to store 8 double-precision floating-point numbers or 16 integer values, for example. Throughout this document, we use the intrinsic function extension instead of assembly language to write vectorized code on top of the AVX-512 instruction set. Intrinsics are small functions that are intended to be replaced with a single assembly instruction by the compiler.
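As a concrete counterpart to the summation of Fig. 3, the following hedged sketch (ours) contrasts a scalar loop with its AVX-512 equivalent, where one _mm512_add_pd call processes 8 doubles at a time; for simplicity, the vector version assumes n is a multiple of 8.

#include <immintrin.h>

// Scalar: one addition per loop iteration.
void add_scalar(const double* a, const double* b, double* out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// AVX-512: 8 additions per loop iteration (n assumed multiple of 8).
void add_avx512(const double* a, const double* b, double* out, int n) {
    for (int i = 0; i < n; i += 8) {
        const __m512d va = _mm512_loadu_pd(&a[i]);
        const __m512d vb = _mm512_loadu_pd(&b[i]);
        _mm512_storeu_pd(&out[i], _mm512_add_pd(va, vb));
    }
}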

[Figure omitted.]
Fig. 3: Summation example of single-precision floating-point values using: (a) scalar standard C++ code, (b) an SSE SIMD-vector of 4 values, (c) an AVX SIMD-vector of 8 values.

1) AVX-512 Instruction Set: As previous x86 vectorization extensions, the AVX-512 has instructions to load a contiguous block of values from the main memory and to transform it into a SIMD-vector (load). It is also possible to fill a SIMD-vector with a given value (set), and move back a SIMD-vector into memory (store). A permutation instruction allows to re-order the values inside a SIMD-vector using a second integer array which contains the permutation indexes. This operation has been possible since AVX/AVX2 using permutevar8x32 (instruction vperm(d,ps)). The instructions vminpd/vpminsd return a SIMD-vector where each value corresponds to the minimum of the values from the two input vectors at the same position. It is possible to obtain the maximum with the instructions vpmaxsd/vmaxpd.
In AVX-512, the value returned by a test/comparison (vpcmpd/vcmppd) is a mask (integer) and not an SIMD-vector of integers, as it was in SSE/AVX. Therefore, it is easy to modify and work directly on the mask with arithmetic and binary operations for scalar integers. Among the mask-based instructions, the mask move (vmovdqa32/vmovapd) allows for the selection of values between two vectors, using a mask. Achieving the same result was possible using the blend instruction since SSE4, and with several operations in earlier instruction sets.
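The following minimal sketch (our illustration) shows the mask mechanism with intrinsics: a comparison yields a plain 8-bit __mmask8, which can then drive a masked move that selects lanes between two vectors.

#include <immintrin.h>

// For each lane i: result[i] = (a[i] <= b[i]) ? big[i] : small[i].
__m512d select_by_comparison(__m512d a, __m512d b,
                             __m512d small, __m512d big) {
    // The comparison returns a plain 8-bit integer mask, one bit per lane.
    const __mmask8 mask = _mm512_cmp_pd_mask(a, b, _CMP_LE_OQ);
    // Masked move: take 'small' where the bit is 0, 'big' where it is 1.
    return _mm512_mask_mov_pd(small, mask, big);
}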
The AVX-512 provides operations that do not have an equivalent in previous extensions of the x86 instruction sets, such as the store-some (vpcompressps/vcompresspd) and load-some (vmovups/vmovupd). The store-some operation allows to save only a part of a SIMD-vector into memory. Similarly, the load-some allows to load fewer values than the size of a SIMD-vector from the memory. The values are loaded/saved contiguously. This is a major improvement, because without this instruction, several operations are needed to obtain the same result. For example, to save some values from a SIMD-vector v at address p in memory, one possibility is to load the current values from p into a SIMD-vector v', permute the values in v to move the values to store at the beginning, merge v and v', and finally save the resulting vector.
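A hedged sketch (ours) of the store-some operation: the compress-store intrinsic writes only the lanes whose mask bit is set, packed contiguously at the destination, which replaces the load/permute/merge sequence described above.

#include <immintrin.h>

// Append to 'dest' only the values of 'v' that are lower than or equal
// to the pivot, and return how many values were written.
int store_some_example(double* dest, __m512d v, __m512d pivotvec) {
    const __mmask8 mask = _mm512_cmp_pd_mask(v, pivotvec, _CMP_LE_OQ);
    // Selected lanes are compacted and stored contiguously at dest.
    _mm512_mask_compressstoreu_pd(dest, mask, v);
    // The popcount of the mask is the number of stored values.
    return _mm_popcnt_u32((unsigned)mask);
}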
C. Related Work on Vectorized Sorting Algorithms

The literature on sorting and vectorized sorting implementations is extremely large. Therefore, we only cite some of the studies that we consider most related to our work.

The sorting technique from [14] tries to remove branches and improves the prediction of a scalar sort, and the authors show a speedup by a factor of 2 against the STL (the implementation of the STL was different at that time). This study illustrates the early strategy to adapt sorting algorithms to a given hardware, and also shows the need for low-level optimizations, due to the limited instructions available at that time.
In [15], the authors propose a parallel sorting on top of combosort vectorized with the VMX instruction set of IBM architecture. Unaligned memory access is avoided, and the L2 cache is efficiently managed by using an out-of-core/blocking scheme. The authors show a speedup by a factor of 3 against the GNU C++ STL.

In [16], the authors use a sorting network for small-sized arrays, similar to our own approach. However, instead of dividing the main array into sorted partitions (partitions of increasing contents), and applying a small efficient sort on each of those partitions, the authors perform the opposite. They apply multiple small sorts on sub-parts of the array, and then they finish with a complicated merge scheme using extra memory to globally sort all the sub-parts. A very similar approach was later proposed in [17].

The recent work in [18] targets AVX2. The authors use a Quicksort variant with a vectorized partitioning function, and an insertion sort once the partitions are small enough (as the STL does). The partition method relies on look-up tables, with a mapping between the comparison's result of an SIMD-vector against the pivot, and the move/permutation that must be applied to the vector. The authors demonstrate a speedup by a factor of 4 against the STL, but their approach is not always faster than the Intel IPP library. The proposed method is not suitable for AVX-512 because the lookup tables would occupy too much memory. This issue, as well as the use of extra memory, can be solved with the new instructions of the AVX-512. As a side remark, the authors do not compare their proposal to the standard C++ partition function, even though it is the only part of their algorithm that is vectorized.

III. SORTING WITH AVX-512

A. Bitonic-Based Sort on AVX-512 SIMD-Vectors

In this section, we describe our method to sort small arrays that contain less than 16 times VEC_SIZE values, where VEC_SIZE is the number of values in a SIMD-vector. This function is later used in our final QS implementation to sort small enough partitions.

1) Sorting one SIMD-vector: To sort a single vector, we perform the same operations as the ones shown in Fig. 2a: we compare and exchange values following the indexes from the Bitonic sorting network. However, thanks to the vectorization, we are able to work on the entire vector without having to iterate on the values individually. We know the positions that we have to compare and exchange at the different stages of the algorithm. This is why, in our approach, we rely on static (hard-coded) permutation vectors, as shown in Algorithm 1. In this algorithm, the compare_and_exchange function performs all the compare and exchange operations that are applied at the same time in the Bitonic algorithm, i.e. the operations that are at the same horizontal position in the figure. To have a fully vectorized function, we implement the compare_and_exchange in three steps. First, we permute the input vector v into v' with the given permutation indexes p. Second, we obtain two vectors wmin and wmax that contain the minimum and maximum values between both v and v'. Finally, we select the values from wmin and wmax with a mask-based move, where the mask indicates in which direction the exchanges have to be done. The C++ source code of a fully vectorized branch-free implementation is given in Appendix B (Code 1).
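These three steps can be sketched as follows (our reading of the description; the paper's full branch-free version is Code 1 in Appendix B). The permutation indexes and the direction mask are the per-stage constants listed in Algorithm 1.

#include <immintrin.h>

// One Bitonic stage on a vector of 8 doubles: permute, min/max,
// then pick min or max per lane according to the direction mask.
static inline __m512d compare_and_exchange(__m512d v, __m512i perm_idx,
                                           __mmask8 dir_mask) {
    const __m512d permuted = _mm512_permutexvar_pd(perm_idx, v);
    const __m512d wmin = _mm512_min_pd(permuted, v);
    const __m512d wmax = _mm512_max_pd(permuted, v);
    // Lanes with a set bit receive the maximum, the others the minimum.
    return _mm512_mask_mov_pd(wmin, dir_mask, wmax);
}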

Algorithm 1: SIMD Bitonic sort for one vector of double floating-point values.
Input: vec: a double floating-point AVX-512 vector to sort.
Output: vec: the vector sorted.
1 function simd_bitonic_sort_1v(vec)
2   compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
3   compare_and_exchange(vec, [4, 5, 6, 7, 0, 1, 2, 3])
4   compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
5   compare_and_exchange(vec, [0, 1, 2, 3, 4, 5, 6, 7])
6   compare_and_exchange(vec, [5, 4, 7, 6, 1, 0, 3, 2])
7   compare_and_exchange(vec, [6, 7, 4, 5, 2, 3, 0, 1])
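Algorithm 1 composes six such stages. As a usage illustration (ours, not part of the paper's interface), the one-vector sort reproduced as Code 1 in Appendix B can be applied to a raw array of 8 doubles like this:

#include <immintrin.h>

// Sort 8 doubles in place with the one-vector Bitonic sort
// (AVX512_bitonic_sort_1v is reproduced in Appendix B, Code 1).
void sort8(double values[8]) {
    __m512d v = _mm512_loadu_pd(values);
    v = AVX512_bitonic_sort_1v(v);
    _mm512_storeu_pd(values, v);
}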
2) Sorting more than one SIMD-vector: The principle of using static permutation vectors to sort a single SIMD-vector can be applied to sort several SIMD-vectors. In addition, we can take advantage of the repetitive pattern of the Bitonic sorting network to re-use existing functions. More precisely, to sort V vectors, we re-use the function to sort V/2 vectors and so on. We provide an example to sort two SIMD-vectors in Algorithm 2, where we start by sorting each SIMD-vector individually using the simd_bitonic_sort_1v function. Then, we compare and exchange values between both vectors (line 5), and finally apply the same operations on each vector individually (lines 6 to 11). In our sorting implementation, we provide the functions to sort up to 16 SIMD-vectors, which correspond to 256 integer values or 128 double floating-point values.
Algorithm 2: SIMD Bitonic sort for two vectors of double floating-point values.
Input: vec1 and vec2: two double floating-point AVX-512 vectors to sort.
Output: vec1 and vec2: the two vectors sorted, with vec1 lower or equal than vec2.
1 function simd_bitonic_sort_2v(vec1, vec2)
2   // Sort each vector using simd_bitonic_sort_1v
3   simd_bitonic_sort_1v(vec1)
4   simd_bitonic_sort_1v(vec2)
5   compare_and_exchange_2v(vec1, vec2, [0, 1, 2, 3, 4, 5, 6, 7])
6   compare_and_exchange(vec1, [3, 2, 1, 0, 7, 6, 5, 4])
7   compare_and_exchange(vec2, [3, 2, 1, 0, 7, 6, 5, 4])
8   compare_and_exchange(vec1, [5, 4, 7, 6, 1, 0, 3, 2])
9   compare_and_exchange(vec2, [5, 4, 7, 6, 1, 0, 3, 2])
10  compare_and_exchange(vec1, [6, 7, 4, 5, 2, 3, 0, 1])
11  compare_and_exchange(vec2, [6, 7, 4, 5, 2, 3, 0, 1])
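The inter-vector exchange of line 5 can be realized as follows; this is a sketch under our assumptions rather than the paper's exact code. vec2 is reversed, and each lane then keeps the minimum in vec1 and the maximum in vec2.

#include <immintrin.h>

// Exchange across two vectors of 8 doubles: afterwards, each lane of
// vec1 holds a value lower or equal to the same lane of vec2.
static inline void compare_and_exchange_2v(__m512d& vec1, __m512d& vec2) {
    // _mm512_set_epi64 lists indexes from lane 7 down to lane 0,
    // so this constant reverses the vector.
    const __m512i reverse_idx = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
    const __m512d reversed2 = _mm512_permutexvar_pd(reverse_idx, vec2);
    const __m512d wmin = _mm512_min_pd(vec1, reversed2);
    const __m512d wmax = _mm512_max_pd(vec1, reversed2);
    vec1 = wmin;
    vec2 = wmax;
}

When both inputs are already sorted, this single step moves the 8 smallest values into vec1 and the 8 largest into vec2, which is why the remaining lines of Algorithm 2 only need to re-sort each vector internally.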
3) Sorting small arrays: Each of our SIMD-Bitonic-sort functions is designed for a specific number of SIMD-vectors. However, we intend to sort arrays that do not have a size multiple of the SIMD-vector's length, because they are obtained from the partitioning stage of the QS. Consequently, when we have to sort a small array, we first load it into SIMD-vectors, and then, we pad the last vector with the greatest possible value. This guarantees that the padding values have no impact on the sorting results, by staying at the end of the last vector. The selection of the appropriate SIMD-Bitonic-sort function, the one that matches the size of the array to sort, can be done efficiently with a switch statement. In the following, we refer to this interface as the simd_bitonic_sort_wrapper function.
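One possible realization of this padded load is sketched below (the function name and details are our assumptions): a masked load fills the missing lanes of the last vector with the greatest double.

#include <immintrin.h>
#include <limits>

// Load 'count' (1 <= count <= 8) doubles from 'ptr' into a vector whose
// remaining lanes are padded with the greatest possible value.
static inline __m512d load_padded(const double* ptr, int count) {
    const __m512d padding = _mm512_set1_pd(std::numeric_limits<double>::max());
    const __mmask8 mask = (__mmask8)(0xFFu >> (8 - count));
    // Lanes outside the mask keep the padding value.
    return _mm512_mask_loadu_pd(padding, mask, ptr);
}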
B. Partitioning with AVX-512

Algorithm 3 shows our strategy to develop a vectorized partitioning method. This algorithm is similar to a scalar partitioning function: there are iterators that start from both extremities of the array to keep track of where to load/store the values, and the process stops when some of these iterators meet. In its steady state, the algorithm loads an SIMD-vector using the left or right indexes (at lines 19 and 24), and partitions it using the partition_vec function (at line 27). The partition_vec function compares the input vector to the pivot vector (at line 47), and stores the values — lower or greater — directly in the array using a store-some instruction (at lines 51 and 55). The store-some is an AVX-512 instruction that we described in Section II-B1. The initialization of our algorithm starts by loading one vector from each of the array's extremities to ensure that no values will be overwritten during the steady state (lines 12 and 16). This way, our implementation works in-place and only needs three SIMD-vectors. Algorithm 3 also includes, as side comments, possible optimizations in case the array is more likely to be already partitioned (A), or to reduce the data displacement of the values (B). The AVX-512 implementation of this algorithm is given in Appendix B (Code 2). One should note that we use a scalar partition function if there are less than 2 × VEC_SIZE values in the given array (line 3).

C. Quicksort Variant

Our QS is given in Algorithm 4, where we partition the data using the simd_partition function from Section III-B, and then sort the small partitions using the simd_bitonic_sort_wrapper function from Section III-A. The obtained algorithm is very similar to the scalar QS given in Appendix A.

D. Sorting Key/Value Pairs

The previous sorting methods are designed to sort an array of numbers. However, some applications need to sort key/value pairs. More precisely, the sort is applied on the keys, and the values contain extra information and could be pointers to arbitrary data structures, for example. Storing each key/value pair contiguously in memory is not adequate for vectorization because it requires transforming the data. Therefore, in our approach, we store the keys and the values in two distinct arrays. To extend the SIMD-Bitonic-sort and SIMD-partition functions, we must ensure that the same permutations/moves are applied to the keys and the values. For the partition function, this is trivial: the same mask is used in combination with the store-some instruction for both arrays. For the Bitonic-based sort, we manually apply the permutations that were done on the vector of keys to the vector of values. To do so, we first save the vector of keys k before it is permuted by a compare and exchange, using the Bitonic permutation vector of indexes p, into k'. We compare k and k' to obtain a mask m that expresses what moves have been done. Then, we permute our vector of values v using p into v', and we select the correct values between v and v' using m. Consequently, we perform this operation at the end of the compare and exchange in all the Bitonic-based sorts.
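A hedged sketch of this key/value compare-and-exchange, following the description above (our interpretation; with duplicate keys, keeping the original value in a lane is equally valid):

#include <immintrin.h>

// One Bitonic stage applied to a vector of keys and its vector of values.
static inline void cmp_exch_key_value(__m512d& keys, __m512d& values,
                                      __m512i perm_idx, __mmask8 dir_mask) {
    const __m512d permK = _mm512_permutexvar_pd(perm_idx, keys);
    const __m512d wmin = _mm512_min_pd(permK, keys);
    const __m512d wmax = _mm512_max_pd(permK, keys);
    const __m512d newKeys = _mm512_mask_mov_pd(wmin, dir_mask, wmax);
    // Mask of lanes whose key was taken from the permuted vector.
    const __mmask8 moved = _mm512_cmp_pd_mask(newKeys, keys, _CMP_NEQ_OQ);
    // Apply exactly the same moves to the values.
    const __m512d permV = _mm512_permutexvar_pd(perm_idx, values);
    values = _mm512_mask_mov_pd(values, moved, permV);
    keys = newKeys;
}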

Algorithm 3: SIMD partitioning. VEC_SIZE is the number of values inside a SIMD-vector of the array's element type.
Input: array: an array to partition. length: the size of array. pivot: the reference value.
Output: array: the array partitioned. left_w: the index between the values lower and greater than the pivot.
1  function simd_partition(array, length, pivot)
2      // If too small use scalar partitioning
3      if length ≤ 2 × VEC_SIZE then
4          scalar_partition(array, length)
5          return
6      end
7      // Set: fill a vector with all values equal to pivot
8      pivotvec = simd_set_from_one(pivot)
9      // Init iterators and save one vector on each extremity
10     left = 0
11     left_w = 0
12     left_vec = simd_load(array, left)
13     left = left + VEC_SIZE
14     right = length - VEC_SIZE
15     right_w = length
16     right_vec = simd_load(array, right)
17     while left + VEC_SIZE ≤ right do
18         if (left - left_w) ≤ (right_w - right) then
19             val = simd_load(array, left)
20             left = left + VEC_SIZE
21             // (B) Possible optimization, swap val and left_vec
22         else
23             right = right - VEC_SIZE
24             val = simd_load(array, right)
25             // (B) Possible optimization, swap val and right_vec
26         end
27         [left_w, right_w] = partition_vec(array, val, pivotvec, left_w, right_w)
28     end
29     // Process left_vec and right_vec
30     [left_w, right_w] = partition_vec(array, left_vec, pivotvec, left_w, right_w)
31     [left_w, right_w] = partition_vec(array, right_vec, pivotvec, left_w, right_w)
32     // Proceed remaining values (less than VEC_SIZE values)
33     nb_remaining = right - left
34     val = simd_load(array, left)
35     left = right
36     mask = get_mask_less_equal(val, pivotvec)
37     mask_low = cut_mask(mask, nb_remaining)
38     mask_high = cut_mask(reverse_mask(mask), nb_remaining)
39     // (A) Possible optimization, do only if mask_low is not 0
40     simd_store_some(array, left_w, mask_low, val)
41     left_w = left_w + mask_nb_true(mask_low)
42     // (A) Possible optimization, do only if mask_high is not 0
43     right_w = right_w - mask_nb_true(mask_high)
44     simd_store_some(array, right_w, mask_high, val)
45     return left_w
46 function partition_vec(array, val, pivotvec, left_w, right_w)
47     mask = get_mask_less_equal(val, pivotvec)
48     nb_low = mask_nb_true(mask)
49     nb_high = VEC_SIZE - nb_low
50     // (A) Possible optimization, do only if mask is not 0
51     simd_store_some(array, left_w, mask, val)
52     left_w = left_w + nb_low
53     // (A) Possible optimization, do only if mask is not all true
54     right_w = right_w - nb_high
55     simd_store_some(array, right_w, reverse_mask(mask), val)
56     return [left_w, right_w]

Algorithm 4: SIMD Quicksort. select_pivot_pos returns a pivot.
Input: array: an array to sort. length: the size of array.
Output: array: the array sorted.
1  function simd_QS(array, length)
2      simd_QS_core(array, 0, length - 1)
3  function simd_QS_core(array, left, right)
4      // Test if we must partition again or if we can sort
5      if left + SORT_BOUND < right then
6          pivot_idx = select_pivot_pos(array, left, right)
7          swap(array[pivot_idx], array[right])
8          partition_bound = simd_partition(array, left, right, array[right])
9          swap(array[partition_bound], array[right])
10         simd_QS_core(array, left, partition_bound - 1)
11         simd_QS_core(array, partition_bound + 1, right)
12     else
13         simd_bitonic_sort_wrapper(sub_array(array, left), right - left + 1)
14     end
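Algorithm 4 leaves select_pivot_pos abstract; Section IV-A mentions that the evaluated SIMD-QS uses a 3-values median. A minimal median-of-three sketch (the details are our assumption, not the paper's exact heuristic):

// Return the index, among {left, middle, right}, of the median value.
template <class IndexType>
IndexType select_pivot_pos(const double array[], IndexType left, IndexType right) {
    const IndexType middle = left + (right - left) / 2;
    const double a = array[left], b = array[middle], c = array[right];
    if ((a <= b && b <= c) || (c <= b && b <= a)) return middle;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return left;
    return right;
}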
IV. PERFORMANCE STUDY

A. Configuration

We assess our method on an Intel(R) Xeon(R) Platinum 8170 Skylake CPU at 2.10 GHz, with caches of sizes 32 KBytes, 1024 KBytes and 36608 KBytes, at levels L1, L2 and L3, respectively. The process and allocations are bound with numactl --physcpubind=0 --localalloc. We use the Intel compiler 17.0.2 (20170213) with the aggressive optimization flag -O3.

We compare our implementation against two references. The first one is the GNU STL 3.4.21, from which we use the std::sort and std::partition functions. The second one is the Intel Integrated Performance Primitives (IPP) 2017, a library optimized for Intel processors. We use the IPP radix-based sort (function ippsSortRadixAscend_[type]_I). This function requires additional space, but it is known as one of the fastest existing sorting implementations.

The test file used for the following benchmark is available online³; it includes the different sorts presented in this study plus some additional strategies and tests. Our SIMD-QS uses a 3-values median pivot selection (similar to the STL sort function). The arrays to sort are populated with randomly generated values.

³The test file that generates the performance study is available at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort (branch paper) under the MIT license.

B. Performance to Sort Small Arrays

Fig. 4 shows the execution times to sort arrays of sizes from 1 to 16 × VEC_SIZE, which corresponds to 128 double floating-point values, or 256 integer values. We also test arrays whose sizes are not multiples of the SIMD-vector's length. The AVX-512-bitonic always delivers better performance than the Intel IPP for any size, and better performance than the STL when sorting more than 5 values. The speedup is significant, and is around 8 on average. The execution time per item increases every VEC_SIZE values because the cost of sorting is not tied to the number of values but to the number of SIMD-vectors to sort, as explained in Section III-A3. For example, in Fig. 4a, the execution time to sort 31 or 32 values is the same, because we sort one SIMD-vector of 32 values in both cases. Our method to sort key/value pairs seems efficient, see Fig. 4c, because the speedup is even better against the STL compared to the sorting of integers.

C. Partitioning Performance

Fig. 5 shows the execution times to partition using our AVX-512-partition or the STL's partition function. Our method again provides a speedup of an average factor of 4. For the three configurations, an overhead impacts our implementation and the STL when partitioning arrays larger than 10⁷ items. Our AVX-512-partition remains faster, but its speedup decreases from 4 to 3. This phenomenon is related to cache effects, since 10⁷ integer values occupy 40 MBytes, which is more than the L3 cache size. In addition, we see that this effect starts from 10⁵ when partitioning key/value pairs.


[Figures 4 and 5 omitted. Both have three panels: (a) Integer (int), (b) Floating-point (double), (c) Key/value integer pair (int[2]). Fig. 4 plots time in s/(n log n) and compares std::sort, Intel IPP and AVX-512-bitonic sort; Fig. 5 plots time in s/n and compares std::partition and AVX-512-partition.]

Fig. 4: Execution time divided by n log(n) to sort from 1 to 16 × VEC_SIZE values. The execution time is obtained from the average of 10⁴ sorts for each size. The speedup of the AVX-512-bitonic against the fastest between the STL and the IPP is shown above the AVX-512-bitonic line.

Fig. 5: Execution time divided by n to partition arrays filled with random values, with sizes from 2¹ to 2³⁰ (≈ 10⁹). The pivot is selected randomly. The execution time is obtained from the average of 20 executions. The speedup of the AVX-512-partition against the STL is shown above the AVX-512-partition line.

D. Performance to Sort Large Arrays

Fig. 6 shows the execution times to sort arrays up to a size of 10⁹ items. Our AVX-512-QS is always faster in all configurations. The difference between the AVX-512-QS and the STL sort seems stable for any size, with a speedup of more than 6 to our benefit. However, while the Intel IPP is not efficient for arrays with less than 10⁴ elements, its performance is really close to the AVX-512-QS for large arrays. The same effect found when partitioning appears when sorting arrays larger than 10⁷ items. All three sorting functions are impacted, but the IPP seems to be slowed down more than our method, because it is based on a different access pattern, such that the AVX-512-QS is almost twice as fast as the IPP for a size of 10⁹ items.


[Figure 6 omitted. Panels: (a) Integer (int), (b) Floating-point (double), (c) Key/value integer pair (int[2]), comparing std::sort, Intel IPP and AVX-512-QS, with time in s/(n log n) against the number of values n.]

Fig. 6: Execution time divided by n log(n) to sort arrays filled with random values with sizes from 2¹ to 2³⁰ (≈ 10⁹). The execution time is obtained from the average of 5 executions. The speedup of the AVX-512-QS against the fastest between the STL and the IPP is shown above the AVX-512-QS line.

V. CONCLUSIONS

In this paper, we introduced a new Bitonic sort and a new partition algorithm that have been designed for the AVX-512 instruction set. These two functions are used in our Quicksort variant, which makes it possible to have a fully vectorized implementation (except for the partitioning of tiny arrays). Our approach shows superior performance on Intel SKL in all configurations against two reference libraries: the GNU C++ STL, and the Intel IPP. It provides a speedup of 8 to sort small arrays (less than 16 SIMD-vectors), and a speedup of 4 and 1.4 for large arrays, against the C++ STL and the Intel IPP, respectively. These results should also motivate the community to revisit common problems, because some algorithms may become competitive by being vectorizable, or improved, thanks to AVX-512's novelties. Our source code is publicly available and ready to be used and compared. In the future, we intend to design a parallel implementation of our AVX-512-QS, and we expect the recursive partitioning to be naturally parallelized with a task-based scheme on top of OpenMP.

APPENDIX

A. Scalar Quicksort Algorithm

Algorithm 5: Quicksort
Input: array: an array to sort. length: the size of array.
Output: array: the array sorted.
1  function QS(array, length)
2      QS_core(array, 0, length - 1)
3  function QS_core(array, left, right)
4      if left < right then
5          // Naive method, select value in the middle
6          pivot_idx = ((right - left)/2) + left
7          swap(array[pivot_idx], array[right])
8          partition_bound = partition(array, left, right, array[right])
9          swap(array[partition_bound], array[right])
10         QS_core(array, left, partition_bound - 1)
11         QS_core(array, partition_bound + 1, right)
12     end
13 function partition(array, left, right, pivot_value)
14     for idx_read ← left to right do
15         if array[idx_read] < pivot_value then
16             swap(array[idx_read], array[left])
17             left += 1
18         end
19     end
20     return left

B. Source Code Extracts

inline __m512d AVX512_bitonic_sort_1v(__m512d input) {
    {
        __m512i idxNoNeigh = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xAA, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(4, 5, 6, 7, 0, 1, 2, 3);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xCC, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xAA, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xF0, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(5, 4, 7, 6, 1, 0, 3, 2);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xCC, permNeighMax);
    }
    {
        __m512i idxNoNeigh = _mm512_set_epi64(6, 7, 4, 5, 2, 3, 0, 1);
        __m512d permNeigh = _mm512_permutexvar_pd(idxNoNeigh, input);
        __m512d permNeighMin = _mm512_min_pd(permNeigh, input);
        __m512d permNeighMax = _mm512_max_pd(permNeigh, input);
        input = _mm512_mask_mov_pd(permNeighMin, 0xAA, permNeighMax);
    }
    return input;
}

Code 1: AVX-512 Bitonic sort for one SIMD-vector of double floating-point values.

template <class IndexType>
static inline IndexType AVX512_partition(double array[], IndexType left,
                                         IndexType right, const double pivot) {
    const IndexType S = 8; // (512/8)/sizeof(double)

    if (right - left + 1 < 2 * S) {
        return CoreScalarPartition<double, IndexType>(array, left, right, pivot);
    }

    __m512d pivotvec = _mm512_set1_pd(pivot);

    __m512d left_val = _mm512_loadu_pd(&array[left]);
    IndexType left_w = left;
    left += S;

    IndexType right_w = right + 1;
    right -= S - 1;
    __m512d right_val = _mm512_loadu_pd(&array[right]);

    while (left + S <= right) {
        const IndexType free_left = left - left_w;
        const IndexType free_right = right_w - right;

        __m512d val;
        if (free_left <= free_right) {
            val = _mm512_loadu_pd(&array[left]);
            left += S;
        } else {
            right -= S;
            val = _mm512_loadu_pd(&array[right]);
        }

        __mmask8 mask = _mm512_cmp_pd_mask(val, pivotvec, _CMP_LE_OQ);

        const IndexType nb_low = popcount(mask);
        const IndexType nb_high = S - nb_low;

        _mm512_mask_compressstoreu_pd(&array[left_w], mask, val);
        left_w += nb_low;

        right_w -= nb_high;
        _mm512_mask_compressstoreu_pd(&array[right_w], ~mask, val);
    }
    {
        const IndexType remaining = right - left;
        __m512d val = _mm512_loadu_pd(&array[left]);
        left = right;

        __mmask8 mask = _mm512_cmp_pd_mask(val, pivotvec, _CMP_LE_OQ);

        __mmask8 mask_low = mask & ~(0xFF << remaining);
        __mmask8 mask_high = (~mask) & ~(0xFF << remaining);

        const IndexType nb_low = popcount(mask_low);
        const IndexType nb_high = popcount(mask_high);

        _mm512_mask_compressstoreu_pd(&array[left_w], mask_low, val);
        left_w += nb_low;

        right_w -= nb_high;
        _mm512_mask_compressstoreu_pd(&array[right_w], mask_high, val);
    }
    {
        __mmask8 mask = _mm512_cmp_pd_mask(left_val, pivotvec, _CMP_LE_OQ);

        const IndexType nb_low = popcount(mask);
        const IndexType nb_high = S - nb_low;

        _mm512_mask_compressstoreu_pd(&array[left_w], mask, left_val);
        left_w += nb_low;

        right_w -= nb_high;
        _mm512_mask_compressstoreu_pd(&array[right_w], ~mask, left_val);
    }
    {
        __mmask8 mask = _mm512_cmp_pd_mask(right_val, pivotvec, _CMP_LE_OQ);

        const IndexType nb_low = popcount(mask);
        const IndexType nb_high = S - nb_low;

        _mm512_mask_compressstoreu_pd(&array[left_w], mask, right_val);
        left_w += nb_low;

        right_w -= nb_high;
        _mm512_mask_compressstoreu_pd(&array[right_w], ~mask, right_val);
    }
    return left_w;
}

Code 2: AVX-512 partitioning of a double floating-point array (AVX-512-partition).

REFERENCES

[1] Graefe, G.: Implementing sorting in database systems. ACM Computing Surveys (CSUR), 38(3):10, 2006.
[2] Bishop, L., Eberly, D., Whitted, T., Finch, M., Shantz, M.: Designing a PC game engine. IEEE Computer Graphics and Applications, 18(1):46-53, 1998.
[3] Hoare, C. A. R.: Quicksort. The Computer Journal, 5(1):10-16, 1962.
[4] ISO/IEC 14882:2003(E): Programming Languages - C++, 2003. 25.3.1.1 sort [lib.sort] para. 2.
[5] ISO/IEC 14882:2014(E): Programming Languages - C++, 2014. 25.4.1.1 sort (p. 911).
[6] Musser, D. R.: Introspective sorting and selection algorithms. Softw., Pract. Exper., 27(8):983-993, 1997.
[7] Batcher, K. E.: Sorting networks and their applications. In Proceedings of the April 30-May 2, 1968, spring joint computer conference, pages 307-314. ACM, 1968.
[8] Nassimi, D., Sahni, S.: Bitonic sort on a mesh-connected parallel computer. IEEE Trans. Computers, 28(1):2-7, 1979.
[9] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., Phillips, J. C.: GPU computing. Proceedings of the IEEE, 96(5):879-899, 2008.
[10] Kogge, P. M.: The architecture of pipelined computers. CRC Press, 1981.
[11] Intel: Intel 64 and IA-32 architectures software developer's manual: Instruction set reference (2A, 2B, 2C, and 2D). Available at: https://software.intel.com/en-us/articles/intel-sdm.
[12] Intel: Introduction to Intel advanced vector extensions. Available at: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions.
[13] Intel: Intel architecture instruction set extensions programming reference. Available at: https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf.
[14] Sanders, P., Winkel, S.: Super scalar sample sort. In European Symposium on Algorithms, pages 784-796. Springer, 2004.
[15] Inoue, H., Moriyama, T., Komatsu, H., Nakatani, T.: AA-sort: A new parallel sorting algorithm for multi-core SIMD processors. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 189-198. IEEE Computer Society, 2007.
[16] Furtak, T., Amaral, J. N., Niewiadomski, R.: Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pages 348-357. ACM, 2007.
[17] Chhugani, J., Nguyen, A. D., Lee, V. W., Macy, W., Hagog, M., Chen, Y.-K., Baransi, A., Kumar, S., Dubey, P.: Efficient implementation of sorting on multi-core SIMD CPU architecture. Proceedings of the VLDB Endowment, 1(2):1313-1324, 2008.
[18] Gueron, S., Krasnov, V.: Fast quicksort implementation using AVX instructions. The Computer Journal, 59(1):83-90, 2016.

