Vectorization of Insertion Sort Using Altivec
Abstract
Parallelizing a sort algorithm is not always easy, or even possible, and
when it is possible it does not necessarily bring significant gains in
performance. After reading on Slashdot about a GPU version of Quicksort
([gpu]), I was curious whether common sorting algorithms, such as quicksort,
insertion sort and merge sort, could be vectorized and adapted for use with
Altivec, the PowerPC SIMD unit. The results were more than interesting, in
some cases offering speed gains of 54%. In this paper we present
vectorization techniques for all 3 of these algorithms. Note: This paper
does not yet include the quicksort vectorization.
Figure 1: Sorting 16 sets in parallel using Altivec
sets (as an Altivec register can hold 16 chars, which can be processed in
parallel). Again, the situation is analogous if we have ints or floats,
dividing into 4 sets (or 8 sets if dealing with short ints).
The idea is to sort 4/8/16 sets in parallel and then merge the results. This
is actually a variant of the Shell Sort algorithm ([wikb]). How the merging
stage works is explained later. First we have to deal with sorting 16
parallel columns of data. In theory this is easy, but we must make sure we
handle a vector at a time, in as few instructions as possible, preferably in
one go. To do that, we first have to load the data into Altivec registers.
We will use two registers, as the comparison is made between two sets, one
of which serves as the key. Obviously, this means that the Altivec version
will work only for sizes > 32 chars (which is the size of 2 vectors).
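The column-wise idea can be modelled in portable scalar C. The sketch below is only an illustration, not the Altivec code of this paper: the names LANES, min_u8, max_u8 and sort_columns are hypothetical, and the inner loop over c stands in for what a single pair of vector minimum/maximum instructions would do on a whole register. It assumes the array is laid out as consecutive rows of 16 chars, so that column c is made of the elements a[c], a[16 + c], a[32 + c], and so on.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* chars per vector register (hypothetical name) */

/* Branch-free minimum and maximum of two bytes. */
static uint8_t min_u8(uint8_t x, uint8_t y)
{
    return (uint8_t)(y ^ ((x ^ y) & (uint8_t)-(x < y)));
}

static uint8_t max_u8(uint8_t x, uint8_t y)
{
    return (uint8_t)(x ^ ((x ^ y) & (uint8_t)-(x < y)));
}

/* Sort each of the LANES interleaved columns of a[0 .. rows*LANES)
 * independently, using insertion sort on whole rows. */
void sort_columns(uint8_t *a, size_t rows)
{
    for (size_t i = 1; i < rows; i++) {
        /* Sink row i through the rows above it, one adjacent pair at a
         * time: j takes the values i-1, i-2, ..., 0. */
        for (size_t j = i; j-- > 0; ) {
            uint8_t *cur  = a + j * LANES;
            uint8_t *next = cur + LANES;
            /* Scalar stand-in for one vector compare-exchange. */
            for (size_t c = 0; c < LANES; c++) {
                uint8_t lo = min_u8(cur[c], next[c]);
                uint8_t hi = max_u8(cur[c], next[c]);
                cur[c]  = lo;
                next[c] = hi;
            }
        }
    }
}
```

Calling sort_columns(a, n / 16) on an array of n chars (n a multiple of 16) leaves every column sorted from top to bottom. The rows themselves are not sorted, which is why a merging stage is still needed.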
}
}
Now this, repeated 16 times over a vector, is what Altivec does. What is
the benefit? The code does not use any branches, so there is nothing for the
processor to mispredict and the pipeline stays full, giving very good
performance. It is also constant, meaning it takes exactly the same number
of instructions regardless of the data given. So the initial scalar loop can
be replaced by the following Altivec instructions (the following code sorts
an array of chars):
    // Get the key set of elements
    // and load it into an Altivec vector register
    cur = a + i*16;
    va_key = vec_ld(0, cur);
    // Compare all the previous sets to the key set.
    for (j = i-1; j >= 0; j--) {
        cur = a + j*16;
        /* Load current and next 16-byte sets
           into Altivec registers
        */
        va_cur = vec_ld(0, cur);
        va_next = vec_ld(16, cur);
The result will be 16 sorted sets of $N/16$ elements. As the algorithm is
still basically an Insertion Sort, the time required is still $O(N'^2)$, but
this time $N' = N/16$. So, for example, for $N = 64$ elements the original
algorithm would need $N^2 = 4096$ steps, while the Altivec version, with
$N' = 4$, would need just $N'^2 = 16$ steps.
So is this over? No, because the result is still not what we want: we have
to merge the sorted sets into a final sorted set. For that we have to use an
extended version of Merge Sort.
Figure 2: Merging the 16 sets using extended Merge sort
initially cover N cases, optimizing afterwards for 4/8/16 sets). Let’s begin by
describing the algorithm, assuming we have N sets of sorted data:
5. Find the minimum value(s) in the N-tuple and note its position(s), pos.
6. Insert the value(s) into the next available slot in the result[] array.
7. Increase indices[pos] by 1 (for all elements if more than one is found).
8. Set flag[pos] to 1.
9. Go back to 5 until all elements in flag[] are set to 1.
10. Go back to 4 while there are remaining sets.
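As a rough illustration of the merging stage, the sketch below shows a plain k-way merge in scalar C. It is a simplification under stated assumptions, not the listing from this paper: the function name merge_sets and the constant MAX_SETS are hypothetical, the sets are assumed to be the interleaved columns left by the parallel sort (element k of set s sits at a[k*sets + s]), and it emits one minimum per pass instead of handling all equal minima at once. It keeps an index per set and records exhausted sets as bits of a 32-bit integer.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SETS 32 /* one bit per set in a 32-bit flag (hypothetical name) */

/* Merge `sets` sorted, interleaved columns of `columnsize` elements each
 * from a[] into result[], which must hold sets*columnsize elements. */
void merge_sets(const uint8_t *a, uint8_t *result,
                size_t sets, size_t columnsize)
{
    size_t indices[MAX_SETS] = {0}; /* next unread element of each set */
    uint32_t done = 0;              /* bit s set: set s is exhausted   */
    size_t length = sets * columnsize;

    if (sets == 0 || sets > MAX_SETS || columnsize == 0)
        return;

    for (size_t out = 0; out < length; out++) {
        size_t pos = 0;
        uint8_t min = 0;
        int found = 0;

        /* Find the smallest head among the sets that still have data. */
        for (size_t s = 0; s < sets; s++) {
            if (done & (UINT32_C(1) << s))
                continue;
            uint8_t cur = a[indices[s] * sets + s];
            if (!found || cur < min) {
                min = cur;
                pos = s;
                found = 1;
            }
        }

        /* Emit it, advance that set, and flag the set when it runs out. */
        result[out] = min;
        if (++indices[pos] == columnsize)
            done |= UINT32_C(1) << pos;
    }
}
```

With 16 sets the inner search looks at no more than 16 heads per output element, so the merge costs at most 16*N comparisons for N elements.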
uint32_t flag = 0, remsets = sets, length = sets*columnsize;
size_t i, j, set = 0, index = 0;
uint8_t min, cur;
Size    It. (Altivec)    It. (Scalar)    Ratio
   8               41              14     0.33
  16               86              59     0.75
  32              188             286     1.10
  64              440             917     1.33
 128             1136            4335     2.25
 256             3296           16019     2.67
 512            10688           66644     3.24
1024            37760          262649     3.50
2048           141056         1064615     3.73
4096           544256         4237939     3.81
8192          2137088        16521839     3.76
    if (target)
        free(target);
    return;
}
You may have noticed that the code is scalar, with nothing Altivec-specific.
Some Altivec optimizations are possible here, especially in finding the
minimum value: Altivec provides vec_min, which computes the element-wise
minimum of two vectors in a single instruction, so the minimum of a whole
vector can be obtained in a few such steps combined with shifts or permutes.
This will follow in a next revision of this paper.
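To make the minimum-finding idea concrete, here is a portable scalar model of a horizontal minimum built only from an element-wise minimum and a lane rotation. It is a sketch under assumptions, not Altivec code: the type vec16 and the functions model_min, model_rotate and horizontal_min are hypothetical stand-ins. On real hardware the element-wise step would correspond to vec_min on two vectors, and the rotation to a shift or permute of the register.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16

/* Stand-in for a 16-byte vector register (hypothetical type). */
typedef struct { uint8_t lane[LANES]; } vec16;

/* Element-wise minimum of two vectors. */
static vec16 model_min(vec16 x, vec16 y)
{
    vec16 r;
    for (size_t i = 0; i < LANES; i++)
        r.lane[i] = x.lane[i] < y.lane[i] ? x.lane[i] : y.lane[i];
    return r;
}

/* Rotate the lanes left by n positions. */
static vec16 model_rotate(vec16 x, size_t n)
{
    vec16 r;
    for (size_t i = 0; i < LANES; i++)
        r.lane[i] = x.lane[(i + n) % LANES];
    return r;
}

/* Smallest of the 16 lanes, in log2(16) = 4 rotate-and-min steps.
 * After the last step every lane holds the overall minimum. */
uint8_t horizontal_min(vec16 v)
{
    for (size_t shift = LANES / 2; shift > 0; shift /= 2)
        v = model_min(v, model_rotate(v, shift));
    return v.lane[0];
}
```

The reduction yields only the value. The merge also needs the position of the minimum, which would take an extra comparison of the original vector against the broadcast result.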
Also worth mentioning: we have used a 32-bit integer, setting individual
bits to denote that a set has been completed. This was done for performance
reasons, but it also imposes a limitation: we cannot merge more than 32 sets
of sorted data. That is quite acceptable for our purposes, as we need at
most 16 sets.
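The bit-flag bookkeeping can be sketched as follows. The helper names mark_done, all_done_mask and all_done are hypothetical and are not taken from the listing in this paper. The one subtle point is the full-width case: shifting a 32-bit value by 32 is undefined in C, so a mask for exactly 32 sets has to be special-cased.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mark set number pos (0..31) as completed. */
static inline uint32_t mark_done(uint32_t flag, unsigned pos)
{
    return flag | (UINT32_C(1) << pos);
}

/* Mask with the low `sets` bits set; sets is expected to be 1..32.
 * A shift by 32 would be undefined, hence the special case. */
static inline uint32_t all_done_mask(unsigned sets)
{
    return sets >= 32 ? UINT32_MAX : (UINT32_C(1) << sets) - 1;
}

/* True once every one of the `sets` sets has been marked. */
static inline bool all_done(uint32_t flag, unsigned sets)
{
    uint32_t mask = all_done_mask(sets);
    return (flag & mask) == mask;
}
```

Testing completion is then a single mask-and-compare, whatever the number of sets.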
This Extended Merge Sort algorithm, like the original algorithm, is not an
in-place sorting algorithm; that is, the data are sorted into another array
and then copied back to the original array. We have used memcpy() for the
copying, as we believe it to be faster than other methods (e.g. per-element
copying), but there is no other particular reason for this choice. In
particular, since this function has already been vectorized for Altivec, we
might benefit from that as well.
Size     It. (Altivec)    It. (Scalar)    Ratio
  128             2048            4199     1.00
  256             4096           16429     2.00
  512             8192           67361     4.25
 1024            16384          259680     7.33
 2048            32768         1036143    13.79
 4096            65536         4116630    20.27
 8192           131072        16996544    29.85
16384           262144        67815998    37.50
32768           524288       265531367    42.12
3 Conclusions
Altivec is a very powerful tool, but so far its use has centered mainly on
multimedia and/or linear algebra scientific applications. We strongly
believe that Altivec can be of real benefit to generic, system-wide OS usage
as well. With a series of papers like this one, we intend to show that it
can be used for other generic applications, such as data manipulation, other
sort algorithms, etc.
References
[gpu] Gpusort.
[wika] Shell sort.
[wikb] Shell sort.