Pattern-Defeating Quicksort
Orson R. L. Peters
[email protected]
Leiden University
arXiv:2106.05123v1 [cs.DS] 9 Jun 2021
Abstract. A new solution for the Dutch national flag problem is pro-
posed, requiring no three-way comparisons, which gives quicksort a proper
worst-case runtime of O(nk) for inputs with k distinct elements. This is
used together with other known and novel techniques to construct a hy-
brid sort that is never significantly slower than regular quicksort while
speeding up drastically for many input distributions.
1 Introduction
Arguably the most used hybrid sorting algorithm at the time of writing is
introsort[12]. A combination of insertion sort, heapsort[19] and quicksort[10],
it is very fast and can be seen as a truly hybrid algorithm. The algorithm per-
forms introspection and decides when to change strategy using some very simple
heuristics. If the recursion depth becomes too deep, it switches to heapsort, and
if the partition size becomes too small it switches to insertion sort.
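The dispatch logic described above can be sketched as follows. This is an illustrative reconstruction, not libstdc++'s actual code; the small-partition threshold of 16 and the depth budget of 2 log₂ n are typical choices, not values taken from this paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Introsort-style dispatch: insertion sort for small partitions,
// heapsort when the depth budget is exhausted, quicksort otherwise.
template <typename It>
void introsort_rec(It begin, It end, int depth_budget) {
    if (end - begin <= 16) {
        // Small partition: finish with insertion sort.
        for (It i = begin; i != end; ++i)
            std::rotate(std::upper_bound(begin, i, *i), i, i + 1);
        return;
    }
    if (depth_budget == 0) {
        // Recursion got too deep: degenerate input, switch to heapsort.
        std::make_heap(begin, end);
        std::sort_heap(begin, end);
        return;
    }
    // Plain quicksort step (middle element as pivot for brevity; real
    // introsort uses median-of-three). Three-way split keeps equal
    // elements out of both recursive calls.
    auto pivot = *(begin + (end - begin) / 2);
    It lt = std::partition(begin, end, [&](const auto& x) { return x < pivot; });
    It gt = std::partition(lt, end, [&](const auto& x) { return !(pivot < x); });
    introsort_rec(begin, lt, depth_budget - 1);
    introsort_rec(gt, end, depth_budget - 1);
}

template <typename It>
void introsort(It begin, It end) {
    // Allow roughly 2 * log2(n) quicksort levels before bailing out.
    int budget = end - begin > 1 ? 2 * static_cast<int>(std::log2(end - begin)) : 0;
    introsort_rec(begin, end, budget);
}
```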
The goal of pattern-defeating quicksort (or pdqsort ) is to improve on in-
trosort’s heuristics to create a hybrid sorting algorithm with several desirable
properties. It maintains quicksort’s logarithmic memory usage and fast real-
world average case, effectively recognizes and combats worst case behavior (de-
terministically), and runs in linear time for a few common patterns. It also
unavoidably inherits in-place quicksort's instability, so pdqsort cannot be used
in situations where stability is required.
Section 2 gives an overview of pattern-defeating quicksort and related work.
In Section 3 we propose our new solution for the Dutch national flag problem
and prove its O(nk) worst-case time for inputs with k distinct elements.
Section 4 describes other novel techniques used in pdqsort, while Section 5
describes various previously known ones. The final section presents an
empirical performance evaluation of pdqsort.
This paper comes with an open-source, state-of-the-art C++ implementation[14].
The implementation is fully compatible with std::sort and is released under a
permissive license. Standard library writers are invited to evaluate and adopt the
implementation as their generic unstable sorting algorithm. At the time of writ-
ing the Rust programming language has adopted pdqsort for sort unstable
in their standard library thanks to a porting effort by Stjepan Glavina. The
implementation is also available in the C++ Boost.Sort library.
2 Overview and related work
A naive quicksort implementation might trigger the Θ(n2 ) worst case on the
all-equal input distribution by placing equal comparing elements in the same
partition. A smarter implementation either always or never swaps equal elements,
resulting in average-case performance, as equal elements will be distributed
evenly across the partitions. However, an input with many equal comparing
elements is rather common¹, and we can do better. Handling equal elements efficiently
requires tripartite partitioning, which is equivalent to Dijkstra’s Dutch national
flag problem[6].
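For reference, Dijkstra's classical tripartite partition solves the Dutch national flag problem in a single pass, but at the cost of a three-way comparison (up to two two-way comparisons) per element; avoiding exactly this cost is the point of the scheme proposed later. A minimal sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Dijkstra's three-way (Dutch national flag) partition. One pass,
// but each element is compared against the pivot up to twice.
template <typename T>
void dnf_partition(std::vector<T>& a, const T& pivot) {
    std::size_t lo = 0, i = 0, hi = a.size();
    while (i < hi) {
        if (a[i] < pivot)
            std::swap(a[lo++], a[i++]);  // grow the < region
        else if (pivot < a[i])
            std::swap(a[i], a[--hi]);    // grow the > region; recheck a[i]
        else
            ++i;                         // equal to pivot: stays in the middle
    }
}
```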
Pattern-defeating quicksort uses the fast 'approaching pointers' method[4]
for partitioning. Two indices are initialized, i at the start and j at the
end of the sequence. i is incremented and j is decremented while maintaining
an invariant, and when both invariants are invalidated the elements at the
pointers are swapped, restoring the invariant. The algorithm ends when the
pointers cross. Implementers must take great care, as this algorithm is
conceptually simple, but is very easy to get wrong.

Fig. 1. The invariant used by Bentley-McIlroy. After partitioning, the equal
elements stored at the beginning and at the end are swapped to the middle.

Bentley and McIlroy describe an invariant for partitioning that swaps equal
elements to the edges of the partition, and swaps them back into the middle
after partitioning. This is efficient when there are many equal elements, but
has a significant drawback: every element needs to be explicitly checked for
equality to the pivot before swapping, costing another comparison. This
happens regardless of whether there are many equal elements, costing
performance in the average case.

Fig. 2. The invariant used by partition_right of pdqsort, shown at
respectively the initial, halfway and finished state. When the loop is done
the pivot gets swapped into its correct position. p is the single pivot
element. r is the pointer returned by the partition routine indicating the
pivot position. The dotted lines indicate how i and j change as the algorithm
progresses. This is a simplified representation, e.g. i is actually off by one.

Unlike previous algorithms, pdqsort's partitioning scheme is not
self-contained. It uses two separate partition functions: one that groups
elements equal to the pivot in the left partition (partition_left), and one
that groups elements equal to the pivot in the right partition
(partition_right). Note that both partition functions can always be
implemented using a single comparison per element, as a < b ⇔ a ≱ b and
a ≮ b ⇔ a ≥ b.
For brevity we will be using a simplified, incomplete C++ implementation to
illustrate pdqsort. It only supports int and compares using comparison
operators. It is however trivial to extend this to arbitrary types and
comparison functions.

¹ It is a common technique to define a custom comparison function that only
uses a subset of the available data to sort on, e.g. sorting cars by their
color. Then you have many elements that aren't fundamentally equal, but do
compare equal in the context of a sorting operation.
int* part_left(int* l, int* r) {
    int* i = l; int* j = r;
    int p = *l;

int* part_right(int* l, int* r) {
    int* i = l; int* j = r;
    int p = *l;
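The code listing is cut off here by the excerpt. As a rough sketch of how such a crossing-pointers loop can be completed, following the invariant of Fig. 2 (a hypothetical reconstruction, not the paper's exact implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical completion of part_right: elements strictly less than
// the pivot end up left of the returned pointer, elements greater than
// or equal to it end up right of it. Assumes *l is the pivot and r
// points one past the last element.
int* part_right(int* l, int* r) {
    int p = *l;  // pivot: the first element
    int* i = l;  // left scan pointer
    int* j = r;  // right scan pointer (one past the end)

    while (true) {
        // Advance i past elements less than the pivot, retreat j past
        // elements greater than or equal to it.
        do { ++i; } while (i < r && *i < p);
        do { --j; } while (*j >= p && j > l);
        if (i >= j) break;       // pointers crossed: partitioning done
        std::swap(*i, *j);       // both invariants violated: swap restores them
    }
    std::swap(*l, *j);           // move the pivot into its final position
    return j;                    // pointer to the pivot's position
}
```

Note that a single comparison per element suffices: *i < p on the left scan and its negation (*j >= p, i.e. p ≮ *j) on the right scan.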
Proof. If the subsequence is a direct right child of its parent partition, its pre-
decessor is the pivot of the parent. However, if the subsequence is the left child
of its parent partition, its predecessor is the predecessor of its parent. Since our
lemma states that our subsequence has a predecessor, it is not leftmost and there
must exist some ancestor of which the subsequence is a right child.
Proof. Due to the counter ticking down, after log n levels that contain a bad
partition the call tree terminates in heapsort. At each level we may do at most
O(n) work, giving a runtime of O(n log n).
⁴ Note that this is an upper bound. When k is big, O(n log n) still applies.
⁵ This counter is maintained separately in every subtree of the call graph;
it is not global to the sort process. Thus, if after the first partition the
left partition degenerates in the worst case, this does not imply the right
partition also does.
Lemma 6. At most O(n log n) time is spent in pdqsort on good partitions.
Proof. Consider a scenario where quicksort’s partition operation always puts pn
elements in the left partition, and (1 − p)n in the right. This consistently forms
the worst possible good partition. Its runtime can be described with the following
recurrence relation:
T (n, p) = n + T (pn, p) + T ((1 − p)n, p)
For any p ∈ (0, 1) the Akra-Bazzi[1] theorem shows Θ(T (n, p)) = Θ(n log n).
Theorem 2. Pattern-defeating quicksort has complexity O(n log n).
Proof. Pattern-defeating quicksort spends O(n log n) time on good partitions,
bad partitions, and degenerate cases (due to heapsort also being O(n log n)).
These three cases exhaustively enumerate any recursive call to pdqsort, thus
pattern-defeating quicksort has complexity O(n log n).
⊓⊔
We have proven that for any choice of p ∈ (0, 1) the complexity of pattern-
defeating quicksort is O(n log n). However, this does not tell us what a good
choice for p is.
Yuval Filmus[8] solves the above recurrence, allowing us to study the slowdown
of quicksort compared to the optimal case of p = 1/2. He finds that the
solution is

    lim_{n→∞} T(n, p) / T(n, 1/2) = 1 / H(p)

where H is Shannon's binary entropy function:

    H(p) = −p log₂(p) − (1 − p) log₂(1 − p)
Plotting this function gives us a look at quicksort's fundamental performance
characteristics:

[Plot of the slowdown factor 1/H(p) against p ∈ (0, 0.5).]

Fig. 4. Slowdown of T(n, p) compared to T(n, 1/2). This beautifully shows why
quicksort is generally so fast. Even if every partition is split 80/20, we're
still running only 40% slower than the ideal case.
From benchmarks we have found that heapsort is roughly twice as slow as
quicksort when sorting randomly shuffled data. If we then choose p such that
H(p)⁻¹ = 2, a bad partition becomes roughly synonymous with 'worse than
heapsort'.
The advantage of this scheme is that p can be tweaked if the architecture
changes, or you have a different worst-case sorting algorithm instead of heapsort.
We have chosen p = 0.125 as the cutoff value for bad partitions for two
reasons: it is reasonably close to being twice as slow as the average sorting
operation, and it can be computed using a simple bitshift on any platform.
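The two claims above are easy to check numerically. A small sketch, with illustrative function and parameter names (not the paper's exact code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Shannon's binary entropy; 1/H(p) is the predicted slowdown of a
// quicksort whose partitions always split p : (1 - p). At the cutoff
// p = 1/8 this evaluates to roughly 1.84, reasonably close to
// heapsort's measured factor of ~2.
double binary_entropy(double p) {
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// The p < 1/8 test itself needs only a bitshift: a partition is bad
// when either side holds fewer than size/8 elements.
bool is_bad_partition(std::size_t l_size, std::size_t size) {
    std::size_t r_size = size - l_size - 1; // one slot taken by the pivot
    return l_size < (size >> 3) || r_size < (size >> 3);
}
```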
Using this scheme, as opposed to introsort's static logarithmic recursion
depth limit, is a more precise way of preventing the worst case. While testing
we noticed that introsort (and to a lesser extent pdqsort) often has a rough
start while sorting an input with a bad pattern, but after some partitions the
pattern is broken up. Our scheme then proceeds to use the now-fast quicksort
for the rest of the sorting, whereas introsort weighs the bad start too
heavily and degenerates to heapsort.
Some input patterns form a self-similar structure after partitioning, which
can cause a similar pivot to be repeatedly chosen. We want to eliminate this.
The reason can be found in Figure 4 as well: the difference between a good and
a mediocre pivot is small, so repeatedly choosing a good pivot has a
relatively small payoff, but the difference between a mediocre and a bad pivot
is massive. An extreme example is the traditional O(n²) worst case: repeatedly
partitioning without real progress.
The classical way to deal with this is by randomizing pivot selection (also
known as randomized quicksort). However, this has multiple disadvantages:
sorting is no longer deterministic, the access patterns become unpredictable,
and extra runtime is required to generate random numbers. Randomization also
destroys beneficial patterns: e.g. the technique in Section 5.2 would no
longer work for descending patterns, and performance on 'mostly sorted' input
patterns would also degrade.
Pattern-defeating quicksort takes a different approach. After partitioning we
check if the partition was bad. If it was, we swap our pivot candidates for others.
In our implementation pdqsort chooses the median of the first, middle and last
element of a subsequence as the pivot, and after encountering a bad partition
swaps the first and last candidates for ones found at roughly the 25th and
75th percentiles. When the partition is big enough that we would be using
Tukey's ninther for pivot selection, we also swap the ninther candidates for
ones at roughly the 25th and 75th percentiles of the partition.
With this scheme pattern-defeating quicksort is still fully deterministic, and
with minimal overhead breaks up many of the patterns that regular quicksort
struggles with. If the downsides of non-determinism do not scare you and you like
the guarantees that randomized quicksort provides (e.g. protection against DoS
attacks) you can also swap out the pivot candidates with random candidates.
It’s still a good idea to only do this after a bad partition to prevent breaking up
beneficial patterns.
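The candidate-swapping step can be sketched as follows. The names and structure are illustrative (the real implementation differs in details, e.g. where the median ends up and how small partitions are handled):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Deterministic pivot-candidate shuffling (sketch). Normally the pivot
// is the median of the first, middle and last elements; after a bad
// partition the outer candidates are replaced by elements near the 25%
// and 75% marks, deterministically breaking self-similar patterns.
template <typename It>
void choose_pivot(It begin, It end, bool last_partition_was_bad) {
    std::size_t n = end - begin; // assumed >= 3 for this sketch
    It first = begin, mid = begin + n / 2, last = end - 1;
    if (last_partition_was_bad) {
        std::iter_swap(first, begin + n / 4);     // ~25th percentile
        std::iter_swap(last, begin + 3 * n / 4);  // ~75th percentile
    }
    // Median-of-three of the (possibly swapped) candidates; afterwards
    // the median sits at *begin, ready to be used as the pivot.
    if (*mid < *first) std::iter_swap(mid, first);
    if (*last < *first) std::iter_swap(last, first);
    if (*last < *mid) std::iter_swap(last, mid);
    std::iter_swap(begin, mid);
}
```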
6 Experimental results
6.1 Methodology
We present a performance evaluation of pattern-defeating quicksort with (bpdq)
and without (pdq) block partitioning, introsort from libstdc++’s implementa-
tion of std::sort (std), Timothy van Slyke’s C++ Timsort[15] implementation[16]
(tim), BlockQuicksort (bq) and the sequential version of In-Place Super Scalar
Samplesort[3] (is⁴o). The latter algorithm represents, to our knowledge, the
state of the art in sequential in-place comparison sorting for large amounts
of data.
In particular the comparison with BlockQuicksort is important as it is a
benchmark for the novel methods introduced here. The code repository for Block-
Quicksort defines many different versions of BlockQuicksort, one of which also
uses Hoare-style crossing pointers partitioning and Tukey’s ninther pivot selec-
tion. This version is chosen for the closest comparison as it most resembles our
algorithm. The authors of BlockQuicksort also proposed their own duplicate han-
dling scheme. To compare the efficacy of their and our approach we also chose
the version of BlockQuicksort with it enabled.
We evaluate the algorithms for three different data types. The simplest is
int, which is a simple 64-bit integer. However, not all data types have a
branchless comparison function. For that reason we also have str, which is a
std::string representation of int (padded with zeroes such that lexicographic
order matches the numeric order). Finally, to simulate an input with an
expensive comparison function, we evaluate bigstr, which is similar to str but
is prepended with 1000 zeroes to artificially inflate compare time. An
algorithm that is more efficient with the number of comparisons it performs
should gain an edge there.

¹⁰ After each iteration at least one offsets buffer is empty. We fill any
buffer that is empty.
¹¹ We skip over many important details and optimizations here, as they are
more relevant to BlockQuicksort than to pattern-defeating quicksort. The full
implementation has loop unrolling, swaps elements using only two moves per
element rather than three, and uses all comparison information gained while
filling blocks.
The algorithms are evaluated on a variety of input distributions: shuffled
uniformly distributed values (uniform: A[i] = i), shuffled distributions with
many duplicates (dupsq: A[i] = i mod ⌊√n⌋, dup8: A[i] = i⁸ + n/2 mod n, mod8:
A[i] = i mod 8, and ones: A[i] = 1), partially shuffled uniform distributions
(sort50, sort90, sort99, which respectively have the first 50%, 90% and 99% of
the elements already in ascending order), and some traditionally notoriously
bad cases for median-of-3 pivot selection (organ: first half of the input
ascending and the second half descending, merge: two equally sized ascending
arrays concatenated). Finally we also have the inputs asc and desc, which are
already sorted in ascending and descending order respectively.
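The duplicate-heavy generators can be written directly from their formulas. A sketch (the shuffling step is omitted, and dup8's i⁸ is computed mod n stepwise to avoid 64-bit overflow; function names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// mod8: A[i] = i mod 8 -- only eight distinct values.
std::vector<std::uint64_t> make_mod8(std::size_t n) {
    std::vector<std::uint64_t> a(n);
    for (std::size_t i = 0; i < n; ++i) a[i] = i % 8;
    return a;
}

// dup8: A[i] = i^8 + n/2 mod n.
std::vector<std::uint64_t> make_dup8(std::size_t n) {
    std::vector<std::uint64_t> a(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t p = 1;
        for (int k = 0; k < 8; ++k) p = (p * (i % n)) % n; // i^8 mod n
        a[i] = (p + n / 2) % n;
    }
    return a;
}

// organ: first half ascending, second half descending (organ pipe).
std::vector<std::uint64_t> make_organ(std::size_t n) {
    std::vector<std::uint64_t> a(n);
    for (std::size_t i = 0; i < n; ++i) a[i] = std::min(i, n - 1 - i);
    return a;
}
```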
The evaluation was performed on an AMD Ryzen Threadripper 2950x clocked
at 4.2GHz with 32GB of RAM. All code was compiled with GCC 8.2.0 with flags
-march=native -m64 -O2. To preserve the integrity of the experiment no two
instances were tested simultaneously and no other resource intensive processes
were run at the same time. For all random shuffling a seed was deterministically
chosen for each size and input distribution, so all algorithms received the exact
same input for the same experiment. Each benchmark was re-run until at least
10 seconds had passed and for at least 10 iterations. The former condition re-
duces timing noise by repeating small instances many times whereas the latter
condition reduces the influence of a particular random shuffle. The mean number
of cycles spent is reported, divided by n log2 n to normalize across sizes. In total
the evaluation program spent 9 hours sorting (with more time spent to prepare
input distributions).
As the full results are quite large (12 distributions × 3 data types = 36
plots), they are included in Appendix A.
We conclude that the heuristics and techniques presented in this paper have
little overhead, and effectively handle various input patterns. Pattern-
defeating quicksort is often the best overall choice of algorithm for small to
medium input sizes or data type sizes. It and other quicksort variants suffer
on datasets too large to fit in cache, where is⁴o shines. The latter algorithm
however suffers from bad performance on smaller sizes; future research could
perhaps combine the best of these two algorithms.
References
1. Akra, M., Bazzi, L.: On the solution of linear recurrence equations. Computational
Optimization and Applications 10(2), 195–210 (1998)
2. Aumüller, M., Dietzfelbinger, M., Klaue, P.: How good is multi-
pivot quicksort? ACM Trans. Algorithms 13(1), 8:1–8:47 (Oct 2016).
https://doi.org/10.1145/2963102
3. Axtmann, M., Witt, S., Ferizovic, D., Sanders, P.: In-place parallel
super scalar samplesort (IPS⁴o). CoRR abs/1705.02257 (2017),
http://arxiv.org/abs/1705.02257
4. Bentley, J.L., McIlroy, M.D.: Engineering a sort function. Software: Practice and
Experience 23(11), 1249–1265 (1993)
5. Codish, M., Cruz-Filipe, L., Nebel, M., Schneider-Kamp, P.: Optimizing sorting
algorithms by using sorting networks. Formal Aspects of Computing 29(3), 559–
579 (May 2017). https://doi.org/10.1007/s00165-016-0401-3
6. Dijkstra, E.W.: A Discipline of Programming. Prentice Hall PTR, Upper Saddle
River, NJ, USA, 1st edn. (1997)
7. Edelkamp, S., Weiß, A.: BlockQuicksort: How branch mispredictions don’t affect
quicksort. CoRR abs/1604.06697 (2016)
8. Filmus, Y.: Solving recurrence relation with two recursive calls. Computer Science
Stack Exchange, https://cs.stackexchange.com/q/31930
9. Hinnant, H., et al.: libc++ C++ standard library. http://libcxx.llvm.org/
(2018), [Online; accessed 2018]
10. Hoare, C.A.: Quicksort. The Computer Journal 5(1), 10–16 (1962)
11. Kurosawa, N.: Quicksort with median of medians is considered practical. CoRR
abs/1608.04852 (2016)
12. Musser, D.: Introspective sorting and selection algorithms. Software Practice and
Experience 27, 983–993 (1997)
13. Musser, D.R., Stepanov, A.A.: Algorithm-oriented generic libraries. Software: Prac-
tice and Experience 24(7), 623–642 (1994)
14. Peters, O.R.L.: Pattern-defeating Quicksort Implementation.
https://github.com/orlp/pdqsort (2018), [Online; accessed 2018]
15. Peters, T.: Timsort. http://svn.python.org/projects/python/trunk/Objects/listsort.txt
(2002), [Online; accessed 2019]
16. van Slyke, T.: Timsort implementation. https://github.com/tvanslyke/timsort-cpp
(2018), [Online; accessed 2019]
17. Tukey, J.W.: The ninther, a technique for low-effort robust (resistant) location in
large samples. In: Contributions to Survey Sampling and Applied Statistics, pp.
251–257. Elsevier (1978)
18. Wild, S., Nebel, M.E.: Average case analysis of java 7’s dual pivot quicksort. In:
Epstein, L., Ferragina, P. (eds.) Algorithms – ESA 2012. pp. 825–836. Springer
Berlin Heidelberg, Berlin, Heidelberg (2012)
19. Williams, J.W.J.: Algorithm 232: Heapsort. Communications of the ACM 7(6),
347–348 (1964)
[Appendix A plots omitted here: time spent / n log₂ n [cycles] against input
size n (2¹⁰ up to 2²⁸, range depending on data type), with panels for the
sort90, sort99, asc, desc, mod8, ones, organ and merge distributions.]