Sorting Stably, in Place, with O(n log n) Comparisons and O(n) Moves
Gianni Franceschini
Dipartimento di Informatica, Università di Pisa,
Largo B, Pontecorvo 3, 56127 Pisa, Italy
[email protected]
1. Introduction
In the comparison model the only operations allowed on the totally ordered domain of
the input elements are the comparison of two elements and the transfer of an element
from one cell of memory to another. Therefore, in this model it is natural to measure the
efficiency of an algorithm with three metrics: the number of comparisons it requires, the
number of element moves it performs and the number of auxiliary memory cells it uses,
besides the ones strictly necessary for the input elements. It is well known that in order to
sort a sequence of n elements, at least n log n −n log e comparisons have to be performed
in the worst case. Munro and Raman [15] set the lower bound for the number of moves
to (3/2)n. An in-place or space-optimal algorithm uses O(1) auxiliary memory cells.
In the general case of input sequences with repeated elements, an important requirement
for a sorting algorithm is to be stable: the relative order of equal elements in the final
sorted sequence is the same found in the original one.
Two basic techniques are very common when space efficiency of algorithms and data
structures in the comparison model is the objective. The first is bit stealing [12]: a
bit of information is encoded in the relative order of a pair of distinct input
elements.
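For concreteness, bit stealing can be sketched in a few lines (a hypothetical Python sketch; `seq` is the input array and `i`, `j` index a pair of distinct elements reserved for encoding):

```python
def read_bit(seq, i, j):
    """Decode the bit stored in the pair (seq[i], seq[j]):
    increasing order encodes 0, decreasing order encodes 1.
    Costs exactly one comparison."""
    return 0 if seq[i] < seq[j] else 1

def write_bit(seq, i, j, bit):
    """Set the encoded bit, swapping the two distinct elements if needed.
    Costs one comparison and at most one swap."""
    if read_bit(seq, i, j) != bit:
        seq[i], seq[j] = seq[j], seq[i]
```

Writing a bit costs one comparison and at most one swap, which is why the technique fits a strict move budget.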
The second technique is internal buffering [11], in which some of the input elements
are used as placeholders in order to simulate a working area and permute the other
elements at less cost. The internal buffering is one of the most powerful techniques for
space-efficient algorithms and data structures. However, it is easy to understand how
disruptive the internal buffering is when the stability of the algorithm is an objective.
If the placeholders are not distinct, the original order of identical placeholders can be
lost using the simulated working area. As a witness of the clash between stability and
internal buffering technique, we can cite the difference in complexity between the first
in-place, linear-time merging algorithm, due to Kronrod [11], and the first stable one by
Pardo [16].
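To make the technique concrete, here is a minimal Kronrod-style sketch (an illustration, not the algorithm of this paper): two sorted blocks are merged using k buffer elements as placeholders, with swaps only. Note how the buffer elements come back permuted, which is exactly why stability clashes with internal buffering when placeholders are not distinct.

```python
def buffered_merge(a, k):
    """Merge sorted blocks X = a[k:2k] and Y = a[2k:] using the k buffer
    elements in a[0:k] as placeholders; only swaps are used, O(1) extra
    space. On return, a[k:] holds the merged sequence and the buffer
    elements sit in a[0:k], possibly permuted."""
    n = len(a)
    a[0:k], a[k:2*k] = a[k:2*k], a[0:k]    # swap X into the buffer area
    i, j, out = 0, 2*k, k
    while i < k and j < n:
        if a[i] <= a[j]:                   # take from X; <= keeps stability
            a[out], a[i] = a[i], a[out]
            i += 1
        else:                              # take from Y
            a[out], a[j] = a[j], a[out]
            j += 1
        out += 1
    while i < k:                           # flush the rest of X
        a[out], a[i] = a[i], a[out]
        i += 1
        out += 1
    # any remaining Y elements are already in their final positions
```

Every element of X and Y is moved O(1) times; the price is that the placeholders' mutual order is destroyed.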
obtain a “slow” auxiliary encoding memory that we will use in the rest of the paper to
sort the remaining m = O(n) elements laid down in sequence A. Finally, after A is
sorted, the Θ(n/log n) elements devoted to information encoding will be sorted using
the normal in-place, stable mergesort.
totally compact with efficient search but with possibly slow insertion. When we sort C
in the presence of many distinct elements, we want a structure that is not compact, as it
exploits the set of distinct buffer elements, but that has an efficient insertion, in particular
for what concerns the complexity bound on the number of moves (we can only afford
O(1) moves in an amortized sense for each insertion).
Finally, after we have used the elements in B and the structure to build Θ(log³ n) runs
of sorted elements out of the original sequence C left to be sorted from Section 4, we
have to merge these runs stably and within our computational bounds. For this purpose,
we introduce a multi-way stable merging technique requiring a very limited number of
placeholders to deliver the final sorted sequence.
3. Stealing Bits
As we mentioned in the Introduction, with the bit-stealing technique (see [12]) the value
of a bit is encoded in the relative order of two distinct elements (e.g., the increasing order
for 0 and the decreasing order for 1). In this section we show how to collect Θ(n/log n)
pairs of distinct elements, stably and within our computational bounds.
The rank of an element xi in a sequence S = x1 · · · xt is the cardinality of the
multiset {xj ∈ S | xj < xi, or xj = xi and j ≤ i}.
The rank of an element x in a set S is similarly defined. Let r = n/log n and let π′
and π″ be, respectively, the element with rank r and the element with rank n − r + 1 in
the input sequence. We want to stably and in-place partition the input sequence into five
zones J′ P′ A P″ J″ such that, for each j′ ∈ J′, p′ ∈ P′, a ∈ A, p″ ∈ P″ and j″ ∈ J″,
we have that j′ < p′ < a < p″ < j″.
That can be done in O(n) comparisons and O(n) moves using the stable, in-place
selection and the stable, in-place partitioning of Katajainen and Pasanen [7], [8].
Zones J′ and J″ can be sorted stably and in place in O(n) time simply using a stable
in-place mergesort (e.g., [17]). If there are no elements in A, we are done since the input
sequence is already sorted. Otherwise we are left with the unsorted subsequence A and
with a set M of r = Θ(n/log n) pairs of distinct elements, that is,
M = {(Q′[1], Q″[1]), (Q′[2], Q″[2]), . . . , (Q′[r], Q″[r])},
where Q′ = J′ P′ and Q″ = P″ J″.
The starting addresses of Q′ and Q″ can be maintained in two locations of auxiliary
memory (we can use O(1) auxiliary locations) and so, for any i, we can retrieve the
addresses of the elements of the ith pair in O(1) operations. Therefore, we can view M
as a collection of encoding words of t bits each, for any t. Those encoding words can be
used pretty much as if they were normal ones. We have to pay attention to the costs of
using encoding bits or encoding words, though: reading an encoding word of t bits takes
t comparisons, changing it costs t comparisons and O(t) data moves in the worst case
or O(1) moves amortized if we perform a sufficiently long sequence of increments by
one (see [2], the binary counter analysis). It is worth noting that we could have chosen
the ranks of π′ and π″ as cr and n − cr + 1 for any constant c, so that the number of
encoded bits would be cr without changing the asymptotic complexity bounds of the
algorithm.
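A sketch of how an encoding word is read and written (hypothetical Python; `pairs` lists the index pairs that hold the word's bits, most significant first). Reading a t-bit word costs t comparisons; writing costs t comparisons and at most t swaps:

```python
def read_word(seq, pairs):
    """Read a t-bit encoded word: one comparison per pair.
    Increasing order of a pair encodes 0, decreasing order encodes 1."""
    value = 0
    for i, j in pairs:
        value = (value << 1) | (0 if seq[i] < seq[j] else 1)
    return value

def write_word(seq, pairs, value):
    """Write a t-bit value: t comparisons and O(t) swaps in the worst case."""
    t = len(pairs)
    for k, (i, j) in enumerate(pairs):
        bit = (value >> (t - 1 - k)) & 1
        if (0 if seq[i] < seq[j] else 1) != bit:
            seq[i], seq[j] = seq[j], seq[i]
```

In the paper's cost model, each operation on an encoding word therefore carries an O(t) slowdown with respect to a normal word.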
Therefore, if m is the size of A, we can make the following assumption:
Hence, if we are able to solve the following problem over the sequence A, we are able
to solve the original problem.
In the following sections we use the auxiliary encoding memory M as normal auxiliary
memory for numeric values. We will declare explicitly any new auxiliary data (indices,
pointers. . . ) stored in M.
In this section we show how to go from sequence A to J P C B such that:
Property 1.
(i) For any elements j ∈ J , p ∈ P and q ∈ C B, we have that j < p < q.
(ii) J is in sorted order.
(iii) The element with rank r/log² m + 1 in A is in P together with all the other
elements equal to it.
(iv) Let d be the number of distinct elements in C B; B contains
b = min(d, |C B|/log³ m)
distinct elements.
(v) Any two equal elements in J P C B are in the same relative order as in A.
After we show how to obtain the new sequence J P C B satisfying Property 1
within our target bounds, we will be left with the problem of sorting C B. The elements
in B will be used in Sections 5 and 6 as in the technique of internal buffering [11].
Basically, some of the elements are used as placeholders to simulate a working area in
which the other elements can be permuted efficiently. If the placeholders are not distinct,
stability becomes a difficult task since the original order of identical placeholders can be
lost using the simulated working area. The elements in B are distinct so we do not have
to worry, as long as we can sort O(|C|) elements with o(|C|) placeholders (Section 5).
However, as we will see in Section 6, if |B| is too small, we have to also use a larger
internal buffer whose entire original order, not only the relative order of equal elements,
has to be preserved.
4.1.1. First Phase: Collecting Some Placeholders. In the first phase we extract some
elements, possibly non-distinct, that will help in the process of collecting the set of
distinct elements that will reside in B.
First, we select the element of rank r/log² m + 1 in A. Then we partition A
according to that element.
We obtain a new sequence A′ P A″ that clearly satisfies points (i) and (iii) in
Property 1, with J = A′ and C B = A″. The selection and the partitioning can be done
in place and stably using once again the linear-time algorithms proposed by Katajainen
and Pasanen [7], [8]. Therefore, point (v) in Property 1 is also satisfied. If A″ is void, we
sort A′ using the in-place, stable mergesort and we are done. Otherwise, we leave A′ as
it is and we proceed with the second phase.
4.1.2. Second Phase: Collecting the Distinct Elements. Throughout this phase we con-
tinue to denote by A the evolving sequence of m elements. We have that A = A′ P A″
right after the first phase. Let us denote by h the index of the rightmost location of P.
We maintain two indices i′ and i″, initially set, respectively, to 1 and m. The following
steps are repeated until i′ > |A″|/log³ m or i″ = h:
4.1.3. Third Phase: Collecting the Placeholders Back. After the second phase, the first
b elements residing in A′ P at the end of the first phase are scattered in the subsequence
A[h + 1 · · · m]. Therefore, point (v) in Property 1 is no longer satisfied by the current
sequence A. We have to collect them back.
First, we partition the subsequence A[h + 1 · · · m] according to A[h]. Let A‴ C be
the resulting sequence where, for any a ∈ A‴ and c ∈ C, we have that a ≤ A[h] < c.
We once again use the linear-time, stable partitioning algorithm from [8].
Then we reverse A‴, recovering the original order holding before the second phase,
we sort it using the stable, in-place mergesort and we exchange it with A[1 · · · b].
After that, the resulting sequence A respects all the points in Property 1.
Proof. First phase. We apply the stable, in-place, linear-time selection and partitioning
algorithms proposed in [7] and [8]. If we already have to sort the elements in A′ (because
nothing has to be done for A″), we can use the normal stable, in-place mergesort, since
|A′| = O(m/log³ m).
Second phase. The cycle is iterated O(m) times, hence the total cost of the invo-
cations of SEARCH is O(m · Xs) comparisons, where Xs is the cost of a single SEARCH.
During the cycle, Step 2 can be executed O(r/log² m) times and, excluding the costs of
PROCESS, its complexity is O(1). Therefore the total cost of Step 2 is o(r) = o(m)
comparisons and moves. Step 3 contributes another O(m) arithmetic operations.
Third phase. It consists simply of one application of the partitioning algorithm in
[8], a constant number of applications of block reversing and exchanging, and the final
application of the stable, in-place mergesort to the first b = O(m/log³ m) elements
in A.
Problem 2. We are given two disjoint sets: R′ with routing elements and F′ with filler
elements. The following hypotheses hold:
(i) Routing and filler elements belong to the same totally ordered, possibly infinite
universe.
(ii) At any time we are presented with a new routing element to be included in R′.
(iii) At most ρ < m elements will be included in R′.
(iv) At the beginning |R′| = 1, the unique routing element is in the first location,
followed by the fillers.
(v) At any time |F′|/|R′| > log ρ. The possible growth of F′ is not a concern, as
new filler elements will eventually be added after the current last element.
(vi) We can use an auxiliary memory M̂ of Θ(ρ) words of log ρ bits each to store
auxiliary data.
The task is to manage the growth of R′ so that the following properties hold:
(a) At any time R′ and F′ are stored in a zone Z of |R′| + |F′| contiguous memory
locations.
(b) At any time R′ can be searched with O(log|R′|) comparisons and a constant
number of accesses to M̂.
(c) When R′ is complete, the total number of comparisons, moves and accesses to
M̂ performed is O(|R′| log² ρ).
know that, for any |R′|, there are enough filler elements to create the new t segments.
Therefore, property (a) of Problem 2 holds.
Finally, with a simple analysis, similar to the one in [6], it can be proved that prop-
erty (c) of Problem 2 holds. After the redistribution of the elements in the subsequence
associated with v, for each descendant u of v we have that d(u) ≤ 2^l(u) · τ_l(v). In particular
that holds for the children of v. Before the rebalancing, there was a child u′ of v such
that d(u′) > 2^l(u′) · τ_l(u′). Therefore, before v needs to be rebalanced again there will
have to be at least
2^l(u′) (τ_l(u′) − τ_l(v))
insertions in the subsequence associated with any child u′ of v. Since the rebalancing
of v costs O(2^l(v) · τ_l(v)), we have that the amortized cost, relative to level l(v), of the
insertion that triggered the rebalancing is
O( 2^l(v) · τ_l(v) / ( 2^l(u′) (τ_l(u′) − τ_l(v)) ) ) = O(τ_l(v)).
Since there are log t levels, the complete amortized cost is O(σ log t) = O(log ρ log t) =
O(log² ρ). Therefore we can conclude that Problem 2 is solved.
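The amortized accounting above can be restated as a compact derivation (same symbols as the text; the first bound uses the fact that u′ is a child of v, so 2^l(v) and 2^l(u′) differ only by a constant factor):

```latex
% amortized charge at level l(v): rebalancing cost spread over the
% insertions that must precede the next rebalancing of v
\frac{2^{l(v)}\,\tau_{l(v)}}{2^{l(u')}\bigl(\tau_{l(u')}-\tau_{l(v)}\bigr)}
  = O\!\left(\tau_{l(v)}\right) = O(\sigma)
% summing the O(\sigma) charge over the \log t levels of the tree:
O(\sigma\,\log t) = O(\log\rho\,\log t) = O(\log^{2}\rho)
```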
4.2.2. The Structure. In order to manage the growth of the set of buffer elements in the
second phase of the buffer extraction algorithm, we have to solve the following problem.
Let us give our solution for Problem 3. Let B′ be the (growing) zone of memory in
which B will be maintained. The auxiliary encoding memory M has Θ(r/log m) words
of log m bits each. Since |B′| ≤ r/log² m, it is easy to associate with every x ∈ B′
a constant number h of words of auxiliary data. We can allocate in M an array IB of
r/log² m entries of h words each and maintain the following invariant:
The element in the ith position in B′ has its auxiliary data stored in the ith
entry of IB. (1)
For the sake of description, we skip that kind of detail in the algorithm and implicitly
assume that an element is always moved together with its encoded O(1) words of
auxiliary data. (We consider the unusual cost model for M in the analysis.)
At any time, B′ is divided into two contiguous zones R and H. The elements of the
routing level are in R (together with some elements of the collection level that will act
as fillers, as we will see).
Buckets. Each routing element a is associated with a set β(a) of elements in the collec-
tion level that we will call a bucket. Let a′ and a″ be two consecutive (in the sorted order)
routing elements: for each x ∈ β(a′) we have that a′ < x < a″. Concerning the
number of elements in a bucket, we have that 4 log m ≤ |β(a)| ≤ 8 log m.
Let us focus on a single bucket β and let evenβ and oddβ be the set of the elements
of β with even and odd rank, respectively. The elements in oddβ are stored in sorted
order in a contiguous zone of memory in H while the elements in evenβ are stored in
R and may be scattered. They will play a role similar to the one of buffer elements but
more powerful because they will be searchable at any moment of the lifetime of the
structure.
Pointers are used to keep track of the elements in evenβ: each o′ ∈ oddβ has a pointer
to its successor succ(o′) ∈ evenβ, and succ(o′) has a pointer back to o′. (Obviously, if |β| is
odd, the successor of the largest element in oddβ does not belong to the set evenβ.)
The routing element a associated with β has a pointer to the location of oddβ and
oddβ has a pointer to a. Assuming that we are able to maintain the layout we just
introduced for the buckets, we can show a way to satisfy property (b) of Problem 3.
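Such a search can be sketched as follows (a hypothetical flat-memory model in Python: `odd` is the sorted, contiguous oddβ zone and the dictionary `succ` stands in for the encoded successor pointers; in the structure itself the pointers live in M and each access carries the encoded-memory cost):

```python
import bisect

def bucket_search(odd, succ, x):
    """Decide whether x belongs to the bucket.
    odd:  the odd-rank elements, sorted and contiguous.
    succ: successor pointers, mapping each o in odd to its even-rank
          successor (absent for the largest element when |bucket| is odd).
    Returns True iff x is in the bucket."""
    k = bisect.bisect_left(odd, x)
    if k < len(odd) and odd[k] == x:
        return True
    # x falls strictly between odd[k-1] and odd[k]; the only possible
    # match is the even-rank successor of odd[k-1]
    if k > 0 and succ.get(odd[k - 1]) == x:
        return True
    return False
```

Since consecutive odd-rank elements bracket exactly one even-rank element, one binary search plus a single pointer access suffices.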
Sub-zones. H has to accommodate the set oddβ of every bucket β. By the upper and
lower bounds on the number of elements in a bucket, we know that 2 log m ≤ |oddβ| ≤
4 log m for any oddβ.
All the oddβ of size i are maintained in a contiguous sub-zone H_{i−2 log m+1} of H (the
sub-zone Hj contains sets of size j + 2 log m − 1). Therefore, there are z = 2 log m + 1
sub-zones. We have that H = H1 H2 · · · Hz−1 Hz, that is, the sub-zones are in increasing
order by the size of the sets they contain.
For any Hi, the first set oddβ may be rotated, that is, it may have its first i +
2 log m − 1 − j elements at the right end of Hi and the last j at the left end, for some j
called the index of rotation of Hi.
Some sub-zones may be void. For any zone, we store in M its starting address in
H and its index of rotation. If the indices of rotation are known, the particular case of a
routing element a having a bucket β with the set oddβ in the first position of its sub-zone
can be treated simply (e.g., with an extra flag for each routing element to recognize the
particular case).
Basic operations. We are going to need two basic operations on the sub-zones:
SLIDE BY ONE(i) and MOVE BACK(o).
• With SLIDE BY ONE(i) all the sub-zones Hj with j ≥ i are rotated by one position
to the right (assuming that there is a free location at the right end of H ). The
execution of this operation is obvious.
• With MOVE BACK(o) the set o of size 4 log m (the maximum size possible) is
moved from Hz to H1:
1. o is exchanged with the second set in Hz.
2. o is exchanged with the portion of the first set in Hz residing at the left end
(Hz can have rotation index > 0).
3. For each Hi, 2 ≤ i ≤ z − 1, o is exchanged with the first 4 log m elements
of Hi.
4. o is exchanged with the portion (if any) of the first set of H1 residing at the
right end (H1 can also have rotation index > 0).
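The rotation machinery behind SLIDE BY ONE can be sketched in a hypothetical flat model (Python lists stand in for the memory zone and for the starting addresses and rotation indices kept in M; sub-zones are 0-based here). The point of the sketch is that each affected sub-zone costs one element move plus an index update, not one move per element:

```python
def slide_by_one(H, starts, rot, i):
    """Rotate every sub-zone j >= i one position to the right, assuming the
    cell H[starts[-1]] just past the last sub-zone is free.
    starts[j]: first cell of sub-zone j (starts[-1] marks the free cell).
    rot[j]:    index of rotation of sub-zone j."""
    z = len(starts) - 1
    for j in range(z - 1, i - 1, -1):
        size = starts[j + 1] - starts[j]
        H[starts[j + 1]] = H[starts[j]]   # leftmost cell goes to the right end
        rot[j] = (rot[j] - 1) % size      # logical order is unchanged
    for j in range(i, z + 1):
        starts[j] += 1                    # every affected sub-zone slides right

def logical(H, starts, rot, j):
    """Return the elements of sub-zone j in logical (unrotated) order."""
    s, size = starts[j], starts[j + 1] - starts[j]
    return [H[s + (rot[j] + k) % size] for k in range(size)]
```

A usage example: with sub-zones [1, 2] and [3, 4, 5] and a free cell at the right end, sliding from sub-zone 0 moves only two elements, and both sub-zones still read out in their original logical order afterwards.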
Lemma 3. The operations SLIDE BY ONE(i) and MOVE BACK(o) can be executed with
O(log³ m) moves and comparisons.
Maintaining the invariants for the collection level. Let us show how the invariants on
B introduced so far can be maintained when an element u has to be inserted in a bucket
β associated with a routing element a placed somewhere in zone R.
Let us suppose |β| = p < 8log m, and let i be the rank of u in β ∪ {u}. There are
two phases: in the first phase we reorganize the space to make room for the new element;
in the second phase we rearrange the elements of β, since the arrival of u may change
oddβ and evenβ substantially.
• Space reorganization. If p is odd then evenβ grows by one element and oddβ keeps
its size. We invoke SLIDE BY ONE(1) to free a location between R and
H and put u in that location temporarily.
Otherwise, if p is even then oddβ grows by one element and evenβ keeps its
size.
1. We invoke SLIDE BY ONE(p/2 − 2 log m + 2) in order to have a free location
between H_{p/2−2 log m+1} (the sub-zone that contained oddβ before the insertion
of u in β) and H_{p/2−2 log m+2} (the new sub-zone of oddβ after the insertion)
and we put u in the free location temporarily.
2. We exchange oddβ with the last set in H_{p/2−2 log m+1}.
3. We exchange oddβ with the portion (if any) of the first set of H_{p/2−2 log m+1}
residing at the right end of the sub-zone.
4. After oddβ is joined with u, we exchange oddβ ∪ {u} with the portion of the
first set in H_{p/2−2 log m+2} residing at the left end of the new sub-zone.
Lemma 4. Maintaining the invariants for the collection level costs O(log³ m) moves
and comparisons in the worst case.
Proof. When |β| = p < 8 log m, the cost of SLIDE BY ONE(i) dominates the com-
plexity of the space reorganization. As for the rearrangement, we have to
access O(log m) pointers in M. Therefore, by Lemma 3, when we have to insert an
element in a bucket that is not full, we pay O(log³ m) moves and comparisons in the
worst case.
When β is full, the cost of the execution of MOVE BACK(oddβ) dominates the
complexity of the space reorganization. For the rearrangement, we again have to
access O(log m) pointers in M. Therefore, by Lemma 3, when we have to insert an
element in a bucket that is full, we pay O(log³ m) moves and comparisons to split the
bucket.
The routing level. With the organization of R H presented so far, we are able to satisfy
all the hypotheses of Problem 2. As expected, R′ contains the routing elements produced
by splitting the buckets and F′ contains the elements in ⋃β evenβ.
Hypotheses (i)–(iv) are obviously satisfied. For each routing element there is a bucket
β with at least 4 log m elements, and at least 2 log m of those elements belong to F′.
Hence, Hypothesis (v) is satisfied. Concerning Hypothesis (vi), we can use the encoded
memory M.
Therefore, the solution to the abstract Problem 2 can be used and we are able to
manage the growth of the set of routing elements so that the following properties hold:
• At any time, all the routing elements and the ones in ⋃β evenβ are maintained
compactly in zone R.
• At any time, the routing level can be searched with O(log m) comparisons
and a constant number of accesses to M (and hence O(log m) comparisons in
total).
• In the routing level there is a slowdown factor of O(log m) because we use the
auxiliary encoded memory. However, we know that
|R′| < ρ = O(|B′|/log m).
Therefore, when the routing level is complete, the total number of comparisons
and moves performed building it is O(|B′| log² m) = O(m).
Joining those properties and Lemmas 2 and 4 we can conclude that Problem 3 is
solved. Therefore, by Lemma 1, we have that:
In this section we show how to sort the subsequence C B of a sequence J P C B sat-
isfying Property 1 and with b = |B| = |C B|/log³ m. First, in Section 5.1, we show
how to sort b elements stably, using O(1) auxiliary space, with O(b log m) comparisons
and O(b) moves, under Assumption 1 and using another b distinct elements as
placeholders (from B). Then, in Section 5.2, we show how to sort C B using the technique in
Section 5.1 and a multi-way stable merging technique requiring a very limited number
of placeholders.
obtained merging the subsequences in-place, stably and in linear time (e.g., using the
merging algorithm described in [17]).
To sort each Di , we use a structure with the same basic subdivision of the one in
Section 4.2: a routing level, directing the searches, and a collection level, containing
the majority of the elements. After all the elements in Di are inserted, the structure is
traversed to move the elements back in Di stably and in sorted order.
5.1.1. The Structure. First we describe the logical organization of the collection level.
Then we show how to embed the collection level into the internal buffer B and how to
store in M its auxiliary data. Finally, we show how to maintain the invariants and how,
even in this case, the routing level can be seen as a particular instance of the abstract
Problem 2.
The collection level. Each routing element a is associated with a small balanced search
tree T(a) in the collection level. Let a′ and a″ be two consecutive (in the sorted order)
routing elements: for each x ∈ T(a′) we have that a′ ≤ x ≤ a″.
Let us consider a generic tree T. Concerning the number of elements in a leaf l of
T, we have that
R, C and C′ will be devoted to the embedding of the routing elements, the internal nodes
and the leaves of the trees of the collection level, respectively. W will be used as a working area.
1. Find the rank r(x) of x in the sequence of elements of the leaf l (a simple scan of l).
2. For i = 2 log m + 1 down to 1:
(a) If i = r(x) then exchange x with W[i] (a placeholder).
(b) Otherwise, find the element y with maximum rank (among the ones still in l)
by scanning l and its bit mask, and set the bit of y to zero. Then, exchange y
with W[i] (again, a placeholder).
3. Exchange the first log m elements in l with W[1 · · · log m], and set the bit
mask of l to 1^log m 0^log m.
4. Exchange the first log m elements in l′ with W[|W| − log m + 1 · · · |W|],
and set the bit mask of l′ to 1^log m 0^log m.
After that, we have to insert the element x of medium rank (that is still located in
the (log m + 1)th position of W) into the parent of l; let it be u. If u is not full, we simply
follow the inner linked list of u until the rightmost (in the list order, which is also the sorted
order) element x′ less than or equal to x is found. Then, we insert x after x′ in the list
and set its child pointer to the starting address of l′.
Lemma 5. Under Assumption 1, the data structure can be built using O(1) auxiliary
space, O(b log m) comparisons and O(b) moves.
Proof. T has O(1) levels and each internal node has O(√log m) elements. Hence,
the total number of comparisons we pay to scan the linked lists during the search for
the position of x is O(√log m log log m). Scanning the bit mask of l costs O(log m)
comparisons. Therefore, we pay O(log m) comparisons in order to find the position of
x in T.
If l is not full, the insertion of x in it costs only O(1) moves, since we have to modify
a bit of the bit mask and exchange x with the placeholder in l.
If l is full, let us analyze the steps of the procedure for splitting a leaf. Step 1 is
a simple scan and it takes O(log m) comparisons. Step 2(a) is just a comparison of
integer values. Step 2(b) takes a scan with O(log m) comparisons and O(1) moves.
Steps 2(a) and 2(b) are executed O(log m) times and then the total cost of step 2 is
O(log2 m) comparisons and O(log m) moves. Finally, steps 3 and 4 are simple scans
and exchanges and take O(log m) comparisons and moves.
Then we have to insert x in u, the parent of l. If u is not full we have to pay
O(√log m log log m) comparisons to follow its inner list and O(log m) moves to update
the pointers.
If u is full, let us analyze the steps of the procedure for splitting an internal node. We
have to remember that every time an element of an internal node is moved, its auxiliary
encoded information in IC is moved too.
Step 1 scans the inner list of u and exchanges one element of u; that takes
O(√log m log log m + log m) comparisons and O(log m) moves. Steps 2 and 3 are
just a series of exchanges of the remaining elements in u and they take O(√log m · log m)
moves and comparisons (remember, also the auxiliary encoded data in IC is moved).
Finally, steps 4 and 5 do exchanges of O(√log m) elements formerly contained in u and
initialize two inner lists. Hence they cost O(√log m · log m) moves and comparisons.
After the splitting of u, the process is iterated for its O(1) ancestors. Given the
above worst-case costs and the fact that each leaf and each internal node has, respec-
tively, O(log m) and O(√log m) elements, it is easy to derive the amortized costs of
inserting an element into T: O(log m) comparisons and O(1) moves
in the amortized sense.
If even the root of T has to be split, we insert a new routing element in R. To
organize R we use the solution to Problem 2 in Section 4.2.1. All the hypotheses of
Problem 2 are satisfied. Obviously, R′ contains the elements produced by splitting a tree
in the collection level and F′ the placeholders initially in R. Hypotheses (i)–(iv) are
easily satisfied. For each routing element, there is a tree T with at least log³ m elements,
therefore Hypothesis (v) is satisfied. We use the auxiliary encoded memory M to satisfy
Hypothesis (vi).
Given the cost model in Assumption 1 and the solution to Problem 2 in Section 4.2.1,
the thesis follows.
5.1.2. Traversing the Structure. After the construction of the structure, Di contains
placeholders. Traversing the structure is pretty standard. We maintain five pointers
pr , p1 , p2 , p3 , p4 in auxiliary memory; pr points to the rightmost visited routing el-
ement in R and p1 , p2 , p3 , p4 point to the internal nodes in the current visiting path of
the tree of the routing element pointed by pr . For each p j , we have to maintain a small
pointer s j to the rightmost visited element in the internal linked list of the node pointed
by p j . Actually, any pointed element (by pr or by any s j ) is immediately exchanged with
the leftmost placeholder in Di , and only its auxiliary encoded data is still accessible to
guide the visit.
Each leaf l is visited in the following way:
By the cost model in Assumption 1, it is immediate to prove that the traversing phase
ends with all the elements back in Di in stable sorted order and that the whole travers-
ing phase takes O(1) auxiliary locations, O(b log m) comparisons and O(b) moves.
Therefore, by Lemma 5 we can conclude that:
Lemma 6. Under Assumption 1, b elements can be sorted stably, using O(1) aux-
iliary space and another set of b distinct elements as placeholders, with O(b log m)
comparisons and O(b) moves.
Problem 4. We have
• s ≤ log m/log log m sorted sequences E1, . . . , Es of k ≤ m/s elements each and
• a set U of s(log m)² distinct elements.
Under Assumption 1, we want to sort the sk elements stably, using O(1) auxiliary
locations, with O(sk log m) comparisons and O(sk) moves.
In this section we show how to sort the subsequence C B of a sequence J P C B
satisfying Property 1 and with b = |B| < |C B|/log³ m, that is, when the number d of
distinct elements in C B is less than |C B|/log³ m. First, we solve a general problem
in Section 6.1. Then, in Section 6.2, we show how the solution to the general problem
can be used to sort C B.
The abstract problem can be seen as the problem of sorting a sequence V with
few distinct elements, having at our disposal (i) a constant-time function helping to discern
between the elements of V and the other ones, and (ii) two kinds of internal buffers:
• The first buffer is small and the order of its elements is not important and can be
lost after the process. Moreover, the number of elements in this buffer is greater
than or equal to the number of distinct elements in V. That sequence would
be D with the elements of set D′. When we use our solution for this abstract
problem to sort the subsequence C B, the role of D′ obviously will be played by
the subsequence of distinct elements B.
• The second buffer is as large as V but the original order of its elements is important
and has to be maintained after V is sorted. That large buffer would be G.
Our solution to Problem 5 has three phases.
6.1.1. First Phase. V is logically divided into |V|/(d(log m)²) contiguous blocks
V1 V2 · · · of d(log m)² elements each. We want to sort each block Vi stably, using O(1)
auxiliary locations, O(|Vi| log m) comparisons and O(|Vi|) moves. This can be accom-
plished in the same way we sorted the sequence C in Section 5:
1. Each sub-block of d contiguous elements of Vi is sorted using the d elements
of D as placeholders (Section 5.1).
2. The (log m)² sorted sub-blocks of Vi are merged with a constant number of
iterations of the multi-way mergesort using the fragmented multi-way merging
6.1.2. Second Phase. After the first phase, each block Vi of V is sorted and divided
into at most d′ ≤ d runs of equal elements. Since |Vi| = d(log m)², the total number
tr of runs in V is less than or equal to t/(log m)². For any run, let the first element be
the head and the rest of the run be the tail. The second phase has four main steps:
1. Each block Vi is divided into two sub-blocks Hi and Vi′. Hi contains the heads
of all the runs of Vi and Vi′ contains all the tails. Both Hi and Vi′ are in sorted
order. This subdivision can be accomplished in a linear number of moves with
at most d applications of the well-known in-place block-exchanging technique
(recalled in Section 4.1).
Let ir be the number of runs of Vi. Let h1, . . . , hir be the heads we have to
collect, indexed from the leftmost to the rightmost in Vi.
Let U1, . . . , Uir be the subsequences of Vi that separate h1, . . . , hir, that is,
Vi = h1 U1 h2 U2 · · · Uir−1 hir Uir
(some of them can be void). We collect h1, . . . , hir in a growing subsequence Hi,
starting from the position of h1. During the process, Hi slides toward the right
end of Vi. The process scans Vi from left to right and therefore the positions of
U1, . . . , Uir and h1, . . . , hir are obtained “on the fly,” during the scan.
Let Hi = h1 and j = 1. The following steps are repeated until j > ir:
(a) If |Hi| ≤ |Uj|, do a block exchange between the two adjacent blocks Hi and
Uj.
(b) Otherwise, let Hi = Hi′ Hi″ with |Hi′| = |Uj|. Exchange Hi′ with Uj (an obvious
exchange of two non-adjacent but equal-sized blocks). After that, Hi = Hi″ Hi′.
(c) In both cases, Hi is now adjacent to hj+1; let Hi = Hi hj+1 and increase j
by one.
Since the elements we are extracting (the heads of the runs of a single block
Vi ) are distinct, we do not care about their original order during the process. We
simply sort them when they are finally collected at the right end of Vi . On the
other hand, the order of the other elements of the runs of Vi is maintained in the
process.
2. Some information about runs and blocks is collected and stored in M.
• An array I H with |V|/(d log² m) entries of two words each is stored in M.
For any i, the first word of I H [i] contains |Hi | and the second word contains
the index of the first run of Vi (the index is between 1 and tr , from the leftmost
run in V to the rightmost).
• An array I R with tr entries of four words each is stored in M. For any i, the
first word of I R [i] is initially set to i, the second contains the address of the
head of the ith (in V ) run, the third contains the starting address of the tail of
the ith run and the fourth contains the size of the ith run.
• Finally, an array I R −1 with tr entries of two words each is stored in M. For any
i, the first word of I R −1 [i] is initially set to i and the second word of I R −1 [i] is
initially set to 1.
All this information can be obtained within our target bounds simply by scanning
V . In general, for any array I of multi-word entries, we will denote the pth word
of the ith entry with I [i][ p].
3. I R −1 is sorted stably by head, that is, at any time of the sorting process, the sorting
key for the two-word value in the ith entry of I R −1 is
V [I R [I R −1 [i][1]][2]].
The sorting algorithm used is mergesort with a linear-time, in-place, stable merg-
ing (e.g., the one described in [17]). During the execution of the algorithm, every
time the two-word value in the ith entry of I R −1 is moved to the jth entry, the
corresponding entry in I R is updated, that is, I R [I R −1 [ j][1]][1] is set to j.
We remark that only the entries of the encoded array I R −1 are moved (where
any abstract move of an encoded value causes O(log m) actual moves of some
elements contained in zones Q and Q defined in Section 3). In this process, the
elements in V are not moved.
4. For i = 2 to tr , let I R −1 [i][2] be I R −1 [i − 1][2] + I R [I R −1 [i − 1][1]][4] (that is, if we
had the elements in V sorted stably into another sequence V′, I R −1 [i][2] would
be the starting address in V′ of the ith run in the stable sorted order).
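The head-collecting scan of step 1 above can be made concrete with the following sketch (ours, for illustration; the function and variable names are assumptions, and for brevity the final ordering of the collected heads uses a library sort rather than the paper's machinery). It slides the growing segment H rightward with block exchanges, implemented by the standard three-reversal rotation recalled in Section 4.1:

```python
def reverse(a, lo, hi):
    """Reverse a[lo:hi] in place."""
    hi -= 1
    while lo < hi:
        a[lo], a[hi] = a[hi], a[lo]
        lo += 1
        hi -= 1

def rotate(a, lo, mid, hi):
    """In-place block exchange: a[lo:mid] and a[mid:hi] trade places,
    each keeping its internal order (three-reversal trick)."""
    reverse(a, lo, mid)
    reverse(a, mid, hi)
    reverse(a, lo, hi)

def collect_heads(a):
    """Move every run head of the sorted list a to its right end.
    Tail elements keep their relative order; the heads, being
    distinct, are simply sorted once collected. Returns the
    boundary between the tails and the heads."""
    n = len(a)
    h_lo, h_hi = 0, 1                     # H = a[h_lo:h_hi], initially [h_1]
    cur = a[0]                            # value of the current head h_j
    while h_hi < n:
        k = h_hi
        while k < n and a[k] == cur:      # U_j: tail elements equal to h_j
            k += 1
        ulen = k - h_hi
        if h_hi - h_lo <= ulen:           # case (a): exchange H with U_j
            rotate(a, h_lo, h_hi, k)
        else:                             # case (b): swap the prefix H'
            for t in range(ulen):         # (|U_j| heads) with U_j
                a[h_lo + t], a[h_hi + t] = a[h_hi + t], a[h_lo + t]
        h_lo += ulen                      # H slid right past U_j
        if k == n:
            h_hi = n
        else:                             # case (c): absorb h_{j+1} into H
            cur = a[k]
            h_hi = k + 1
    a[h_lo:] = sorted(a[h_lo:])           # heads are distinct
    return h_lo
```

Case (b) swaps only |Uj| elements, so each exchange costs O(|Uj|) moves regardless of |Hi|; summed over a block the process stays linear, even though the heads inside H are temporarily permuted.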
6.1.3. Third Phase. After the second phase we are able to evaluate the function αV :
{1, . . . , t} → {1, . . . , t} such that αV ( j) is the rank of the element V [ j] in the stably
sorted sequence, performing O(log m) comparisons per evaluation.
1. Let Vi be the block of V [ j]. We know where Hi starts and ends: |Hi| is stored in
I H [i][1] and, after the second phase, Hi occupies the last |Hi| positions of Vi.
Therefore, we can perform a binary search for V [ j] in Hi and find the index p j
in Vi of the run to which V [ j] belongs.
2. The index p′j in V of the run of V [ j] is I H [i][2] + p j − 1.
3. Using the array I R , we can find the position k j of V [ j] in its run: if j = I R [ p′j ][2],
then V [ j] is the head of its run; otherwise, V [ j] belongs to the tail of its run.
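Putting the three steps together, one evaluation of αV can be sketched as follows. This reconstruction is ours and uses 0-based Python indices in place of the paper's 1-based words; in particular, the final combination (reading the run's rank in the stable order from I R and that run's starting address from I R −1, then adding the offset inside the run) is our plausible completion under the array layout of the second phase, not text taken from the paper:

```python
from bisect import bisect_right

def alpha(V, I_H, I_R, I_Rinv, block_size, j):
    """Rank of V[j] in the stably sorted order: one binary search
    over at most d heads plus O(1) table lookups, i.e. O(log m)
    comparisons per evaluation. 0-based throughout: I_H[i] =
    (|H_i|, index of the first run of V_i); I_R[p] = (rank of run p
    in the stable order, head address, tail start address, run
    size); I_Rinv[q] = (V-index of the qth sorted run, its starting
    address in the stable order)."""
    i = j // block_size                        # block V_i containing V[j]
    h_len = I_H[i][0]                          # |H_i|
    h_lo = (i + 1) * block_size - h_len        # H_i ends its block
    # Step 1: binary search for V[j] among the heads of V_i.
    p_local = bisect_right(V, V[j], h_lo, h_lo + h_len) - h_lo
    # Step 2: index (in V) of the run of V[j].
    p = I_H[i][1] + p_local - 1
    run = I_R[p]
    # Step 3: position of V[j] inside its run.
    if j == run[1]:                            # V[j] is the head
        offset = 0
    else:                                      # V[j] lies in the tail
        offset = 1 + (j - run[2])
    q = run[0]                                 # rank of run p (phase 2, step 3)
    return I_Rinv[q][1] + offset               # run start in stable order + offset
```

For example, with a single block [1, 1, 2, 3] holding the tail [1] followed by the heads [1, 2, 3], the tables built in the second phase let `alpha` recover each element's stable rank without moving any element of V.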
7. Conclusion
By Theorems 1–3 we can conclude that Problem 1 is solved and state the main result of
this paper.
Theorem 4. Any sequence of n elements can be sorted stably, using O(1) auxiliary
locations of memory, performing O(n log n) comparisons and O(n) moves in the worst
case.
This settles a long-standing open question explicitly stated by Munro and Raman
in [13]. Before the introduction of this algorithm, the best-known solution for stable,
in-place sorting with O(n) moves was the one presented in [14], performing O(n^{1+ε})
comparisons in the worst case.
References
[1] H. Bing-Chao and D. E. Knuth. A one-way, stackless quicksort algorithm. BIT, 26(1):127–130, 1986.
[2] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press,
Cambridge, MA, 2001.
[3] B. Ďurian. Quicksort without a stack. In J. Gruska, B. Rovan, and J. Wiedermann, editors, Proceedings of
the 12th Symposium on Mathematical Foundations of Computer Science, volume 233 of LNCS, pages
283–289. Springer-Verlag, Berlin, 1986.
[4] G. Franceschini and V. Geffert. An in-place sorting with O(n log n) comparisons and O(n) moves. In
Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages
242–250, 2003.
[5] C. A. R. Hoare. Quicksort. Comput. J., 5(1):10–16, April 1962.
[6] A. Itai, A. G. Konheim, and M. Rodeh. A sparse table implementation of priority queues. In ICALP ’81,
volume 115 of LNCS, pages 417–431. Springer-Verlag, Berlin, 1981.
[7] J. Katajainen and T. Pasanen. Sorting multisets stably in minimum space. In O. Nurmi and E. Ukkonen,
editors, SWAT ’92, volume 621 of LNCS, pages 410–421. Springer-Verlag, Berlin, 1992.
[8] J. Katajainen and T. Pasanen. Stable minimum space partitioning in linear time. BIT, 32(4):580–585,
1992.
[9] J. Katajainen and T. Pasanen. In-place sorting with fewer moves. Inform. Process. Lett., 70:31–37, 1999.
[10] D. E. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley,
Reading, MA, 1973.
[11] M. A. Kronrod. Optimal ordering algorithm without operational field. Soviet Math. Dokl., 10:744–746,
1969.
[12] J. I. Munro. An implicit data structure supporting insertion, deletion, and search in O(log² n) time.
J. Comput. System Sci., 33:66–74, 1986.
[13] J. I. Munro and V. Raman. Sorting with minimum data movement. J. Algorithms, 13:374–393, 1992.
[14] J. I. Munro and V. Raman. Fast stable in-place sorting with O(n) data moves. Algorithmica, 16:151–160,
1996.
[15] J. I. Munro and V. Raman. Selection from read-only memory and sorting with minimum data movement.
Theoret. Comput. Sci., 165:311–323, 1996.
[16] L. T. Pardo. Stable sorting and merging with optimal space and time bounds. SIAM J. Comput., 6(2):351–
372, June 1977.
[17] J. Salowe and W. Steiger. Simplified stable merging tasks. J. Algorithms, 8(4):557–571, December 1987.
[18] L. M. Wegner. A generalized, one-way, stackless quicksort. BIT, 27(1):44–48, 1987.
[19] J. W. J. Williams. Heapsort (Algorithm 232). Comm. ACM, 7:347–348, 1964.