Chapter 7: External Sorting
The Need for External Sorting
Many sorting algorithms have been proposed, some
more efficient than others
How do we sort data that does not fit into main memory?
Merge sort to the rescue!
Basis for most external sorting routines
Sort any number of records using a tiny amount of main
memory
Extreme (base) case: memory can fit two data items only
Main idea:
Load data into memory one chunk at a time
Sort each chunk (sorted chunk = run)
Use the merge algorithm to combine runs!
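The main idea above can be sketched in Python as follows. This is a minimal illustration rather than the slides' own code: the names `external_sort` and `write_run`, the temporary text files holding one integer per line, and the choice to merge all runs at once with `heapq.merge` are assumptions made for the example.

```python
import heapq
import tempfile
from itertools import islice

def write_run(sorted_chunk):
    """Write one sorted chunk (a run) to a temporary file, one integer per line."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{x}\n" for x in sorted_chunk)
    f.seek(0)  # rewind so the run can be read back during the merge
    return f

def external_sort(records, chunk_size):
    """Yield the integers in `records` in sorted order, sorting chunk_size items at a time."""
    it = iter(records)
    runs = []
    try:
        # Load one chunk at a time, sort it in memory, and store it as a run.
        while True:
            chunk = sorted(islice(it, chunk_size))
            if not chunk:
                break
            runs.append(write_run(chunk))
        # heapq.merge consumes its inputs lazily, so the merge step never
        # needs to hold a whole run in memory.
        streams = [(int(line) for line in f) for f in runs]
        yield from heapq.merge(*streams)
    finally:
        for f in runs:
            f.close()

data = [81, 94, 11, 96, 12, 35, 17, 99, 28, 58, 41, 75, 15]
print(list(external_sort(data, chunk_size=3)))
```

The later slides refine the merge step: the sketch merges every run in one go, whereas the simple, multiway, and polyphase variants differ in how many runs are merged at a time and how many files are used.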
Merge-Sort(A)
    if length(A) > 1 then
        Copy the first half of A into array A1       // Divide
        Copy the second half of A into array A2
        Merge-Sort(A1)                               // Conquer
        Merge-Sort(A2)
        Merge(A, A1, A2)                             // Combine
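A runnable Python version of the pseudocode might look like the sketch below. The function names `merge_sort` and `merge` mirror the pseudocode; the body of `merge` is not given on the slide, so the version here is one common way to write it.

```python
def merge(a, a1, a2):
    """Combine: merge the sorted lists a1 and a2 back into a."""
    i = j = k = 0
    while i < len(a1) and j < len(a2):
        if a1[i] <= a2[j]:
            a[k] = a1[i]
            i += 1
        else:
            a[k] = a2[j]
            j += 1
        k += 1
    # One of the two halves is exhausted; copy whatever remains of the other.
    a[k:] = a1[i:] + a2[j:]

def merge_sort(a):
    """Sort the list a in place."""
    if len(a) > 1:
        mid = len(a) // 2
        a1 = a[:mid]        # Divide: copy the first half
        a2 = a[mid:]        # Divide: copy the second half
        merge_sort(a1)      # Conquer
        merge_sort(a2)
        merge(a, a1, a2)    # Combine

data = [10, 2, 5, 1, 13, 19, 9, 7, 15, 4, 8, 3, 12, 17, 6, 11]
merge_sort(data)
print(data)
```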
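Merge-sort tree for the example input (| separates runs; the tree has log2 N levels, and each level holds all N items):
Runs of size 8: 1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17
Runs of size 4: 1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17
Runs of size 2: 2 10 | 1 5 | 13 19 | 7 9 | 4 15 | 3 8 | 12 17 | 6 11
Input:          10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11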
Backtracking (returning from the recursion): merge runs (sorted sequences) of
size x into runs of size 2x, halving the number of runs at each level.
We can’t fit all the data into memory, but we can
fit some of it – what should be stored in RAM?
External-Memory Merge-Sort
Sorted output:  1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
    (two-way merge)
Runs of size 8: 1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17
    (two-way merges)
Runs of size 4: 1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17
Input:          10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11
Before run construction:
File 1: 81 94 11 96 12 35 17 99 28 58 41 75 15
File 2:
File 3:
File 4:
Write sorted runs into separate external files (| separates runs):
File 1:
File 2:
File 3: 11 81 94 | 17 28 99 | 15
File 4: 12 35 96 | 41 58 75
Merge can work directly on the files that store sorted runs
Simple External Merge-Sort
Before merge pass 1:
File 1:
File 2:
File 3: 11 81 94 | 17 28 99 | 15
File 4: 12 35 96 | 41 58 75
After merge pass 1:
File 1: 11 12 35 81 94 96 | 15
File 2: 17 28 41 58 75 99
File 3:
File 4:
Simple External Merge-Sort
Before merge pass 2:
File 1: 11 12 35 81 94 96 | 15
File 2: 17 28 41 58 75 99
File 3:
File 4:
After merge pass 2:
File 1:
File 2:
File 3: 11 12 17 28 35 41 58 75 81 94 96 99
File 4: 15
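Simple External Merge-Sort
Before merge pass 3:
File 1:
File 2:
File 3: 11 12 17 28 35 41 58 75 81 94 96 99
File 4: 15
After merge pass 3:
File 1: 11 12 15 17 28 35 41 58 75 81 94 96 99
File 2:
File 3:
File 4:
⌈log2(N/M)⌉ passes for the merging, + 1 initial pass to construct the runs
(N = number of records, M = number of records that fit in memory; here ⌈log2(13/3)⌉ = 3)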
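The three merge passes above can be reproduced with a short simulation. In this sketch Python lists stand in for the files, `two_way_pass` and `simple_external_merge` are names invented for the example, and the pass count is compared with ⌈log2(N/M)⌉ for N = 13 records and runs of M = 3 records.

```python
import heapq
import math

def two_way_pass(runs):
    """One pass: merge runs pairwise (an unpaired last run is carried over as is)."""
    return [list(heapq.merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]

def simple_external_merge(runs):
    """Repeat two-way passes until one run is left; return (sorted data, number of passes)."""
    passes = 0
    while len(runs) > 1:
        runs = two_way_pass(runs)
        passes += 1
    return (runs[0] if runs else []), passes

# Initial runs of size M = 3 from the example (N = 13 records).
initial_runs = [[11, 81, 94], [12, 35, 96], [17, 28, 99], [41, 58, 75], [15]]
result, passes = simple_external_merge(initial_runs)
print(result)
print(passes, math.ceil(math.log2(13 / 3)))  # both values are 3
```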
Multiway (k-way) Merge-Sort
2-way (simple) merge sort requires ⌈log2(N/M)⌉ merge passes through the
files to sort the data (N records in total, M records fit in memory)
Example on the previous slides: ⌈log2(13/3)⌉ = 3 passes
Fewer passes – faster algorithm (less I/O)
Phase 1: each run is sorted and recorded into one of 2*k
files (here k = 3, so 2k = 6)
File 1:
File 2:
File 3:
File 4: 11 81 94 | 41 58 75
File 5: 12 35 96 | 15
File 6: 17 28 99
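Multiway (3-way) Merge-Sort
Before merge pass 1 (the front element of each run being merged is put in a min-heap):
File 1:
File 2:
File 3:
File 4: 11 81 94 | 41 58 75
File 5: 12 35 96 | 15
File 6: 17 28 99
After merge pass 1:
File 1: 11 12 17 28 35 81 94 96 99
File 2: 15 41 58 75
File 3:
File 4:
File 5:
File 6: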
Multiway (3-way) Merge-Sort
Before merge pass 2:
File 1: 11 12 17 28 35 81 94 96 99
File 2: 15 41 58 75
File 3:
File 4:
File 5:
File 6:
After merge pass 2:
File 1:
File 2:
File 3:
File 4: 11 12 15 17 28 35 41 58 75 81 94 96 99
File 5:
File 6:
Merge passes: ⌈log3(13/3)⌉ = 2
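A k-way merge step with an explicit min-heap could be sketched as below. The function name `k_way_merge` and the use of in-memory lists in place of files are assumptions for the example; the heap holds one front element per run, as in the slide's "put in a min-heap" note.

```python
import heapq

def k_way_merge(runs):
    """Merge any number of sorted runs using a min-heap of their front elements."""
    # Each heap entry is (value, run index, position inside that run).
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, idx, pos = heapq.heappop(heap)   # smallest front element overall
        merged.append(value)
        if pos + 1 < len(runs[idx]):
            # Replace it with the next element from the same run.
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))
    return merged

# First merge pass of the 3-way example: one run from each of Files 4, 5, 6.
print(k_way_merge([[11, 81, 94], [12, 35, 96], [17, 28, 99]]))
print(k_way_merge([[41, 58, 75], [15]]))
```

The heap never holds more than k entries, so memory use depends on the number of runs being merged rather than on their length.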
Polyphase Merge-Sort
What is the problem with the multiway mergesort?
The number of files (or other output devices) = 2 * k
This might be infeasible and/or impractical
Before run distribution:
File 1: 81 94 11 96 12 35 17 99 28 58 41 75 15
File 2:
File 3:
After run distribution (| separates runs):
File 1:
File 2: 11 81 94 | 12 35 96 | 17 28 99
File 3: 41 58 75 | 15
Polyphase Merge-Sort
Before merging:
File 1:
File 2: 11 81 94 | 12 35 96 | 17 28 99
File 3: 41 58 75 | 15
After step 1 (two runs from each of File 2 and File 3 merged onto File 1):
File 1: 11 41 58 75 81 94 | 12 15 35 96
File 2: 17 28 99
File 3:
After step 2 (one run from each of File 1 and File 2 merged onto File 3):
File 1: 12 15 35 96
File 2:
File 3: 11 17 28 41 58 75 81 94 99
After step 3 (File 1 and File 3 merged onto File 2):
File 1:
File 2: 11 12 15 17 28 35 41 58 75 81 94 96 99
File 3:
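The three polyphase steps above can be simulated as follows. This is a sketch under simplifying assumptions: each file is a Python list of runs, exactly one file is empty at every step, and the name `polyphase_merge` is invented for the example.

```python
import heapq

def polyphase_merge(files):
    """Merge runs spread over several files until a single run remains.

    `files` is a list of files, each a list of sorted runs. At every step the
    empty file is the output, and runs are merged until an input file runs dry.
    """
    files = [list(f) for f in files]  # copy so the caller's lists are untouched
    while sum(len(f) for f in files) > 1:
        out = next(i for i, f in enumerate(files) if not f)
        inputs = [i for i in range(len(files)) if i != out]
        steps = min(len(files[i]) for i in inputs)
        if steps == 0:
            # Two files are empty but several runs remain: the runs would
            # have to be redistributed before merging can continue.
            raise ValueError("unsuitable run distribution")
        for _ in range(steps):
            merged = heapq.merge(*(files[i].pop(0) for i in inputs))
            files[out].append(list(merged))
    return files

runs = [[11, 81, 94], [12, 35, 96], [17, 28, 99], [41, 58, 75], [15]]
# Same distribution as the example: File 1 empty, 3 runs on File 2, 2 on File 3.
print(polyphase_merge([[], runs[:3], runs[3:]]))
```

With a 2 + 2 distribution the function raises `ValueError` after the first step, which illustrates why the way the runs are split matters.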
Polyphase Merge-Sort
In the example, we have split 5 runs into 2 + 3 runs
The split matters:
if the runs are distributed between two files incorrectly, there will
be a lot of unnecessary and slow back and forth
How do we know how to split the runs?
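Ask Fibonacci! For a 2-way polyphase merge, the run counts on the two input files should be consecutive Fibonacci numbers (here 5 = 3 + 2)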
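One way to compute such a split is sketched below. The function name `fibonacci_split` is hypothetical, and padding with empty dummy runs when the run count is not a Fibonacci number is a standard remedy that the slides do not cover.

```python
def fibonacci_split(num_runs):
    """Split runs between two input files for a 2-way polyphase merge.

    Returns (a, b, dummy): a and b are consecutive Fibonacci numbers with
    a + b >= num_runs, and dummy is the number of empty placeholder runs
    needed to reach a + b.
    """
    a, b = 1, 0  # consecutive Fibonacci numbers, largest first
    while a + b < num_runs:
        a, b = a + b, a
    return a, b, a + b - num_runs

print(fibonacci_split(5))   # the example: 3 runs on one file, 2 on the other
print(fibonacci_split(6))   # not a Fibonacci number, so dummy runs are needed
```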
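Replacement Selection
File 1: 81 94 11 96 12 35 17 99 28 58 41 75 15
Heap: 11 94 81 (write 11 to the run; the next input is 96, and 11 < 96, so 96 enters the heap)
Heap: 81 94 96 (write 81 to the run; the next input is 12, and 81 > 12, so 12 cannot join the current run)
Heap: 94 96 | 12 (12 is not a part of the heap)
File 2: 11 81 94 96 (the first run, complete once the heap is empty)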
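The run construction traced above can be sketched in Python as follows. The function name `replacement_selection`, the `memory_size` parameter, and the in-memory `pending` list (standing in for the part of memory that is set aside for the next run) are assumptions made for the example.

```python
import heapq
from itertools import islice

_END = object()  # sentinel marking the end of the input

def replacement_selection(records, memory_size):
    """Build sorted runs using a heap of at most memory_size records."""
    it = iter(records)
    heap = list(islice(it, memory_size))
    heapq.heapify(heap)
    runs, current, pending = [], [], []
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)             # write the minimum to the current run
        nxt = next(it, _END)
        if nxt is not _END:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)    # still fits into the current run
            else:
                pending.append(nxt)          # too small: kept for the next run
        if not heap:
            # Current run is finished; the set-aside records seed the next one.
            runs.append(current)
            current = []
            heap, pending = pending, []
            heapq.heapify(heap)
    return runs

data = [81, 94, 11, 96, 12, 35, 17, 99, 28, 58, 41, 75, 15]
for run in replacement_selection(data, memory_size=3):
    print(run)
```

On this input the sketch produces three runs instead of the five obtained by sorting fixed chunks of three records, which means fewer runs to merge afterwards.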
Summary
External sorting
Necessary when sorting data that does not fit into RAM
(memory)
External sorting algorithms do not assume that you have
cheap random access to every data element
External Merge Sort
External algorithms are based on a bottom-up merge sort
First, data is split into sorted runs
Runs are then iteratively merged
External merge sort variations:
Simple (2-way)
Multiway (k-way)
Polyphase
Replacement selection