0% found this document useful (0 votes)
6 views23 pages

Chapter7 External Sorting

The document discusses external sorting algorithms, particularly focusing on the challenges of sorting large datasets that do not fit into memory. It highlights the use of Merge Sort as a foundational technique for external sorting, detailing methods such as simple external merge-sort, multiway merge-sort, and polyphase merge-sort to efficiently handle large files. Additionally, it introduces the concept of replacement selection to optimize the creation of sorted runs during the sorting process.

Uploaded by

mudwah877
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views23 pages

Chapter7 External Sorting

The document discusses external sorting algorithms, particularly focusing on the challenges of sorting large datasets that do not fit into memory. It highlights the use of Merge Sort as a foundational technique for external sorting, detailing methods such as simple external merge-sort, multiway merge-sort, and polyphase merge-sort to efficiently handle large files. Additionally, it introduces the concept of replacement selection to optimize the creation of sorted runs during the sorting process.

Uploaded by

mudwah877
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 23

COS 212

External Sorting
The Need for External Sorting
 Many sorting algorithms have been proposed, some
more efficient than others

 What assumption do the sorting algorithms make about


the data?
 Easy random access
 Go to any data[i] in one unit of time

 What if your data is stored in a file on the hard drive?


 Just load it into memory!
Reading and writing to a file
(and other external
 …What if your file is 50GB long?
memory) is notoriously
 Doesn’t fit into memory  slow!
 No problem, just access the file every time you need an
element?..

What about sorting a linked list?


External Sorting
 External sorting algorithms are designed to work well
even when your data does not fit into memory / offers
no quick random access

 How?
 Merge Sort for the rescue!
 Basis for most external sorting routines
 Sort any number of records using a tiny amount of main
memory
 Extreme (base) case: memory can fit two data items only

 Main idea:
 Load data into memory one chunk at a time
 Sort each chunk (sorted chunk = run)
 Use the merge algorithm to combine runs!

 Let’s remind ourselves of how mergesort works…


Main-Memory Merge Sort

Merge-Sort(A)
Merge-Sort(A)
01
01 if
if length(A)
length(A) >> 11 then
then
02
02 Copy
Copy the first half
the first half of
of AA into
into array
array A1
A1 Divide
03
03 Copy
Copy the second half of A into array A2
the second half of A into array A2
04 Merge-Sort(A1)
04
05
Merge-Sort(A1)
Merge-Sort(A2) Conquer
05 Merge-Sort(A2)
06
06 Merge(A,
Merge(A, A1,
A1, A2)
A2) Combine

Running time for Merge sort: O(n log n)


Merge-Sort Recursion Tree
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19

1 2 5 7 9 10 13 19 3 4 6 8 11 12 15 17
log2N
1 2 5 10 7 9 13 19 3 4 8 15 6 11 12 17

2 10 1 5 13 19 7 9 4 15 3 8 12 17 6 11

10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11

N
 Backtracking: merge runs (sorted sequences) of
size x into runs of size 2x, decrease the number of
runs twofold.
 We can’t fit all the data into memory, but we can
fit some of it – what should be stored in RAM?
External-Memory Merge-Sort
1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
Two-way merge
1 2 5 7 9 10 13 19 3 4 6 8 11 12 15 17
Two-way merges
1 2 5 10 7 9 13 19 3 4 8 15 6 11 12 17

Main-memory Main-memory Main-memory Main-memory


sort sort sort sort

10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11

 Bottom-up rather than top-down


 Initial runs do not have to be of size 1, but can rather be of the
size that fits into memory (M data elements)
 We can sort M elements with heapsort or another sorting
algorithm
 All we need to do is merge – merge is sequential, thus
efficient even without random access.
Yes, but… We can’t store two runs in memory! How do we
merge?
Read as much
Simple External Merge-Sort as you can fit
Original unsorted input into memory,
file: then sort!

File 1 81 94 11 96 12 35 17 99 28 58 41 75 15

File 2

File 3

File 4

Read chunks of size M (= 3 in this example) at a time,


sort in RAM, write alternatively to File 3 and File 4

File 1
Write sorted
File 2 runs into
separate
File 3 11 81 94 17 28 99 15 external files
Merge can work directly on
File 4 12 35 96 41 58 75 the files that store sorted
runs
Simple External Merge-Sort
File 1

File 2

File 3 11 81 94 17 28 99 15

File 4 12 35 96 41 58 75

Merge the 1st run of File 3 & File 4, write to File 1;


Merge the 2nd run of File 3 & File 4, write to File 2;
etc.

File 1 11 12 35 81 94 96 15

File 2 17 28 41 58 75 99

File 3

File 4
Simple External Merge-Sort
File 1 11 12 35 81 94 96 15

File 2 17 28 41 58 75 99

File 3

File 4

Merge the 1st run of File 1 & File 2, write to File 3;


Merge the 2nd run of File 1 & File 2, write to File 4.

File 1

File 2

File 3 11 12 17 28 35 41 58 75 81 94 96 99

File 4 15
passes for the
Simple External Merge-Sort merging, + 1
initial pass to
File 1 construct the runs

File 2

File 3 11 12 17 28 35 41 58 75 81 94 96 99

File 4 15

Merge the last two runs, write to File 1

File 1 11 12 15 17 28 35 41 58 75 81 94 96 99

File 2

File 3

File 4
Multiway (k-way) Merge-Sort
 2-way (simple) merge sort requires passes through the
files to sort the data
 Example on the previous slides: passes
 Fewer passes – faster algorithm (less I/O)

 Can we reduce the number of passes?


 Yes we can – by using more files!

 Multiway merge: merge k sorted runs


 How?
 Two runs: pick the smallest one of the two elements
 k runs: pick the smallest one out of k
 Heaps can help!
Multiway (3-way) Merge-Sort k=3
Original unsorted input
file:
File 1 81 94 11 96 12 35 17 99 28 58 41 75 15
File 2

File 3
File 4

File 5

File 6
Phase 1: each run is sorted and recorded into one of 2*k
files
File 1
File 2

File 3 2k =
File 4 11 81 94 41 58 75 6

File 5 12 35 96 15
File 6 17 28 99
Multiway (3-way) Merge-Sort
File 1
Put in a
Filemin-heap
2

File 3
File 4 11 81 94 41 58 75
File 5 12 35 96 15
File 6 17 28 99

Phase 2: Perform 3-way merge on the sorted runs!

File 1 11 12 17 28 35 81 94 96 99
File 2 15 41 58 75
File 3
File 4

File 5

File 6
Multiway (3-way) Merge-Sort
File 1 11 12 17 28 35 81 94 96 99
File 2 15 41 58 75
File 3
File 4

File 5

File 6

Phase 2: Perform 3-way merge on the sorted runs

File 1
File 2

File 3
File 4 11 12 15 17 28 35 41 58 75 81 94 96 99
File 5
passes ⌈ log3 13/3⌉=2
File 6
Polyphase Merge-Sort
 What is the problem with the multiway mergesort?
 The number of files (or other output devices) = 2 * k
 This might be infeasible and/or impractical

 Polyphase merge: use k + 1 instead of 2k files


 How?
 Split the data unevenly between the files, i.e. some files get more
runs than others
 Suppose we have 3 files only, and a total of 8 runs of length M
 Standard split (even): 4 runs per file
 Polyphase split (uneven): 5 runs on one file, 3 runs on another file
 Input: file F1, output: 5 runs on file F2, 3 runs on file F3
 Merge F2 + F3 until you have no more runs on one,
put the result on F1
 F3 is now empty!
 Thus, merge remainder of F2 + F1, put on F3
 Repeat until sorted!
Polyphase Merge-Sort

File 1 81 94 11 96 12 35 17 99 28 58 41 75 15

File 2

File 3

Total number of runs: 5, put the first 3 on F2, last 2


on F3:

File 1

File 2 11 81 94 12 35 96 17 28 99

File 3 41 58 75 15
Polyphase Merge-Sort
File 1

File 2 11 81 94 12 35 96 17 28 99

File 3 41 58 75 15

Merge F2 and F3 into F1: F2 is left with some


baggage!

File 1 11 41 58 75 81 94 12 15 35 96

File 2 17 28 99

File 3

We can now merge F1 and F2 into F3


Polyphase Merge-Sort
File 1 11 41 58 75 81 94 12 15 35 96

File 2 17 28 99

File 3

Merge F1 and F2 into F3: F1 is left with some


baggage!

File 1 12 15 35 96

File 2

File 3 11 17 28 41 58 75 81 94 99

We can now merge F1 and F3 into F2


Polyphase Merge-Sort

File 1 12 15 35 96

File 2

File 3 11 17 28 41 58 75 81 94 99

Merge F1 and F3 into F2: done!

File 1

File 2 11 12 15 17 28 35 41 58 75 81 94 96 99

File 3
Polyphase Merge-Sort
 In the example, we have split 5 runs into 2 + 3 runs
 The split matters:
 if the runs are distributed between two files incorrectly, there will
be a lot of unnecessary and slow back and forth
 How do we know how to split the runs?
 Ask Fibonacci!

 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ....


 For a number of runs = ,
split the runs such that the 1st and the
2nd output have
and runs.
 If the number of runs
is not a Fibonacci
number, pad with
“dummy” runs!
Replacement Selection Merge-Sort
 Let’s reconsider Phase 1 of external merge sort:
Splitting input data into sorted runs
 What we have done so far:
 Fill memory with M data points, sort, store
 Maximum size of initial run: M
 Can we do better?
 Suppose heap sort is utilised to sort M elements
 Min elements are removed one by one from the heap and written
to a file
 As you remove min elements, you free up space in memory!
 Replacement selection:
 After min element was written to a file, read the next input
element;
 If the element > min element just removed, then add it to the
heap!
 If new element is smaller, can’t add to this run - save for the next
 Result: longer runs, fewer merges!
Replacement Selection

File 1 81 94 11 96 12 35 17 99 28 58 41 75 15

11 < 81 >
96 12 12 is not
Heap: 11 94 81 81 94 96 94 96 12 a part of
the heap

We constructed a run > M 94 >


35 35 is not
96 35 12 a part of
Can become very efficient if
the heap
you get runs >> M 96 >
17 End of the
12 35 17 17 35 12 run!
Rebuild
heap

File 2 11 81 94 96
Summary

 External sorting
 Necessary when sorting data that does not fit into RAM
(memory)
 External sorting algorithms do not assume that you have
cheap random access to every data element
 External Merge Sort
 External algorithms are based on a bottom-up merge sort
 First, data is split into sorted runs
 Runs are then iteratively merged
 External merge sort variations:
 Simple (2-way)
 Multiway (k-way)
 Polyphase
 Replacement selection

 Next lecture: Graphs

You might also like