External Sorting: A Technical Paper
INTRODUCTION
One method for sorting a file is to load the file into memory, sort the data in memory, and then write out the results. When the file cannot be loaded into memory because of resource limitations, an external sort is applicable. External sorting refers to the sorting of a file that resides on disk or tape, and it is therefore also referred to as tape sorting.
PHASES
Tape sorting methods generally require two phases: a distribution phase and a merge phase.
Distribution Phase
In the distribution phase, strings (sorted runs of records) of convenient length are created using a suitable internal sorting technique. The strings are generated one at a time, and each string is distributed, according to some rule, to one of the several tapes.
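The distribution phase can be sketched as follows. This is only an illustration under stated assumptions: the names `make_initial_strings` and `distribute` are hypothetical, tapes are modelled as in-memory lists, `capacity` stands for the number of records that fit in memory, and round-robin is just one possible distribution rule.

```python
from itertools import islice

def make_initial_strings(records, capacity):
    """Split an input stream into sorted strings of at most `capacity` records."""
    it = iter(records)
    strings = []
    while True:
        chunk = list(islice(it, capacity))  # as many records as memory can hold
        if not chunk:
            break
        chunk.sort()  # any internal sorting technique would do here
        strings.append(chunk)
    return strings

def distribute(strings, num_tapes):
    """Assign strings to tapes round-robin (an assumed distribution rule)."""
    tapes = [[] for _ in range(num_tapes)]
    for i, s in enumerate(strings):
        tapes[i % num_tapes].append(s)
    return tapes

strings = make_initial_strings([78, 45, 5, 88, 36, 9, 11, 2], capacity=3)
tapes = distribute(strings, num_tapes=2)
```

With a capacity of 3 records, the eight input items yield three strings, the last of which is shorter than the others.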
Merge Phase
In the merge phase, these strings are merged to create longer strings, which are again merged. The process continues until the final merge generates a single string, which is the required sorted file.
Merging of strings-blocks
During a merging process, the strings to be merged are called input strings and the generated merged string is called the output string. Tapes containing input and output strings are called input and output tapes. Because whole strings generally do not fit in memory, each string is considered to be divided into a number of blocks of records. At the beginning of the merge, the leading block of each input string is read into memory. When a particular block is exhausted during merging, the next block of that string is read into the same storage area. Merged records are collected in an output buffer; when the buffer is full, the block of records is written onto the output tape.
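The block-wise merging described above might look like the following sketch. It assumes that input strings are plain Python lists standing in for tapes and that `block_size` is the number of records per block; the function name `merge_blocked` is hypothetical.

```python
def merge_blocked(input_strings, block_size):
    """Merge sorted strings while holding only one block per string in memory."""
    positions = [0] * len(input_strings)   # next unread record of each input string
    buffers = [[] for _ in input_strings]  # current in-memory block of each string
    output_tape, out_buffer = [], []

    def refill(i):
        # Read the next block of string i into the same storage area.
        start = positions[i]
        buffers[i] = input_strings[i][start:start + block_size]
        positions[i] = start + len(buffers[i])

    for i in range(len(input_strings)):
        refill(i)  # leading block of each input string

    while any(buffers):
        # Pick the string whose leading record is smallest.
        i = min((j for j, b in enumerate(buffers) if b),
                key=lambda j: buffers[j][0])
        out_buffer.append(buffers[i].pop(0))
        if not buffers[i]:
            refill(i)  # block exhausted: fetch the next one
        if len(out_buffer) == block_size:
            output_tape.append(out_buffer)  # write a full block to the output tape
            out_buffer = []
    if out_buffer:
        output_tape.append(out_buffer)  # flush the final partial block
    return output_tape

blocks = merge_blocked([[5, 45, 78, 88], [2, 9, 11, 36]], block_size=3)
```

The output is returned as a list of blocks so that each write to the output tape remains visible.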
It is desirable to make the initial strings as long as possible so that fewer passes are required in the merge phase. The tape sorting techniques require that the total number of initial strings be equal to one of certain allowable numbers. The number of strings generated from the given file may not match an allowable number. In such cases, additional dummy strings are added to make the total number of strings equal to the next higher allowable number.
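As a small illustration of padding with dummy strings, the sketch below assumes a k-way balanced merge, for which the allowable numbers are taken to be the powers of k. Other techniques have different allowable numbers, and the function name is hypothetical.

```python
def dummy_strings_needed(num_strings, k):
    """Dummy strings needed to reach the next power of k (assumed allowable numbers)."""
    if k < 2:
        raise ValueError("merge order k must be at least 2")
    allowable = 1
    while allowable < num_strings:
        allowable *= k  # next candidate: 1, k, k^2, ...
    return allowable - num_strings

padding = dummy_strings_needed(27, 2)  # 27 strings padded up to 32
```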
During the sort, some of the data must be stored externally. Typically the data will be stored on tape or disk. The cost of accessing data is significantly greater than either bookkeeping or comparison costs. There may be severe restrictions on access. For example, if tape is used, items must be accessed sequentially.
Sorting by Merging
Sorting by merging is referred to as merge sort. Merging is the process of combining two sorted lists into one sorted list.
The first pass creates strings of length 2 by comparing and, if necessary, exchanging pairs of items. As a result, the pass generates n/2 strings, each of length 2, where n is the number of items in the given file. The second pass merges pairs of strings generated in the previous pass. Successive passes are similar to the second pass: in each pass the number of strings is halved and the size of the strings is doubled.
Input:   78 45 5 88 36 9 11 2
Pass 1:  45 78 | 5 88 | 9 36 | 2 11
Pass 2:  5 45 78 88 | 2 9 11 36
Pass 3:  2 5 9 11 36 45 78 88
The technique is easy to program, particularly when n is a power of 2. It is efficient for large n: about log2 n passes are needed, so the number of comparisons is on the order of n log2 n.
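The pass structure of this 2-way merge can be sketched in a few lines. The function name `merge_passes` is hypothetical, and `heapq.merge` from the standard library stands in for the merging of two strings.

```python
from heapq import merge

def merge_passes(items):
    """Return the list of strings after each pass of a 2-way merge sort."""
    # First pass: form strings of length 2 by comparing (and swapping) pairs.
    strings = [sorted(items[i:i + 2]) for i in range(0, len(items), 2)]
    history = [strings]
    while len(strings) > 1:
        # Later passes: merge adjacent pairs of strings.
        strings = [list(merge(*strings[i:i + 2]))
                   for i in range(0, len(strings), 2)]
        history.append(strings)
    return history

for strings in merge_passes([78, 45, 5, 88, 36, 9, 11, 2]):
    print(strings)
```

For the eight items of the example this takes three passes and ends with the single string 2 5 9 11 36 45 78 88.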
The procedure of the 2-way balanced merge is easily extended to the general k-way balanced merge. The number of tape units required is 2k. In each merge pass, the lengths of the strings are increased by a factor of k and the number of strings is divided by k. Let $n = k^m$ be the number of strings generated in the first pass. The number of merge passes required is then $m = \log_k n$.
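The pass count can be checked with a short sketch. It assumes that when the number of strings is not an exact power of k, the count is rounded up, which is equivalent to padding with dummy strings; the function name is hypothetical.

```python
def merge_passes_required(num_strings, k):
    """Number of k-way merge passes needed to reduce num_strings to one string."""
    if k < 2:
        raise ValueError("merge order k must be at least 2")
    passes = 0
    while num_strings > 1:
        num_strings = -(-num_strings // k)  # ceiling division: strings left after a pass
        passes += 1
    return passes

passes = merge_passes_required(32, 2)  # 32 initial strings, 2-way merge
```

Integer arithmetic is used instead of a floating-point logarithm to avoid rounding errors at exact powers of k.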
EXAMPLE
Balanced Merge (2 way) with 32 strings and 4 tape units:
An entry such as 8(2) denotes 8 strings, each as long as 2 initial strings.

Pass                 A        B        C        D
Distribution pass    0        0        16(1)    16(1)
2-way merge          8(2)     8(2)     0        0
2-way merge          0        0        4(4)     4(4)
2-way merge          2(8)     2(8)     0        0
2-way merge          0        0        1(16)    1(16)
2-way merge          1(32)    0        0        0
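The tape contents in this example can be reproduced with a small simulation. The sketch assumes that the number of initial strings is a power of k, that the strings start out evenly distributed on the last k of the 2k tapes, and that merged strings are spread over the output tapes in turn. The function name is hypothetical.

```python
def balanced_merge_trace(num_strings, k):
    """Tape contents per pass for a k-way balanced merge on 2k tapes."""
    def cell(count, length):
        # "8(2)" means 8 strings of length 2; an empty tape is shown as "0".
        return f"{count}({length})" if count else "0"

    idle = ["0"] * k
    rows = [idle + [cell(num_strings // k, 1)] * k]  # distribution pass
    count, length, to_first_half = num_strings, 1, True
    while count > 1:
        count, length = count // k, length * k  # fewer, longer strings
        outputs = [cell(count // k + (1 if j < count % k else 0), length)
                   for j in range(k)]
        rows.append(outputs + idle if to_first_half else idle + outputs)
        to_first_half = not to_first_half  # input and output tapes swap roles
    return rows

for row in balanced_merge_trace(32, 2):
    print(row)
```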
When t tape units are available, a balanced merge requires that t be even and that the order of the merge be t/2. Thus, for a balanced merge of higher order, a large number of tape units are tied up. Cascade merge allows a (t-1)-way merge with only t tape units. The advantage comes from distributing the initial strings in unequal numbers to (t-1) tapes. Each pass commences with a (t-1)-way merge.
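Number of strings at each level:

Level   Distribution
0       1,0,0
1       1,1,1
2       3,2,1
3       6,5,3
4       14,11,6

Cascade merge with 31 strings and 4 tape units:

Step                     A        B        C        D
Distribution pass        14(1)    11(1)    6(1)     0
Pass 1: 3-way merge      8(1)     5(1)     0        6(3)
        2-way merge      3(1)     0        5(2)     6(3)
        copy             0        3(1)     5(2)     6(3)
Pass 2: 3-way merge      3(6)     0        2(2)     3(3)
        2-way merge      3(6)     2(5)     0        1(3)
        copy             3(6)     2(5)     1(3)     0
Pass 3: 3-way merge      2(6)     1(5)     0        1(14)
        2-way merge      1(6)     0        1(11)    1(14)
        copy             0        1(6)     1(11)    1(14)
Pass 4: 3-way merge      1(31)    0        0        0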
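The unequal distribution used by cascade merge can be generated level by level. The sketch below assumes the usual cascade rule, which the text does not state explicitly: at the next level, the i-th tape receives the sum of the current counts on all but the last i tapes. The function name is hypothetical.

```python
def cascade_distributions(num_input_tapes, levels):
    """String counts per input tape for cascade levels 0..levels."""
    p = num_input_tapes
    dist = [1] + [0] * (p - 1)  # level 0: a single string on one tape
    result = [tuple(dist)]
    for _ in range(levels):
        # Assumed cascade rule: tape i gets the sum of the first p - i counts.
        dist = [sum(dist[:p - i]) for i in range(p)]
        result.append(tuple(dist))
    return result

levels = cascade_distributions(num_input_tapes=3, levels=4)
```

For 4 tape units (3 input tapes), level 4 gives 14, 11 and 6 strings, a total of 31.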
Polyphase merge is similar to cascade merge in that the strings are again distributed unequally to (t-1) tapes, where t is the total number of tape units available. The distribution rule is different from that of the cascade merge. In the merge phase, polyphase merge always uses (t-1)-way merging.
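EXAMPLE:
Number of strings at each level:

Level   Distribution
1       1,1,1
2       2,2,1
3       4,3,2
4       7,6,4
5       13,11,7

Polyphase merge with 31 strings and 4 tape units:

Pass                             A        B        C        D
Distribution pass                13(1)    11(1)    7(1)     0
Merge pass 1 (3-way merging)     6(1)     4(1)     0        7(3)
Merge pass 2 (3-way merging)     2(1)     0        4(5)     3(3)
Merge pass 3 (3-way merging)     0        2(9)     2(5)     1(3)
Merge pass 4 (3-way merging)     1(17)    1(9)     1(5)     0
Merge pass 5 (3-way merging)     0        0        0        1(31)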
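The polyphase passes of this example can be traced with a short simulation. It is a sketch under stated assumptions: each tape is modelled as a `(count, length)` pair, the initial distribution is one of the level distributions (13, 11, 7 here), and the function name is hypothetical. With a distribution that is not one of the levels, the loop would not terminate properly.

```python
def polyphase_trace(distribution):
    """Tape contents as (count, length) pairs after each polyphase pass."""
    # The last tape starts empty and serves as the first output tape.
    tapes = [(c, 1) for c in distribution] + [(0, 0)]
    rows = [list(tapes)]
    while sum(c for c, _ in tapes) > 1:
        out = next(i for i, (c, _) in enumerate(tapes) if c == 0)
        ins = [i for i in range(len(tapes)) if i != out]
        m = min(tapes[i][0] for i in ins)            # merges possible this pass
        merged_length = sum(tapes[i][1] for i in ins)
        for i in ins:
            c, length = tapes[i]
            tapes[i] = (c - m, length) if c > m else (0, 0)
        tapes[out] = (m, merged_length)              # m longer strings written
        rows.append(list(tapes))
    return rows

for row in polyphase_trace((13, 11, 7)):
    print(row)
```

Every pass merges from all three non-empty tapes, and the tape that runs empty becomes the output tape of the next pass.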
When the number of tape units available is t, cascade merge begins each pass with a (t-1)-way merge. As the pass progresses, the order of the merge decreases, and the pass finally ends with a copy operation. Polyphase merge performs only (t-1)-way merges in each pass. This is a distinct advantage of polyphase merge over cascade merge, as it is desirable to have higher-order merges whenever possible.
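The decreasing merge order within a cascade pass can be illustrated with a sketch. It assumes tapes modelled as `(count, length)` pairs with exactly one empty tape at the start of the pass; the function name `cascade_pass` is hypothetical, and an order of 1 in the result stands for the final copy operation.

```python
def cascade_pass(tapes):
    """Run one cascade pass; return the new tape contents and the merge orders used."""
    tapes = list(tapes)
    out = next(i for i, (c, _) in enumerate(tapes) if c == 0)
    inputs = [i for i in range(len(tapes)) if i != out]
    orders = []
    while inputs:
        m = min(tapes[i][0] for i in inputs)  # strings taken from each input tape
        if m == 0:
            break  # nothing left to merge
        merged_length = sum(tapes[i][1] for i in inputs)
        orders.append(len(inputs))  # order of this merge (1 means copy)
        exhausted = next(i for i in inputs if tapes[i][0] == m)
        for i in inputs:
            c, length = tapes[i]
            tapes[i] = (c - m, length) if c > m else (0, 0)
        tapes[out] = (m, merged_length)
        inputs.remove(exhausted)  # the emptied tape becomes the next output
        out = exhausted
    return tapes, orders

after, orders = cascade_pass([(14, 1), (11, 1), (6, 1), (0, 0)])
```

Starting from 14, 11 and 6 strings, the pass performs a 3-way merge, then a 2-way merge, then a copy, whereas a polyphase pass on the same number of tapes would stay 3-way throughout.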
CONCLUSION
Very often the data set to be sorted is so large that internal sorting cannot be used, because the data set does not fit into main memory.
External merge sort keeps disk I/O cost low by reading and writing the external files as few times as possible. Among the merge sort techniques, k-way balanced merge, cascade merge, and polyphase merge are all well suited to external sorting; they differ mainly in how the initial strings are distributed.