Radix Sort: Problem Description
Problem description

Given a set of unsorted items with keys that can be considered as a binary representation of an integer, the bits within the key can be used to sort the set of items. This method of sorting is known as Radix Sort.

Write a program that includes a threaded version of a Radix Sort algorithm that sorts the keys read from an input file, then outputs the sorted keys to another file. The input and output file names shall be the first and second arguments on the command line of the application execution.

The first line of the input text file is the total number of keys (N) to be sorted; this is followed by N keys, one per line, in the file. A key will be a seven-character string made up of printable characters not including the space character (ASCII 0x20). The number of keys within the file is less than 2^31 - 1. Sorted output must be stored in a text file, one key per line.

Timing: If you put timing code into your application to time the sorting process and report the elapsed time, this time will be used for scoring. If no timing code is added, the entire execution time (including time for input and output) will be used for scoring.

Example

Input file:
8
H@skell
surVEYs
sysTEMS
HASKELL
Surveys
1234567
SURveys
systEMS

Output file:
1234567
H@skell
HASKELL
SURveys
Surveys
surVEYs
sysTEMS
systEMS
Serial Algorithm

Radix sort is a sorting algorithm that sorts integers by processing individual digits. Because integers can represent strings of characters (e.g., names or dates) and specially formatted floating point numbers, radix sort is not limited to integers.

A most significant digit (MSD) radix sort can be used to sort keys in lexicographic order. It processes the keys from the most significant (leftmost) digit to the least significant (rightmost) digit, which is the opposite of the sequence used by least significant digit (LSD) radix sorts. Unlike an LSD radix sort, an MSD radix sort does not necessarily preserve the original order of duplicate keys. An MSD radix sort stops rearranging the position of a key when the processing reaches a unique prefix of the key.

Some MSD radix sorts use one level of buckets in which to group the keys (compare counting sort and pigeonhole sort). Other MSD radix sorts use multiple levels of buckets, which form a trie or a path in a trie. A postman's sort (postal sort) is a kind of MSD radix sort. In essence, radix sort uses a tree to sort keys, with each node branching out into R other nodes, where R is the radix used.

A recursively subdividing MSD radix sort algorithm works as follows:

1. Take the most significant digit of each key.
2. Sort the list of elements based on that digit, grouping elements with the same digit into one bucket.
3. Recursively sort each bucket, starting with the next digit to the right.
4. Concatenate the buckets together in order.
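The four steps above can be sketched in C++ as follows. This is an illustration, not the implementation described later: it assumes every key has the same length and uses only the 94 printable characters (ASCII 33 to 126), it stores buckets in std::vector rather than linked lists, and the names msdRadixSort, kFirstChar and kRadix are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Assumed alphabet for this sketch: printable ASCII 33..126 (94 symbols).
constexpr int kFirstChar = 33;
constexpr int kRadix = 94;

// Sorts `keys` lexicographically, starting at character position `digit`.
// Assumes all keys have equal length and contain only characters 33..126.
void msdRadixSort(std::vector<std::string>& keys, std::size_t digit = 0) {
    // Fewer than two keys, or no characters left: nothing to rearrange.
    if (keys.size() < 2 || digit >= keys.front().size()) return;

    // Steps 1 and 2: group the keys into buckets by the current character.
    std::array<std::vector<std::string>, kRadix> buckets;
    for (std::string& key : keys) {
        int index = static_cast<unsigned char>(key[digit]) - kFirstChar;
        buckets[index].push_back(std::move(key));
    }
    keys.clear();

    for (std::vector<std::string>& bucket : buckets) {
        // Step 3: recursively sort each bucket on the next character.
        msdRadixSort(bucket, digit + 1);
        // Step 4: concatenate the buckets in order.
        for (std::string& key : bucket) keys.push_back(std::move(key));
    }
}
```

Calling msdRadixSort(keys) on the eight keys of the example input produces the order shown in the example output, because the buckets are visited in ascending ASCII order.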
Considerations

1. There may be as many as 2^31 - 1 strings. At seven characters plus a line terminator per key, the input file can far exceed 2 GB, so 64-bit file offsets may be required. This is accomplished by defining _FILE_OFFSET_BITS=64 at compilation time.
2. Theoretically, storing the entire tree for 7-byte strings would require 94^7 * 8 bytes of storage (about 5.2 * 10^14 bytes, more than 480 TB). This could cause problems unless memory is managed well.
3. To avoid excessive thread swapping, it is a good idea to control the number of running threads. Hence the number of running threads, in this implementation, is limited to the number of available processors, determined using /proc/cpuinfo.
Parallelization

The recursive version of the most significant digit (MSD) radix sort algorithm is particularly well suited to parallel computing, as each of the subdivisions can be sorted independently of the rest. The motivation here is to engage as many of the available processors as possible while managing memory usage effectively.

Implementation

This implementation uses POSIX threads for the sake of compatibility across systems.

1 Reading strings
  1.1 The input data is stored in linked lists. This avoids the need for a large chunk of contiguous memory, and the nodes of the linked lists can be reused at the higher levels of the tree.
  1.2 As many lists as there are processors in the system (say P) are created, and the input data is stored in these lists.
2 Sorting
  2.1 The P linked lists are processed in parallel. Strings in the linked lists are assigned to 94 buckets corresponding to the printable characters (ASCII 33 to 126). Each bucket is itself a linked list. This is done in the hope that processors will not be idle during the initial assignment to buckets.
  2.2 A different strategy is used for the higher levels of the tree:
    2.2.1 Buckets of the first level that contain more than one string are processed in parallel, P buckets at a time.
    2.2.2 One thread fully processes an entire first-level bucket, using recursion at each level, and then picks up the next available bucket for processing.
    2.2.3 Once all the buckets at a certain level have been processed, the linked lists are merged into a single list and assigned to the parent bucket.
3 Writing sorted strings
  3.1 The sorted list is written to the specified output file.

Testing

The following code was used to generate test data:
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    using namespace std;

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            cerr << "Syntax: " << argv[0] << " <number of strings> <filename>" << endl;
            return EXIT_FAILURE;
        }

        int nRec = atoi(argv[1]);
        cout << "Generating " << nRec << " strings" << endl;

        ofstream fout(argv[2]);
        if (fout.fail()) {
            cerr << "ERROR: File could not be created!" << endl;
            return EXIT_FAILURE;
        }

        // Seed string: seven random printable characters (ASCII 33 to 126).
        char str[8];
        str[7] = '\0';
        srand(0);
        for (int j = 0; j < 7; j++) {
            str[j] = rand() % 94 + 33;
        }

        // First line of the file: the number of keys that follow.
        fout << nRec << endl;
        for (int i = 0; i < nRec; i++) {
            // Re-randomize the last 0 to 7 characters of the previous string.
            // Changing 0 characters repeats the previous string (a duplicate).
            int charsToChange = rand() % 8;
            for (int j = 0; j < charsToChange; j++) {
                str[6 - j] = rand() % 94 + 33;
            }
            fout << str << endl;
        }
        fout.close();
        return EXIT_SUCCESS;
    }
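The generated strings are fairly random. Each string is derived from the previous one by re-randomizing between zero and seven of its trailing characters, so neighbouring strings often share a prefix, and duplicate strings are generated as well.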
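The work distribution described in steps 2.2.1 and 2.2.2 of the Implementation section, where each thread fully sorts one first-level bucket and then picks up the next available one, can be sketched as follows. This is a simplified illustration and differs from the implementation described above in several labeled ways: it uses std::thread instead of POSIX threads, std::vector instead of linked lists, and it performs the first-level split serially rather than over P input lists in parallel. All names (splitByDigit, sortBucket, parallelRadixSort) are hypothetical, and keys are assumed to have equal length with characters in ASCII 33 to 126.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>

constexpr int kFirstChar = 33;  // first printable non-space character
constexpr int kRadix = 94;      // printable characters 33..126

using Bucket = std::vector<std::string>;

// Moves every key into one of 94 buckets chosen by its character at `digit`.
std::array<Bucket, kRadix> splitByDigit(Bucket& keys, std::size_t digit) {
    std::array<Bucket, kRadix> buckets;
    for (std::string& key : keys) {
        int index = static_cast<unsigned char>(key[digit]) - kFirstChar;
        buckets[index].push_back(std::move(key));
    }
    keys.clear();
    return buckets;
}

// Serial recursive MSD sort of one bucket, starting at position `digit`.
void sortBucket(Bucket& keys, std::size_t digit) {
    if (keys.size() < 2 || digit >= keys.front().size()) return;
    std::array<Bucket, kRadix> buckets = splitByDigit(keys, digit);
    for (Bucket& bucket : buckets) {
        sortBucket(bucket, digit + 1);
        for (std::string& key : bucket) keys.push_back(std::move(key));
    }
}

// Sorts `keys` using up to `numThreads` worker threads.
void parallelRadixSort(Bucket& keys, unsigned numThreads) {
    if (keys.size() < 2) return;
    if (numThreads == 0) numThreads = 1;

    // First level: done serially here (the text does this step in parallel).
    std::array<Bucket, kRadix> buckets = splitByDigit(keys, 0);

    // Shared counter: each worker claims the next unprocessed bucket index.
    std::atomic<std::size_t> next{0};
    auto worker = [&buckets, &next]() {
        for (std::size_t i = next.fetch_add(1); i < buckets.size();
             i = next.fetch_add(1)) {
            sortBucket(buckets[i], 1);  // one thread sorts the whole bucket
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t) threads.emplace_back(worker);
    for (std::thread& thread : threads) thread.join();

    // Merge the sorted first-level buckets back in character order.
    for (Bucket& bucket : buckets) {
        for (std::string& key : bucket) keys.push_back(std::move(key));
    }
}
```

Because each bucket index is claimed by exactly one worker, no two threads touch the same bucket, so no locking is needed beyond the atomic counter. A call such as parallelRadixSort(keys, P), with P the processor count, keeps at most P threads running at a time.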
Performance Expectations

1. This implementation is expected to perform well given a well-distributed set of input strings.
2. Performance would be worst when all the input strings have the same first character, because all the strings then fall into a single first-level bucket, which is processed by one thread.
3. Another advantage of this implementation is its low memory requirement. It managed to sort 10 million strings on a machine with 750MB of RAM while using only 20.3% of the available memory (as reported by top). In comparison, the default qsort implementation uses 25.3% of the available memory to sort the same input file.

Observations

The application was tested with up to 10 million strings, generated using the test program described above, on a relatively low performance machine. The observations were as follows:

Number of strings    Radix Sort Timing
10,000               0.139 sec
100,000              1.142 sec
1,000,000            10.429 sec
10,000,000           94.087 sec