The First Self-Contained Hardware Implementation of Radix Sort

This document presents a self-contained hardware implementation of a parallel radix sorter that operates in linear time, utilizing predication, prefix sum, and compaction modules. The design addresses performance issues related to thread divergence and uncoalesced memory access patterns, achieving efficient sorting without the need for external memory. The implementation has undergone complete logical and physical synthesis, demonstrating significant performance metrics and area specifications.


The First Self-Contained Hardware Implementation of the Parallel Radix Sort

Nathan V. Morrical¹, Patsy Cadareanu¹, Walter Lau Neto¹, and Max D. Austin¹
¹ The University of Utah, Salt Lake City, UT, USA

As chip density reaches its limits, many programmers are switching to concurrent programming models to meet high performance computing demands. These concurrent models achieve massive performance gains by distributing similar computation to local compute modules, which typically work in a Single Instruction Multiple Data (SIMD) fashion. However, performance suffers when thread divergence is high, since individual threads need to execute different instructions which are often not parallelizable. Even with newer generation Graphics Processing Units (GPUs) like the Nvidia Volta and Turing architectures, thread divergence tends to cause uncoalesced memory access patterns, reducing the achievable memory bandwidth.

These complications can be resolved by aggregating similar computation together, which is achievable through a sort. As a result, many database systems, computer graphics data structures, and linear algebra systems all depend on efficient sorting as a fundamental building block. However, even when parallelized, sorting is computationally demanding and becomes the speed-limiting factor. For example, more than 50% of the runtime of the parallel BVH construction algorithm in [1] is spent sorting numbers.

To improve sorting performance, this paper presents a hardware-accelerated parallel radix sorter capable of sorting an arbitrary number of elements and bits in linear time. Parallel radix sort was chosen because it is a non-comparative sorting algorithm, which allows for improved parallelization. It is also the fastest GPU parallel sorting algorithm to date [2], and a variant is used by Nvidia's Thrust library to perform sorting. The parallel radix sort is composed of three repeated stages.

The first stage is predication. This works as follows: for each element in parallel, copy the bit at the current iteration. The predication for each element equals the extracted bit compared with the ascending/descending flag.

The second stage is the prefix sum: for each predication element in parallel, take the sum of all elements before and including the current predication element. Figure 1 shows an example of a prefix-sum iteration on 8 elements of 3 bits for a better visualization.

The final stage is compaction: if the predication corresponding to an element is 1, move that element to the left. Otherwise, move that element to the right. For an example showing the radix sort at work for 4 elements of 3 bits, see Figure 2.

These three stages are repeated for the total number of bits to sort.

Our implementation of the parallel radix sorting algorithm is broken up into 4 modules. Three modules are used for each of the three major stages of the sorting algorithm, processing N input elements. The last module defines a finite state machine which iterates over the three stages K times, where K is the number of bits to sort. Figure 3 shows the block diagram for this design.

The predication module is composed of a set of MUXes, one per element. The select line for these MUXes controls which bit to extract from each element. The Verilog pseudo-code used for this step is included in Figure 4. The prefix sum module is composed of several prefix iteration modules (seen in Figure 5), and works similarly to a Kogge-Stone adder. Figure 6 implements this module in pseudo-code. The compaction module is composed of several parallel MUXes, one per element. The select line for the compaction MUXes is driven using a procedural address computation, as seen in Figure 7. Finally, the finite state machine is composed of a register containing the current iteration, and the instantiation of the predication, prefix sum, and compaction modules. The pseudo-code for this module is included in Figure 8.

Our implementation can be compared against a similar, although more complicated, design proposed by Liu et al. [3], which requires external memory. To our knowledge, no such device has been fabricated, packaged, and tested at this time. We believe our implementation is the first to undergo complete logical and physical synthesis in the TSMC 180 nm technology. In addition, the proposed radix sorter will sort a specified number of bits in either ascending or descending order, and requires no external memory, making it completely self-contained.

In application, several instances of our radix sorter would be used in combination to sort separate contiguous sections of a larger sequence of numbers. This would resolve the uncoalesced access patterns of a typical parallel radix sorter by aggregating subsequences in a local memory array, at which point coalesced rearrangement can occur at a higher level.

Figure 9 presents the results after running logic synthesis with the Design Compiler. The total area of the chip is 314,741 µm², with the compaction module taking up the most area at 35,504 µm². The total power dissipation simulated for the chip is 4.75 mW, and the critical path delay is simulated at 2.35 ns. Figure 10 shows the results after running physical synthesis, i.e., place and route (PnR), with Innovus, which considers wire capacitances and resistances. The final total area of the chip after PnR is 1,169,641 µm², the total power dissipation is 35.84 mW, and the arrival time is 4.59 ns. Figure 11 shows the final die where all metrics were extracted for both logic and physical synthesis.

Acknowledgements

The authors would like to acknowledge Edouard Giacomin for his assistance throughout this project.

References

[1] T. Karras, "Maximizing parallelism in the construction of BVHs, octrees, and k-d trees," in Proceedings of the Fourth ACM SIGGRAPH/Eurographics Conference on High-Performance Graphics, pp. 33-37, Eurographics Association, 2012.
[2] M. C. Delorme, T. S. Abdelrahman, and C. Zhao, "Parallel Radix Sort on the AMD Fusion Accelerated Processing Unit," in Proceedings of International Conference on Parallel Processing, pp. 339-348, 2013.
[3] X. Liu, S. Li, K. Fang, Y. Ni, Z. Li, and Y. Deng, "RadixBoost: A hardware acceleration structure for scalable radix sort on graphic processors," in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1174-1177, IEEE, 2015.

Figure 1: Example showing the prefix-sum iteration of the radix sort on 8 elements of 3 bits.

Figure 5: Verilog pseudo-code of the helper function "prefix-iteration" called in the prefix-sum module, as seen in Figure 6.

Figure 6: Verilog pseudo-code of the prefix-sum module.
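
As a rough software illustration of the prefix-sum stage described in the text (this is not the authors' Verilog), the Python sketch below builds an inclusive prefix sum from repeated stride-doubling steps, the same pattern a Kogge-Stone adder uses for carries. The function names `prefix_iteration` and `prefix_sum` are chosen here to echo the figure captions; their signatures and the list-based data layout are assumptions.

```python
def prefix_iteration(values, stride):
    """One scan step: every position adds the value `stride` places to its left.

    In hardware all positions would update in parallel; the list
    comprehension reads only the old values, which models that behaviour.
    """
    return [
        v + (values[i - stride] if i >= stride else 0)
        for i, v in enumerate(values)
    ]


def prefix_sum(predicates):
    """Inclusive prefix sum using about log2(N) prefix iterations."""
    values = list(predicates)
    stride = 1
    while stride < len(values):
        values = prefix_iteration(values, stride)
        stride *= 2  # strides 1, 2, 4, ... as in a Kogge-Stone network
    return values


predicates = [1, 0, 1, 1, 0, 0, 1, 0]
print(prefix_sum(predicates))  # [1, 1, 2, 3, 3, 3, 4, 4]
```

Each output entry counts how many predicates are set up to and including that position, which is the quantity the compaction stage needs to compute destination addresses.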

Figure 2: Example showing the radix sort at work for 4 elements of 3 bits.

Figure 7: Verilog pseudo-code of the compaction module.
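
One way the address computation behind the compaction stage could work is sketched below in Python. It is a software model under stated assumptions, not the authors' circuit: elements whose predicate is 1 are packed to the left in their original order, the rest follow on the right, and each destination is derived from the inclusive prefix sum. The names `compaction_addresses` and `compact` are hypothetical, and the sketch scatters each input to its destination, whereas the hardware is described as using one MUX per element.

```python
from itertools import accumulate


def compaction_addresses(predicates):
    """Destination index for every element, derived from the prefix sum."""
    sums = list(accumulate(predicates))      # inclusive prefix sum
    total_ones = sums[-1] if sums else 0     # size of the left partition
    addresses = []
    for i, (p, s) in enumerate(zip(predicates, sums)):
        if p == 1:
            # s - 1 predicate-1 elements precede this one.
            addresses.append(s - 1)
        else:
            # i - s predicate-0 elements precede this one.
            addresses.append(total_ones + i - s)
    return addresses


def compact(elements, predicates):
    """Stable partition: predicate-1 elements left, predicate-0 elements right."""
    out = [None] * len(elements)
    for element, address in zip(elements, compaction_addresses(predicates)):
        out[address] = element
    return out


print(compact(["a", "b", "c", "d"], [0, 1, 0, 1]))  # ['b', 'd', 'a', 'c']
```

Because the addresses preserve the relative order inside each partition, repeating this step once per bit yields a stable radix sort.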

Figure 3: Hardware block diagram of our radix sorter.

Figure 4: Verilog pseudo-code of the predication module.

Figure 8: Verilog pseudo-code of the finite state machine module.
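
To show how predication and the iteration control fit together with the other two stages, here is a minimal end-to-end software sketch in Python. It is not the authors' Verilog and rests on several assumptions: bits are processed from least significant to most significant, the order flag is encoded as `descending` (0 for ascending, 1 for descending), and a predicate of 1 means the extracted bit equals that flag. All function names are hypothetical, and the simple loop in `radix_sort` stands in for the finite state machine and its iteration register.

```python
def predicate(elements, bit_index, descending):
    """Stage 1: extract one bit per element and compare it with the order flag."""
    flag = 1 if descending else 0
    return [int(((e >> bit_index) & 1) == flag) for e in elements]


def prefix_sum(predicates):
    """Stage 2: inclusive running count of set predicates (sequential model)."""
    sums, running = [], 0
    for p in predicates:
        running += p
        sums.append(running)
    return sums


def compact(elements, predicates, sums):
    """Stage 3: stable partition, predicate-1 elements left, the rest right."""
    total = sums[-1] if sums else 0
    out = [0] * len(elements)
    for i, e in enumerate(elements):
        address = sums[i] - 1 if predicates[i] else total + i - sums[i]
        out[address] = e
    return out


def radix_sort(elements, num_bits, descending=False):
    """Driver loop in place of the state machine: K passes, one per bit."""
    data = list(elements)
    for bit_index in range(num_bits):  # assumed order: least significant bit first
        p = predicate(data, bit_index, descending)
        s = prefix_sum(p)
        data = compact(data, p, s)
    return data


print(radix_sort([5, 2, 7, 0], num_bits=3))                   # [0, 2, 5, 7]
print(radix_sort([5, 2, 7, 0], num_bits=3, descending=True))  # [7, 5, 2, 0]
```

With N elements and K bits the loop performs K passes over N elements, which matches the linear-time behaviour claimed in the text when K is fixed.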
Figure 9: The results of logic synthesis.

Figure 10: The results of physical synthesis.
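
The synthesis figures quoted in the text can be put side by side with a little arithmetic. The Python sketch below uses only the numbers reported for logic synthesis and for place and route; the clock-frequency estimate is an assumption made here (one clock cycle per reported delay) and is not a figure stated by the authors.

```python
# Values reported in the text for logic synthesis and for place and route (PnR).
logic = {"area_um2": 314_741, "power_mw": 4.75, "delay_ns": 2.35}
physical = {"area_um2": 1_169_641, "power_mw": 35.84, "delay_ns": 4.59}


def ratio(key):
    """How many times larger the post-PnR value is than the logic-synthesis one."""
    return physical[key] / logic[key]


def max_clock_mhz(delay_ns):
    """Upper bound on clock frequency, assuming one cycle per reported delay."""
    return 1_000.0 / delay_ns


for key in logic:
    print(f"{key}: x{ratio(key):.2f}")
print(f"logic synthesis: {max_clock_mhz(logic['delay_ns']):.0f} MHz")
print(f"after PnR:       {max_clock_mhz(physical['delay_ns']):.0f} MHz")
```

Under that assumption, place and route increases area by roughly 3.7 times and power by roughly 7.5 times, and about doubles the delay relative to the logic-synthesis estimates.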

Figure 11: Physical layout and logical synthesis layout simulation.
