The First Self-Contained Hardware Implementation of Radix Sort

This document presents a self-contained hardware implementation of a parallel radix sorter that operates in linear time, utilizing predication, prefix sum, and compaction modules. The design addresses performance issues related to thread divergence and uncoalesced memory access patterns, achieving efficient sorting without the need for external memory. The implementation has undergone complete logical and physical synthesis, demonstrating significant performance metrics and area specifications.

The First Self-Contained Hardware Implementation of the Parallel Radix Sort

Nathan V. Morrical¹, Patsy Cadareanu¹, Walter Lau Neto¹, and Max D. Austin¹
¹ The University of Utah, Salt Lake City, UT, USA

As chip density reaches its limits, many programmers are switching to concurrent programming models to meet high performance computing demands. These concurrent models achieve massive performance gains by distributing similar computation to local compute modules, which typically work in a Single Instruction Multiple Data (SIMD) fashion. However, performance suffers when thread divergence is high, since individual threads need to execute different instructions which are often not parallelizable. Even with newer generation Graphics Processing Units (GPUs) like the Nvidia Volta and Turing architectures, thread divergence tends to cause uncoalesced memory access patterns, which reduces the achievable memory bandwidth.

These complications can be resolved by aggregating similar computation together, which is achievable through a sort. As a result, many database systems, computer graphics data structures, and linear algebra systems all depend on efficient sorting as a fundamental building block. However, even when parallelized, sorting is computationally demanding and becomes the speed-limiting factor. For example, more than 50% of the time in the parallel BVH construction algorithm of [1] is spent sorting numbers.

To improve sorting performance, this paper presents a hardware-accelerated parallel radix sorter capable of sorting an arbitrary number of elements and bits in linear time. Parallel radix sort was chosen because it is a non-comparative sorting algorithm, which allows for improved parallelization. It is also the fastest GPU parallel sorting algorithm to date [2], and a variant is used by Nvidia's Thrust library to perform sorting. The parallel radix sort is composed of three repeated stages.

The first stage is predication. This works as follows: For each element in parallel, copy the bit at the current iteration. The predication for each element equals the extracted bit compared with the ascending/descending flag.

The second stage is the prefix sum: For each predication element in parallel, take the sum of all elements before and including the current predication element. Figure 1 shows an example of a prefix-sum iteration on 8 elements of 3 bits for a better visualization.

The final stage is compaction: If the predication corresponding to an element is 1, move that element to the left. Otherwise, move that element to the right. For an example showing the radix sort at work for 4 elements of 3 bits, see Figure 2.

These three stages are repeated for the total number of bits to sort.

Our implementation of the parallel radix sorting algorithm is broken up into 4 modules. Three modules are used for each of the three major stages of the sorting algorithm, processing N input elements. The last module defines a finite state machine which iterates over the three stages K times, where K is the number of bits to sort. Figure 3 shows the block diagram for this design.

The predication module is composed of a set of MUXes, one per element. The select line for these MUXes controls which bit to extract from each element. The Verilog pseudo-code used for this step is included in Figure 4. The prefix sum module is composed of several prefix iteration modules (seen in Figure 5), and works similarly to a Kogge-Stone adder. Figure 6 implements this module in pseudo-code. The compaction module is composed of several parallel MUXes, one per element. The select line for the compaction MUXes is driven using a procedural address computation, as seen in Figure 7. Finally, the finite state machine is composed of a register containing the current iteration, and the instantiation of the predication, prefix sum, and compaction modules. The pseudo-code for this module is included in Figure 8.

Our implementation can be compared against a similar, although more complicated, design proposed by Liu et al. [3], which requires external memory. To our knowledge, no such device has been fabricated, packaged, and tested at this time. We believe our implementation is the first to undergo complete logical and physical synthesis, targeting the TSMC 180 nm technology. In addition, the proposed radix sorter will sort a specified number of bits in either ascending or descending order, and requires no external memory, making it completely self-contained.

In application, several instances of our radix sorter would be used in combination to sort separate contiguous sections of a larger sequence of numbers. This would resolve the uncoalesced access patterns of a typical parallel radix sorter by aggregating subsequences in a local memory array, at which point coalesced rearrangement can occur at a higher level.

Figure 9 presents the results after running logic synthesis with the Design Compiler. The total area of the chip is 314,741 µm², with the compaction module taking up the most area at 35,504 µm². The total power dissipation simulated for the chip is 4.75 mW, and the critical path delay is simulated at 2.35 ns. Figure 10 shows the results after running physical synthesis, i.e., place and route (PnR), with Innovus, which considers wire capacitances and resistances. The final total area of the chip after PnR is 1,169,641 µm², the total power dissipation is 35.84 mW, and the arrival time is 4.59 ns. Figure 11 shows the final die where all metrics were extracted for both logic and physical synthesis.

Acknowledgements

The authors would like to acknowledge Edouard Giacomin for his assistance throughout this project.

References

[1] T. Karras, "Maximizing parallelism in the construction of BVHs, octrees, and k-d trees," in Proceedings of the Fourth ACM SIGGRAPH/Eurographics Conference on High-Performance Graphics, pp. 33-37, Eurographics Association, 2012.
[2] M. C. Delorme, T. S. Abdelrahman, and C. Zhao, "Parallel Radix Sort on the AMD Fusion Accelerated Processing Unit," in Proceedings of International Conference on Parallel Processing, pp. 339-348, 2013.
[3] X. Liu, S. Li, K. Fang, Y. Ni, Z. Li, and Y. Deng, "RadixBoost: A hardware acceleration structure for scalable radix sort on graphic processors," in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1174-1177, IEEE, 2015.
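
As a software illustration of the three stages and the K-iteration loop described in the text, the following is a minimal Python sketch and not the authors' Verilog. All function names are hypothetical. It also assumes one particular reading of "compared with the ascending/descending flag": the predicate is 1 when the extracted bit differs from the flag, so that an ascending sort moves elements with a 0 bit to the left. The text says only that the stages repeat for each bit to sort; processing the least significant bit first is a further assumption.

```python
def predicate(values, bit, ascending=True):
    """Stage 1: extract one bit per element and compare it with the direction flag."""
    flag = 1 if ascending else 0
    # Assumption: predicate is 1 when the bit differs from the flag, so an
    # ascending sort sends elements whose current bit is 0 to the left.
    return [1 if ((v >> bit) & 1) != flag else 0 for v in values]


def inclusive_prefix_sum(preds):
    """Stage 2: out[i] is the sum of preds[0..i] (sequential reference model)."""
    out, running = [], 0
    for p in preds:
        running += p
        out.append(running)
    return out


def compact(values, preds, sums):
    """Stage 3: predicate-1 elements move left, the rest move right, order preserved."""
    ones = sums[-1] if sums else 0
    out = [None] * len(values)
    for i, v in enumerate(values):
        if preds[i]:
            out[sums[i] - 1] = v          # slot among the predicate-1 elements
        else:
            out[ones + i - sums[i]] = v   # slot among the remaining elements
    return out


def radix_sort(values, num_bits, ascending=True):
    """Model of the finite state machine: run the three stages K = num_bits times."""
    data = list(values)
    for bit in range(num_bits):  # assumed order: least significant bit first
        preds = predicate(data, bit, ascending)
        sums = inclusive_prefix_sum(preds)
        data = compact(data, preds, sums)
    return data
```

For instance, `radix_sort([5, 2, 7, 0], 3)` yields `[0, 2, 5, 7]`, and passing `ascending=False` yields `[7, 5, 2, 0]`. Because each pass is a stable partition, the result after K passes is ordered on the K lowest bits.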
Figure 1: Example showing the prefix-sum iteration of the radix sort on 8 elements of 3 bits.

Figure 2: Example showing the radix sort at work for 4 elements of 3 bits.

Figure 3: Hardware block diagram of our radix sorter.

Figure 4: Verilog pseudo-code of the predication module.

Figure 5: Verilog pseudo-code of the helper function "prefix-iteration" called in the prefix-sum module, as seen in Figure 6.

Figure 6: Verilog pseudo-code of the prefix-sum module.

Figure 7: Verilog pseudo-code of the compaction module.

Figure 8: Verilog pseudo-code of the finite state machine module.

Figure 9: The results of logic synthesis.

Figure 10: The results of physical synthesis.

Figure 11: Physical layout and logical synthesis layout simulation.
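
The prefix sum module is described as a chain of prefix iteration modules that works similarly to a Kogge-Stone adder. The following Python sketch shows one way such a structure can compute an inclusive prefix sum in about log2(N) doubling steps. It is an illustration under assumptions, not a transcription of Figures 5 and 6: the function names and the stride-doubling schedule are hypothetical, chosen to mirror the usual Kogge-Stone pattern.

```python
def prefix_iteration(values, stride):
    """One doubling step: each element adds the element `stride` positions to its left.

    Elements closer than `stride` to the start pass through unchanged. In hardware,
    every position would be updated in parallel within the same step.
    """
    return [v + values[i - stride] if i >= stride else v
            for i, v in enumerate(values)]


def kogge_stone_prefix_sum(values):
    """Inclusive prefix sum built from ceil(log2(N)) prefix iterations."""
    out = list(values)
    stride = 1
    while stride < len(out):
        out = prefix_iteration(out, stride)
        stride *= 2  # strides 1, 2, 4, ... as in a Kogge-Stone network
    return out
```

For the 8 predicates `[1, 0, 1, 1, 0, 1, 0, 1]`, three iterations (strides 1, 2, and 4) produce `[1, 1, 2, 3, 3, 4, 4, 5]`, the same result a sequential running sum would give.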
