0% found this document useful (0 votes)
16 views13 pages

An Efficient O N Comparison-Free Sorting Algorithm

This paper introduces a novel comparison-free sorting algorithm that operates in O(N) time complexity, utilizing a hardware structure that avoids complex circuitry and SRAM-based memory. The algorithm sorts integer elements on-the-fly using simple registers and matrix-mapping operations, achieving a sorting time of approximately 4-6 µs for 1024 elements with low power consumption. The design is scalable and efficient, making it suitable for applications in graphics processing, network routing, and video processing DSP chips.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views13 pages

An Efficient O N Comparison-Free Sorting Algorithm

This paper introduces a novel comparison-free sorting algorithm that operates in O(N) time complexity, utilizing a hardware structure that avoids complex circuitry and SRAM-based memory. The algorithm sorts integer elements on-the-fly using simple registers and matrix-mapping operations, achieving a sorting time of approximately 4-6 µs for 1024 elements with low power consumption. The design is scalable and efficient, making it suitable for applications in graphics processing, network routing, and video processing DSP chips.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

1930 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO.

6, JUNE 2017

An Efficient O(N ) Comparison-Free Sorting


Algorithm
Saleh Abdel-Hafeez, Member, IEEE, and Ann Gordon-Ross, Member, IEEE

Abstract— In this paper, we propose a novel sorting algorithm Due to the ever-increasing computational power of parrallel
that sorts input data integer elements on-the-fly without any com- processing on many core CPU- and GPU-based process-
parison operations between the data—comparison-free sorting. ing systems, much research has focused on harnessing
We present a complete hardware structure, associated timing
diagrams, and a formal mathematical proof, which show an the computational power of these resources for efficient
overall sorting time, in terms of clock cycles, that is linearly sorting [17]–[20]. However, since not all computing domains
proportional to the number of inputs, giving a speed complexity and sorting applications can leverage the high throughput
on the order of O(N). Our hardware-based sorting algorithm pre- of these systems, there is still a great need for novel and
cludes the need for SRAM-based memory or complex circuitry, transformative sorting methods. Additionally, there is no clear
such as pipelining structures, but rather uses simple registers to
hold the binary elements and the elements’ associated number of dominate sorting algorithm due to many factors [21]–[24],
occurrences in the input set, and uses matrix-mapping operations including the algorithm’s percentage utilization of the available
to perform the sorting process. Thus, the total transistor count CPU/GPU resources, the specific data type being sorted,
complexity is on the order of O(N). We evaluate an application- amount of data being sorted.
specified integrated circuit design of our sorting algorithm for a To address these challenges, much research has focused
sample sorting of N = 1024 elements of size K = 10-bit using
90-nm Taiwan Semiconductor Manufacturing Company (TSMC) on architecting customized hardware designs for sorting algo-
technology with a 1 V power supply. Results verify that our rithms in order to fully utilize the hardware resources and
sorting requires approximately 4–6 µs to sort the 1024 elements provide custom, cost-effective hardware processing [2]–[27].
with a clock cycle time of 0.5 GHz, consumes 1.6 mW of power, However, due to the inherent complexity of the sorting
and has a total transistor count of less than 750 000. algorithms, efficient hardware implementation is challenging.
Index Terms— 90-nm TSMC, comparison free, Gigahertz clock To realize fast and power-efficient hardware sorting, a sig-
cycle, one-hot weight representation, sorting algorithms, SRAM, nificant amount of hardware resources are required, including,
speed complexity O(N). but not limited to, comparators, memory elements, large global
memories, and complex pipelining, in addition to complicated
I. I NTRODUCTION , M OTIVATION , AND R ELATED W ORK local and global control units.
Most prior work on hardware sorting designs are imple-
S ORTING algorithms have been widely researched for
decades [1]–[6] due to the ubiquitous need for sorting in
many application domains [7]–[10]. Sorting algorithms have
mented based on some modification of traditional mathemati-
cal algorithms [28]–[31], or are based on some modified net-
been specialized for particular sorting requirements/situations, work of switching structures [32]–[34] with partially parallel
such as large computations for processing data [11], high- computing processing and pipelining stages. In these sorting
speed sorting [12], improving memory performance [13], architectures, comparison units are essential components that
sorting using a single CPU [14], exploiting the parallelism are characterized by high-power consumption and feedback
of multiple CPUs [15], parallel processing for grid-computing control logic delays. These sorting methods iteratively move
in order to leverage the CPU’s powerful computing resources data between comparison units and local memories, requiring
for big data processing [16]. wide, high-speed data buses, involving numerous shift, swap,
comparison, and store/fetch operations, and have complicated
Manuscript received July 6, 2016; revised October 22, 2016 and control logic, all of which do not scale well and may need spe-
January 16, 2017; accepted January 16, 2017. Date of publication February 22, cialization for certain data-type particulars. Due to the inherent
2017; date of current version May 22, 2017. This work was supported in
part by the National Science Foundation under Grant CNS-0953447 and in mixture of data processing and control logic within the sorting
part by Nvidia and Synopsys. Any opinions, findings, and conclusions or structure’s processing elements, designing these structures can
recommendations expressed in this material are those of the author(s) and do be cumbersome, imposing large design costs in terms of area,
not necessarily reflect the views of the National Science Foundation.
S. Abdel-Hafeez is with Jordan University of Science and Technology, Irbid power, and processing time. Furthermore, these structures are
22110, Jordan (e-mail: [email protected]). not inherently scalable due to the complexity of integrating
A. Gordon-Ross is with the Department of Electrical and Computer Engi- and combining the data path and control logic within the
neering, University of Florida, Gainesville, FL 32611 USA and also with the
National Science Foundation Center for High-Performance Reconfigurable processing units, thus potentially requiring a full redesign for
Computing, University of Florida, Gainesville, FL 32611 USA (e-mail: different data sizes, as well as complex connective wiring with
[email protected]). high fan-out and fan-in in addition to coupling effects, thus
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. circuit timing issues are challenging to address. Additionally,
Digital Object Identifier 10.1109/TVLSI.2017.2661746 if multiple processors are used along with pipelining stages
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
ABDEL-HAFEEZ AND GORDON-ROSS: EFFICIENT O( N ) COMPARISON-FREE SORTING ALGORITHM 1931

and global memories, the data must be globally merged comparison-free sorting algorithm with illustrative exam-
from these stages to output the complete final sorted data ples and Section IV provides a mathematical analysis.
set [35], [36]. Section V details the hardware data path and control logic
To address these challenges, in this paper, we propose implementations along with timing diagrams. Section VI
a new sorting algorithm targeted for custom, IC-designed presents our simulation results, and Section VII discusses our
applications that sort small- to moderate-sized input sets, conclusions, which elaborate on the overall results and our
such as graphics accelerators, network routers, and video design’s hardware advantages.
processing DSP chips [12], [33], [44], [46]. For example,
graphics processing uses a painter unit that renders objects II. R ELATED W ORK
according to the object’s depth value such that the object In order to provide high scalability, it is critical to design a
can be displayed in the correct order on the screen. In video sorting method with timing and circuit complexity that scales
processing, fast computation is required for small matrices linearly with the number of input elements N [i.e., the circuit
in a frame in order to increase the resolution using digi- timing delay and circuit complexity are on the complexity
tal filters that leverage sorting algorithms. Even though we order of O(N)]. Although some recent works showed linear
present our design based on these scenarios, our design also scalability, these works’ O(N) notations hide a large scalar
supports processing large input sets by subsequently process- value [4], [27], [32], [34] and these methods have expensive
ing the data in multiple, smaller input sets (i.e., in sets of circuit complexity with respect to multiprocessing, local and
N < 100 000) using fast computations, and then merging global memories, pipelining, and control units with special
these sets. However, since applications with larger input instruction sets, in addition to high-cost technology power
sets (on the order of millions) are usually embedded into factors.
systems with large computational resources, such as data min- Other recent works [2], [25], [37]–[42] divide the sorting
ing and database visualization applications running on high- algorithm design into smaller computation partitions, where
performance grid computing and GPU accelerators [17]–[20], each partition integrates control logic and the partition’s com-
these applications can harness those powerful resources for parison operations with feedback decisions from neighbor-
sorting. ing partitions. A global control unit coordinates this control
Our sorting algorithm’s main features and contributions to streamline the data flow between the partitions and the
include as follows. partitions’ associated memories to store temporary data that
1) Our design affords continuous sorting of input element is transferred between partitions. In addition to the complex
sets, where each set can hold any type and distribu- circuitry required to maintain inter-partition connectivity and
tion (ordering) of data elements. Sorting is triggered redundant intra-partition control circuitry, a complex global
with a start-sort signal and sorting ends when a done- memory organization is required.
sorting signal is asserted by the design, which subse- Alternative methods [43]–[45] attempt to eliminate
quently begins sorting the next input set, thus affording comparators by introducing a rank (sorted) ordering
continuous, end-to-end sorting. approach. In [43], a bit-serial sorter architecture was
2) Our sorting design does not require any ALU- implemented based on a rank-order filter (ROF), but
comparisons/shifting-swapping, complex circuitry, comparators were still used to transform the programmable
or SRAM-based memory, and processes data in a capacitive threshold logic (CTL) to a majority voting decision.
forward moving direction through the circuit. Our That design used large array cells of ROF and CTL decisions
design’s simplicity results in a highly linearized with a pipelined architecture. The design in [44] counted the
sorting method with a CMOS transistor count that number of occurrences of every element in the unsorted input
grows on the order of O(N). Hence, the design array, where the rank of each element was determined by
provides low and efficient power components with the counting the number of elements less than or equal to the
addition of regularity and scalability as key structure element being considered. Thus, the comparison units were
features, which provide easily and quick miagration to replaced by counting units with bit comparison. However,
embedded micro-controllers and field-programmable the design required a complicated hardware structure with
gate arrays (FPGAs). pipelining and a histogram counting sequence. Alternatively,
3) The sorting delay time is always linearly proportional the design in [45] used a rank matrix that assigned relative
to the number of input data elements N, with upper ranks to the input elements, where the highest element had
and lower bounds of 3N and 2N clock cycles, respec- the maximum rank and the lowest element had the lowest
tively, giving a linear sorting delay time of O(N). rank of 1. The rank matrix was updated based on the value of
This sorting time is independent of the input elements’ a particular bit in each of the N input elements, starting with
ordering or repitition since the design always performs the most-significant bit. This bit-wise inspection required
the same operations within these bounds as opposed to inspecting a complete column of the rank matrix in order
Quicksort and othersorting algorithms, which have large for the lower ranks to update the higher ranks. However, that
and nonlinear margin of bounds. design could not be used when the number of elements was
The remainder of this paper is organized as follows. Section II less than the elements’ bit-width.
summarizes related works and the works’ cost-performance Some recent works [47]–[49] leverage previous works and
bottleneck tendencies. Section III discusses our proposed integrate several different sorting architectures for different

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
1932 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017

requirements, such as speed, area, power. The work in [47]


leveraged a bitonic sorting network to more efficiently map the
methodology considering energy and memory overheads for
FPGA devices. Further advances of that work [48] presented
novel and improved cost-performance tradeoffs, as well as
identification of some Pareto optimal solutions trading off
energy and memory overheads. Additional work [49] devel-
oped a framework that composes basic sorting architectures
to generate a cost-efficient hybrid sorting architecture, which
enabled fast hardware generation customized for heteroge-
neous FPGA/CPU systems.
Even though all of these designs reported linear sorting
delay times as the number of input elements increased,
the authors did not include the initialization times for the
required arrays/matrices, nor was the worst case sorting time Fig. 1. Comparison-free sorting example using four 2-bit input data elements.
evaluated. Furthermore, each design either required arrays
to store the input elements, associated arrays for the rank This example operates as follows. The inputted elements are
operations and data routing, or had to globally merge the inserted into a binary matrix of size N×1, where each element
intermediate sorted array partitions. These array elements is of size k-bit (in this example N = 4 and k = 2 bit).
required a significant amount of local and global input–output Concurrently, the inputted elements are converted to a one-
data routing, SRAM-based memory, and control signals, where hot weight representation and stored into a one-hot matrix
the local control logic communicated with each processing unit of size N × H , where each stored element is of size H -bit
partition and the global control unit. This layout complicates and H =Ngiving a one-hot matrix of size N-bit ×N-bit. The
adapting the design to different input data bit-widths. Addi- one-hot matrix is transposed to a transpose matrix of size
tionally, since the control signals and data path wiring was N × N, which is multiplied by the binary matrix—rather than
intertwined, circuit design bugs were challenging to locate, using comparison operations—to produce the sorted matrix.
in turn leading to high-cost design. For repeated elements in the input set, the one-hot transpose
matrix stores multiple “1s” (equal to the number of occurances
of the repeated element in the input set) in the element’s
III. C OMPARISON -F REE S ORTING A LGORITHM associated row, where each “1” in the row maps to identical
The input to our sorting algorithm is a K -bit binary bus, elements in the binary matrix, an advantage that will be
which enables sorting N = 2 K input data elements. The exploited in the hardware design (Section V). For example,
sorting algorithm operates on the element’s one-hot weight if the input set matrix is [2; 0; 2; 1], then the transpose matrix
representation, which is a unique count weight associated with is [0 0 0 0; 1 0 1 0; 0 0 0 1; 0 1 0 0]. Notice that
each of the N elements. For example, “5” has a binary repre- the second row contains two “1s,” such that when the transpose
sentation of “101,” which has a one-hot weight representation matrix is multiplied by the second row in the binary matrix,
of “1 00 000.” For a complete set of N = 2 K data elements, both “1” occurances in the transpose matrix are mapped to
the one-hot weight representation’s bit-width H is equal to the “2” in the binary matrix. Therefore, the multiply operation
the number of possible unique input elements. For example, can be simply replaced with a mapping function using a
a K = 3-bit input bus can sort/represent N = 8 elements, tri-state buffer (Section V). Additionally, the first row in
so each element’s one-hot weight representation is of size the transpose matrix has no element in the first position
H = 8-bit (i.e., H = N). The binary to one-hot weight (i.e., element 3 is not in the binary matrix since 3 is not in the
representation conversion is a simple transformation using input set). The absence of this element can be recorded using
a conventional one-hot decoder. Using this one-hot weight a counting register for each inputted element (Section V), and
representation method ensures that different elements are this register records the number of occurences of this element
orthogonal with respect to each other when projected into in the binary matrix, which in this case would be “0” for
an R n linear space. element 3.
For brevity of discussion and ease of understanding our For more insight on this algorithm, Fig. 2 shows C-code
sorting method’s mathematical functionality, we illustrate a for a single-threaded implementation on a single CPU, where
small example in Fig. 1, which is based on linear algebra the transpose matrix is used as a vector matrix instead of a
vector computations. This example shows our sorting algo- 2-D matrix such that the indices of the TM N×1 matrix record
rithm’s functionality using four 2-bit input data elements, the counting elements of size N×1. Hence, the initialization
with an initial (random and arbitrary) sequential ordering phase, which is structured in the first loop, requires less
of [2; 0; 3; 1], which generates the outputted elements in the memory access time for the reads and writes in the loop
sorted matrix = [3; 2; 1; 0]. This sorting matrix is in descend- body. The evaluation phase is conducted in the second loop,
ing order; however, the elements can also be represented in and in this phase, the elements are sorted and stored in the
ascending order by having the mapping go from the bottom sorted vector SS N×1 . The elements in the array vector TM N×1
row to the upper row. are read sequentially, and concurrently the elements in the

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
ABDEL-HAFEEZ AND GORDON-ROSS: EFFICIENT O( N ) COMPARISON-FREE SORTING ALGORITHM 1933

Fig. 2. Comparison-free sorting C-code for a single-threaded single CPU.


Fig. 4. Block diagram of the hardware structure for our sorting algorithm.

Thus, if s does not belong to L (i.e., there is no r such


that a(r ) = s), then the sth column of J will contain all “0s.”
If s belongs to L, then the sth column of J will have “1s” in
exactly the locations r where a(r ) = s.
Supposing that L had no repetitions, let
L J = [a(1), . . . ,a(k)]
J = [b(1), . . . , b(m)] (4)
which gives

s, if s ∈ L
b(s) = (5)
0, otherwise.
Fig. 3. Execution comparisons of for our comparison-free sorting design,
Quicksort, merge sort, and radix sort. If s ∈
/ L, then all of the values in the sth column of Cs of J
are “0s,” and b (s) = L · CsT = 0. If s ∈ L, and if r is the
sorted vector SS N×1 are written sequentially, resulting in good unique value for which a(r ) = s, then all of the values in the
spatial locality in the second loop of the C-code. Due to these sth column of Cs of J are all “0” except for the value in the r th
structural designs, initial insight in our simulation results for a column, which is “1.” Therefore, b (s) = L • CsT = a(r ) = s,
single-threaded single CPU, which is shown in Fig. 3, reveal which proves our claim.
the advantages of our proposed algorithm in execution time For example, starting with L = [6, 3, 4], then J = JL would
over other popular sorting algorthms such as Quicksort and be the matrix
other standard sorting algorithms reported in [50] ⎡ ⎤
000001
IV. M ATHEMATICAL A NALYSIS J = ⎣0 0 1 0 0 0⎦ (6)
In this section, we provide the mathematical proof for 000100
our sorting algorithm illustrating the case of N unique input and
elements as a proof of concept. We present this case as the base
case proof for our sorting algorithm since other input element L J = [0, 0, 3, 4, 0, 6]. (7)
set cases (i.e., different numbers of duplicated elements) can Let J∗
be the matrix obtained by deleting the zero columns
be easily derived from this case. from J such that
Let
L J ∗ = [3, 4, 6]. (8)
L = [a(1), . . . , a(k)] (1)
be a given list1 of k positive integers and let V. H ARDWARE F UNCTIONALITY D ETAILS
M = max[a(1), . . . , a(k)]. (2) The overall hardware structure for our sorting algorithm is
divided into two parts: the data path and the control unit.
Let J = JL be the (k x M) matrix whose entries Jr,s are
Fig. 4 depicts the input–output signals of a complete block
defined by
 diagram for our sorting algorithm, which sorts of N = 2 K
1, if a(r ) = s input data elements. The basic design architecture operates in
Jr,s = (3)
0, otherwise. two sequential phases: the write-evaluate phase (Section V-A1)
followed by the read-sort phase (Section V-A2). The control
1 A list is a set in which repetition is allowed. unit (Section V-B), is a simple state machine that controls the

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
1934 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017

Fig. 6. Hardware chart for read-sort phase.

Fig. 5. Hardware flow for the write-evaluate phase.

data path’s phases using only a few D-type flip-flop (DFF)


components. Sorting begins when the START-EXT signal is
asserted and the design signals that sorting has completed by
asserting the FINAL-EXT signal.

A. Data Path Operation


The data path contains several circuit components: a one-hot
decoder, register arrays, a serial shifter, a parallel counter (PC),
tri-state buffers and multiplexors, a one-detector, and an incre-
mentor/decrementor circuit. In order to meet the setup-hold
delay time bewteen the clock and data stabilization for the Fig. 7. Detailed block diagram of our sorting algorithm’s write-evaluate
elements’ storage registers, the delay element’s components phase.
are a cascade of an even number of inverters. These circuit
components are standard CMOS circuit components [51]–[53],
which are commonly used components for advanced
CMOS technologies beyond 90 nm, making our design scal-
able for further advanced low-cost CMOS technologies.
Before proceeding with a more detailed circuit structure of
the write-evaluate and read-sort phases, we present generalized
and overall illustrations for these phases in the flow charts
in Figs. 5 and 6, respectively. The rectangles present the
operations during each clock cycle event, in which two events
occur per clock cycle, one on each cycle edge (i.e., asserted
high and low). The steps within the rectangles show the
sequences of the operations based on the data hardware flow
shown in Figs. 7 and 9, where some operations have the same
number indicating parallelism/independence between these
operations within the clock cycle, meaning that it does not mat-
ter which operation occurs first. Additionally, these flow charts Fig. 8. Timing diagraph for our sorting algorithm’s write-evaluate phase.
adhere to the timing constraints depicted in Figs. 8 and 10,
respectively, where each event occurs at a clock edge. The weight representation by the one-hot decoder. The decoder’s
diamonds are the condition expressions that change the data output enables an associated register in a register array to
flow based on control flow events. record the binary input element’s occurrence. We refer to
1) Write-Evaluate Phase: During the write-evaluate phase, this register as an order register (ORi ) array, where the
each binary input element is converted to the element’s one-hot i th register stores the i th input element. Each register is a

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
ABDEL-HAFEEZ AND GORDON-ROSS: EFFICIENT O( N ) COMPARISON-FREE SORTING ALGORITHM 1935

Fig. 7 for figure clarity). Next, the WRITE-ENA signal is used


to direct the input data to the one-hot decoder, and enable the
clocking source for the order and flag register arrays, which
are actually gated by another AND-gate that comes from the
one-hot decoder.
Following the timing diagram in Fig. 8, the write-evaluate
cycle time requires time for the one-hot decoder (Toh ), time
for the order and flag registers’ access times, (Tor ) and (Tfr ),
respectively, and time for the flag register increment (Tacc ).
The total write-evaluate phase’s cycle time (Twrite−cycle) is
Twrite-cycle = Toh + Tor + Tacc + Tfr . (9)
The delay element’s components have no influence on the
write-evaluate cycle time since these components only change
the duty cycle while preserving the cycle time. All of the
registers (order and flag) are structured in parallel, such that
Fig. 9. Detailed block diagram of our sorting algorithm’s read-sort phase.
the access times to the registers are on the order of fractions
of a nano-second. Additionally, the simple incrementor is less
than a nano-second time scale since the bit-width is only
k-bits. One incrementor is shared for all flag registers since
only one element is input per clock cycle.
A parallel counter in the control unit (Section V-B) controls
the end of the write-evaluate phase when the counter’s value
reaches the maximum number possible inputted elements
(i.e., N = 2k ). Even though the input set may contain
less than the maximum number of elements, assuming that
the input set is full realizes the simplisity of the read-sort
phase’s operation. The control unit asserts the READ-ENA
signal and deasserts the WRITE-ENA signal when the write-
evaluate phase completes, which enables the read-sort phase
on the next clock edge. The write-evaluate phase requires a
fixed N clock cycles since the phase always iterates for the
Fig. 10. Timing diagraph for our sorting algorithm’s read-sort phase. maximum number of potential input elements.
2) Read-Sort Phase: Fig. 9 illustrates a detailed block
simple DFF register of size k-bit. This operation is equivalent diagram of the read-sort phase’s data path, which comprises
to the recording of the element in the transposed matrix in our of a k-bit sorted shift register (SRi ) array of size N that stores
algorithm (Section III). Simultaneously, the one-hot decoder the elements in their final sorted order, and a k-bit PC that
enables an associated register in another register array— indexes into the order register array to process each element in
the flag register (FRi ) array—which records the number of turn. The element ordering, ascending or descending, is user-
occurrences of this element in the input set. For each occur- specified, and can be controlled by either left- or right-shifting
rence of a duplicated element, the associated flag register is in the elements. A one-detector circuit detects if the flag
triggered, and the occurrence is recorded by incrementing register value is “1” or not, and a decrementor circuit subtracts
the register’s stored value using a 10-bit incrementor. This a “1” from the flag register, the result of which is stored back
operation is equivalent to having multiple “1s” for repeated in to the flag register, when processing replicated elements.
elements in a row in the transpose matrix (Section III). In this figure, the write-evaluate phase’s data path components
All input elements follow the same sequential operation at that are used in the read-sort phase are encompassed in the
every rising clock edge. Fig. 7 illustrates a detailed block dashed lines.
diagram of the write-evaluate phase’s data path, which shows The read-sort phase begins after the WRITE-ENA signal
the input bus and all control signals that are fed from the is deasserted and the READ-ENA signal is asserted, which
control unit (Section V-B). Fig. 8 depicts the associated timing sends the PC’s value to the one-hot decoder at each new read-
diagram, which shows the detailed streamlined sequential sort clock cycle. The one-hot decoder converts this counter
timing for the write-evaluate phase. In this diagram, the value to the value’s one-hot representation, which enables the
START-EXT signal indicates the beginning of a new block associated order and flag registers to read/release the registers’
of N = 2K k-bit input elements, which arrive sequentially values, and the order register’s value is stored into the sorted
on each clock cycle. The START-EXT signal consecutively register array if-and-only-if that element’s flag register value
triggers several intermediate signals in the write-evaluate data is greater than “0,” meaning there was at least one occurrence
path’s circuit. First, the reset signal RES is asserted high of that input element. The one-detector evaluates the flag
for one clock cycle to initialize all registers (omitted from register value to control whether or not the element is stored

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
1936 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017

in the sorted register array. If the flag register records a value Fig. 10 shows the timing diagram for the read-sort phase for
equal to or greater than “1,” the associated element should be all three cases, where the circled area shows the clock cycle
stored in the sorted register array a number of times equal operations for case two and three. Case three is assumed to be
to the flag register’s value. The case is simple when the flag the worst case due to the decrementor’s delay, which has more
register value is “1,” which is detected by the one-detector. delay than the one-detector delay (TOD ) as given in case 2.
To avoid complex comparison units (i.e., equal to or greater The additional required logic gates’ delays, such as the XOR
than “1”), detecting values greater than “1” can be easily gate, tri-state buffer, and AND gates, are not included in the
determined using the decrementor’s carry out single. Thus, above delay equations since these gates require only fractions
if the one-detector’s evaluation is false (i.e., “0” is the one- of nano-seconds. Additionally, delay buffer #3 (Fig. 9) has
detector’s decision output), but when decrementing the flag no effect on the read-sort cycle time since this delay element
register’s value, the resulting carry out flag is “0,” this means is only used for maintaining the setup-hold time between the
that the flag register’s value was greater than “1.” In both clock (CLK) and the element being stored in the sorted register
cases, the input element should be stored into the sorted array.
register array. Indexing to the next input element is inhibited Case three represents the worst case, upper bound sorting
by disabling the PC’s increment, which allows the replicated time when the input element set contans N occurances of the
element to be stored in the sorted register array until the flag same element (i.e., one row in the transpose matrix has all
register value reaches “0.” Otherwise, the flag register’s value “1” values, while all other rows have all “0” values). The
is “0,” the element is not in the input set, and thus is not stored corresponding flag register’s value for this element is “N,”
into the sorted register array, and the PC is incremented. while all other flag registers’ values are “0.” Our algorithm
The read-sort cycle time can be divided into three cases requires N− 1 cycles to check all flag register values (i.e., all
based on the flag register’s value. For clarity, these cases will transpose matrix rows), even though all values are “0,” and
be described with references to the example in Fig. 1 and N cycles to output the single replicated element N times into
the discussion of the structure in Section III. In case one, the sorted register array. Therfore, the total number of clock
the flag register’s value is “0” (i.e., the element is not in the cycles are 2N − 1 plus one cycle for reset, resulting in a total
binary matrix), and thus, this element is not stored in the sorted worst case, upper bound of 2N.
register array, and the PC is incremented (i.e., proceed to the The best case, lower bound occurs when all elements in
next row in the transpose matrix). The timing of the read- the input set are distinct (i.e., every transpose matrix row
sort cycle (Tread−cycle ) in case one is the sum of the PC’s contains either a single “1” or no “1s,” case one and case
increment (TPC ), the one-hot decoder’s (TOH ), and the one- two, respectively). During the read-sort phase, each cycle
detector’s (TOD ) delays either stores one element or nothing, respectively, to the
Tread−cycle = TPC + TOH + TOD . (10) sorted register array, which requires N clock cycles to sort
N elements.
We can see that the one-detector and decrementor both operate On average and in most general cases, the input set will
concurrently with the flag register value’s evaluation. contain a mixture of distinct and repeated elements, and the
In case two, the flag register’s value is “1,” meaning that actual sorting time will fall between the upper and lower
the element is in the input set once, and thus this element is bounds. Considering both the write-evaluate and read-sort
read from the order register using the one-hot decoder and a phases, the required number of clock cycles ranges from
tri-state buffer at the register’s output, the element is stored in 2N to 3N to sort the input elements, with the addition of the
the sorted register array, and the PC is incremented. As with one clock cycle for reset and one clock cycle for the control
case one, a flag register value of “0” and “1” both require one switch between the write-evaluate and read-sort phases.
clock cycle. The timing of the read-sort cycle (Tread−cycle ) in
this case is the sum of the PC’s increment (TPC ), the one- B. Control Unit Operation
hot decoder’s (TOH ), the one-detector’s (TOD ), and the sorted The control unit receives input signals from the data path
register array’s (TSR ) delays and outputs the appropriate control signals back to the data
Tread−cycle = TPC + TOH + TOD + TSR . (11) path. The control unit also receives the external and hand-
shaking components’ signals in order to interface with the
In case three, the flag register’s value is greater than “1” external components that are using the sorting hardware, and
(i.e., the element’s corresponding row in the transpose matrix synchronizes the complete sorting operation. There are several
contains more than one “1”). Similar to case two, this element methods for designing the control unit [54], [55], and prior
is stored into the sorted register array, but in this case, the flag work on sorting hardware typically found it sufficient to
register is also decremented. The PC’s increment is disabled present only the data path design and no detail on the control
until the element’s flag register reaches “1,” signaling that all logic [2], [34]–[45]. However, in our work, we present the
occurrences of the element have been stored into the sorted complete control unit design in order to provide a holistic
output array. The timing of the read-sort cycle (Tread−cycle ) in sorting implementation with all signals, which alleviates any
this case is the sum of the PC’s increment (TPC ), the one-hot discrepancy between the control and data path units. Addi-
decoder’s (TOH ), the decrementor’s (TDA ), and the flag register tionally, our inclusion of the control unit’s design shows
array’s (TFR ) delays the simplicity of our sorting hardware, with the control unit
Tread-cycle = TPC + TOH + TDA + TFR . (12) using a small number of gates and is scalable and easily

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
ABDEL-HAFEEZ AND GORDON-ROSS: EFFICIENT O( N ) COMPARISON-FREE SORTING ALGORITHM 1937

to ensure state initialization in the components, regardless of


the underlying technology and fan-out interconnect. Several
reset signals are branched and routed to different components
in order to minimize the effective load on the RES signal.
Additionally, the clock tree is designed in order to balance the
clock edges across the components and preserve the setup-
hold time margins, the details of which have been omitted in
this figure for figure clarity.
All input and output signals are associated with
appropriately-sized drivers to minimize the resistor-capacitive
Fig. 11. Control unit diagram for the write-evaluate unit. load on the input signals, and ensures that the signals propa-
gate quickly enough and at full-swing with an appropriate sig-
nal slew-rate. We refer the reader to [53] for further details on
load balancing and using appropriately-sized drivers. Asserting
the RES signal (after START-EXT is asserted) for one clock
cycle begins initializing the master-slave DFF structure for
further operations. Subsequently, de-asserting the RES signal
triggers asserting the WRITE-ENA signal for the complete
write-evaluate phase. Once the control unit’s PC reaches
the saturated state N = 2 K , all input elements have been
processed, which indicates the end of the write-evaluate phase.
The WRITE-ENA signal is de-asserted and the READ-ENA
signal is asserted on the next CLK edge, as illustrated in the
timing diagram in Fig. 8.
The read-sort phase’s control unit’s circuitry (Fig. 12) is
derived from the read-sort timing diagram (Fig. 10). The
READ-ENA signal is asserted one clock cycle after the WRITE-
ENA is de-asserted. At this point, the data path’s PC is enabled
Fig. 12. Control unit diagram for the read-sort unit. and activates the one-hot decoder, order register array, flag
register array, and one-detector. When the data path’s PC
reconfigurable to different data types and sizes. We note that saturates (i.e., all order and flag register values have been
further area optimization can easily be achieved by reusing evaluated), the data path asserts the FINAL-STATE signal
components for many handshaking controls with the data path that drives the control unit. The control unit deasserts the
unit, however, without loss of generality and for an easier READ-ENA signal and asserts the FINAL-EXT signal indi-
conceptual explanation, we describe the control unit without cating that sorting is complete. The FINAL-STATE signal
shared components. In regards to timing and power, most of indicates that all rows in the transpose matrix have been
the components in the control unit are fast, and respond within scanned and mapped to the sorted array register.
the DFF access time delay. Additionally, most of the DFFs The synchronization of these operations are inherent-by-
are clock-gated with an enable signal to minimize the DFFs’ design using DFFs with a SET and RESET structure, as given
switching activities upon needed, thus reducing the overall in [59]. The complete control unit only requires seven DFFs
circuit’s power consumption. for controlling the continuous sorting of input elements. The
Collectively, Figs. 11 and 12 depict the complete block simplicity of our control unit circuitry design is due to the
diagrams for the control unit. For ease of explanation, the con- continuous forward-flowing data through the data path and
trol unit divides the control logic structure into the write- results in simple timing, which is amenable to efficient circuit
evaluate and read-sort phases’ controls, respectively, however, design structures.
physically the control units share common components, such
as the clock and the reset-initialization block. VI. S IMULATIONS AND R ESULTS
The write-evaluate control circuitry (Fig. 11) is derived Without loss of generality and for comparison purposes,
from the write-evaluate timing diagram (Fig. 8) and receives we implemented, tested, and verified our sorting algorithm
as input the external signals CLOCK-EXT, RES-EXT, and and hardware architecture using a sample system with N =
START-EXT. These signals control the sorting of the input 1024 input data elements, which is similar to many prior hard-
bus elements, such that the data path generates the outputted ware sorting integrated circuits (ICs) [2], [37]–[45], [47]–[49].
sorted elements on the output bus and signals the end of We architected our proposed comparison-free sorting hardware
sorting by asserting the FINAL-EXT signal. The internal reset- at the CMOS transistor level using 90-nm Taiwan Semicon-
initialization block is triggered by the START-EXT signal, ductor Manufacturing Company (TSMC) technology with a 1
which in turn asserts the RES signal for one clock cycle. V power supply [56]. We gathered timing delay values, total
This complete clock cycle ensures that the reset-initialized power consumption, and total transistor counts using HSPICE
components receive the asserted RES signal for long enough simulations [57].

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
1938 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017

TABLE I
C OMPONENT T IME D ELAYS AND T RANSISTOR C OUNTS
A SSUMING 90-nm T ECHNOLOGY

Fig. 13. Transistor counts for the order, flag, and sorted register arrays as
compared number of elements.

Fig. 14. Clock cycle time as compared to bus width.

conservative clock cycle frequency of 500 MHz, and the


total power consumption given the technology factor at this
frequency is 1.6 mW. Sorting 1024 elements requires a total
number of clock cycles ranging from 3 × 1024 = 3076
to 2 × 1024 = 2048, depending on the number of duplicated
input elements, resulting in a total time (for our clock speed
of 500 MHz) of approximately 4–6 μs. Additionally, the total
transistor count is less than 7 50 000 to sort 1024 elements.
Our design alleviates complex components such as memory
and pipelining structures, which are considered in hardware
The one-hot decoder, which converts the 10-bit input bus designs as the bottleneck for performance and power con-
binary representation to the 1024-bit one-hot weight repre- sumption [13]. The only design bottleneck with respect to
sentation, uses a four-input fan-in NAND logic gate with a performance is the one-hot decoder; however, an optimized
five-level hierarchical structure, resulting in a timing delay of version of this component could be used [51], [52]. Since
TOH = 0.688 ns. The order and flag registers are comprised our focus is to architect a holistic circuit design, rather
of ten parallel DFFs, such that the register access time can than optimizing special components and leveraging advanced
be approximated using a single DFF access time of TDFF = CMOS technologies, we consider the integration of these
0.14 ns. Similarly, the tri-state buffer and multiplexer are optimizations as orthagonal to our design.
approximated as the same delay as the DFF access time Fig. 13 shows how the transistor count scales as compared
TTB = TMUX = TDFF . to the number of data elements for the order, flag, and sorted
The one-detector uses a parallel prefix-tree structure of four- register arrays since these structures dominate the transistor
input OR-gates, which take as input 10 bits and activates a count. These results show that our design’s transistor growth
two-level output, resulting in a timing delay of TOD = 0.26 ns. rate is linear, with a small increase in the slope rate of less
The data path’s 10-bit PC is implemented based on state-look than six, giving a linear complexity ratio of O(N) with respect
ahead logic [58], giving a timing delay to the next state of to transistor count.
approximately 0.167 ns. The incrementor/decrementor circuit Fig. 14 shows sorting speed in clock cycle time as compared
takes a 10-bit input bus and add/subtract a “1,” giving a timing to the number of data elements N = 2 K for a k-bit bus. Our
delay of approximately 0.37 ns. results ignore the interconnect parasitic values and the required
Table I summarizes all of the components’ delay times buffering sizes, and focus only on our design’s components’
and associated transistor counts. These results, combined delays. Using the access delay times reported in Table I
with (9)–(12), show that the write-evaluate phase’s clock cycle and (12) for upper bound limits on maximum frequency, and
time is CLKW < 2 ns and the read-sort phase’s clock cycle assume the worst case data distribution (all N elements are
time is CLK R < 2 ns. These timings result in an approximate repeated), Fig. 14 shows a linear complexity of O(N) for

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
ABDEL-HAFEEZ AND GORDON-ROSS: EFFICIENT O( N ) COMPARISON-FREE SORTING ALGORITHM 1939

TABLE II
S ORTING C OMPUTATION T IME FOR AN I NPUT S ET OF 1024 E LEMENTS

TABLE III
C OMPARISON B ETWEEN P RIOR W ORK AND O UR
P ROPOSED S ORTING D ESIGN

Fig. 15. Power consumption as compared to number of data elements.

end-to-end execution time for our sorting design with a small


growth rate less than 1.5. This small rate is due to using basic
registers (flag, order, and sorted registers) that access the bus
in parallel.
The power consumption is relative to the switching activity
and the transistors’ static leakage. To reduce power consump-
tion, our design’s datapath and control units’ components
are gated with enable signals to restrict activity to only the
components operational periods. The write-evaluate and read-
sort phases each activate two register arrays: the order and flag
register arrays, and the flag and sorted register arrays, respec-
tively. Therfore, during the write-evaluate phase, the sorted
register array is shut off, and in the read-sort phase, the order
register array is shut off. All other components operate in both
phases, therefore the phases’ consume approximately equal
power. Fig. 15 shows our design’s power consumption as sorting performance in number of clock cycles. These com-
compared to the number of data elements and assuming a parisons are independent of technology factors in order to
500 MHz running frequncy. The operating frequency limits avoid uncertainty with respect to different technology scale
are evaluation to a maximum of N = 216 data elements, since comparisons and technology simulation environments, which
larger sizes would require slower a slower clock frequency. makes the comparison fair because technology circuit imple-
Our design’s power consumption shows a linear complexity mentations can vary greatly, ranging from different FPGA
of O(N) for a number of data elements less than 216 with a varieties/families to custom application specified integrated
growth rate of about 6.4. circuits using CMOS, NMOS, PMOS, Domino, pass-transistor
Overall, our design shows a linear growth rate O(N) with logic families, and many others [53]. These implementation
respect to total transistor count, end-to-end execution time, and specifics have a large influence on the design performance
power consumption. This is in contrast to other work’s [2], and design cost, which may result in unrealistic or inaccurate
[35], [41], [48] that report a linear complexity of O(N), but conclusions. Therefore, we compare our design with prior
the growth rate is usually in the order of greater than 100. designs with respect to common features for sorting hardware
We also compare our design with data reported in litera- design circuit architectures, such as the number of cycles
ture for related CPU and GPU sorting algorithms [5], [15], with respect to number of input elements, design structure
[19], [20]. Table II reports the execution time for sorting of the data path and control units that leads to scalability
1024 elements using both single- and multicore CPUs and and flexibility for different applications, and finally, the design
GPUs not considering the the front-end memory initialization computation complexity and data movement directions, which
time and the back-end memory merging time; just only the impact the design cost and power factor. These types of
computation time. These results show that our design is comparisons provide a larger evaluation picture considering
even faster than prior algorithms who effectively harness the the huge number of sorting hardware designs.
computing resources, to the best of our knowledge. Table III compares our design with prior hardware sorting
For general purposes, we have compared our sorting design algorithms that have a single computing engine and several
with prior work with respect to hardware complexity and sorting partitions that require merging small sorted partitions

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
1940 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017

TABLE IV
C OMPARISON W ITH R ECENT FPGA S ORTING A LGORITHMS : S PIRAL [47] AND R ESOLVE [48]

to obtain the final sorted output. We evaluated the designs O(N) with respect to the sorting speed, transistor count, and
based on the number of clock cycles required to sort an power consumption. This linear growth is with respect to the
input set of size N. This evaluation illustrates the com- number of elements N for N = 2 K where K is the bit width
plexity scaling of our simple forward data flowing design of the input data. The slope of the linear growth rate is small,
for increasing bit-widths as compared to the prior methods with a growth rate of approximately 6 for the transistor count
that merge the datapath and control units’ functionalities and power consumption, and 1.5 for the sorting speed.
within the parallel computing cells, memory, and comparison The order complexity and growth rates are due to
circuitry, all of which usually dictate the circuit’s design simple basic circuit components that alleviate the need
complexity (number of transistors), runtime complexity (num- for SRAM-based memory and pipelining complexity. Our
ber of cycles to sort N elements), and power. Dividing mathematically-simple algorithm streamlines the sorting oper-
computing cells that integrate the datapath with the control ation in one forward flowing direction rather than using
unit usually requires two operations: element evaluation and compare operations and frequent data movement between the
result updating, which requires repeating evaluation decisions. storage and computational units, as with other sorting algo-
Furthermore, prior rank-based designs required repeated ALU rithms. Our design uses simple standard library components
computations within the SRAM or memory array, which is including registers, a one-hot decoder, a one detector, an incre-
usually characterized as being time consuming. menter/decrementer, and a PC, combined with a simple control
For additional comparison, we evaluate the data reported unit that contains a small amount of delay logic.
in [49], which presents recent work on hardware sorting algo- Our design is at least 6× faster than software parallel
rithms implemened on the Xilinx FPGA xc7vx690tffg1761-2 algorithms that harness powerful computing resources for
using 32-bit fixed point operations and running at a frequency input data set sizes in the small-to-moderate range up to 216 .
of 125 MHz. Table IV shows the overall transistor counts, Additionally, our hardware design’s performance is approxi-
required number of BRAMs, and sorting time in micro- mately 1.5× better as compared to other optimized hardware-
seconds. These compared designs show a linear increase based hybrid sorting designs in terms of transistor count and
in the FF/LUT count with respect to the number of ele- design scalability, number of clock cycles and critical path
ments, however the BRAM requirements do not scale linearly. delay, and power consumption. Thus, our design is suitable
Since memory devices introduce performance bottlenecks, for most IC systems that require sorting algorithms as part of
this results in the non-linear execution time and non-linear their computational operations.
transistor count. Our results show that our comparison-free sorting CMOS
With respect to all evaluated results, our comparison-free hardware can sort N unsigned integer elements from end-to-
sorting design provides an efficient linear scalability of O(N). end with any input data set distribution within 2N to 3N
Our design uses simple registers (flag, order, and sorted clock cycles (lower and upper bounds, respectively) at a clock
registers) that are accessed on both the rising and falling frequency of 0.5 GHz using a 90-nm TSMC technology with
clock edges, and simple standard CMOS components with a 1 V power supply and a power consumption of 1.6 mW for
a forward flowing data movement architecture. Even though N = 1024 elements.
our design shows a linear performance cost of O(N), our Future work includes leveraging our sorting algorithm for
hardware design is recommended for data element set sizes of commercial parallel processing computing power, such as
less than 216 due to practical integration into large computing GPUs and parallel processing machines, in order to further
IC devices (e.g., graphics engines, routers, grid controllers.), improve large-scale sorting, and thus, further enhance embed-
where the sorting hardware accounts for no more than 10% of ded sorting for big data applications.
the IC’s characteristics (power and area).
R EFERENCES
VII. C ONCLUSION
[1] D. E. Knuth, The Art of Computer Programming. Reading, MA, USA:
In this paper, we proposed a novel mathematical Addison-Wesley, Mar. 2011.
comparison-free sorting algorithm and associated hardware [2] Y. Bang and S. Q. Zheng, “A simple and efficient VLSI sorting
architecture,” in Proc. 37th Midwest Symp. Circuits Syst., vol. 1. 1994,
implementation. Our sorting design exhibits linear complexity pp. 70–73.

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
ABDEL-HAFEEZ AND GORDON-ROSS: EFFICIENT O( N ) COMPARISON-FREE SORTING ALGORITHM 1941

[3] T. Leighton, Y. Ma, and C. G. Plaxton, “Breaking the(n log2 n) barrier [29] J.-T. Yan, “An improved optimal algorithm for bubble-sorting-based non-
for sorting with faults,” J. Comput. Syst. Sci., vol. 54, no. 2, pp. 265–304, Manhattan channel routing,” IEEE Trans. Comput.-Aided Des. Integr.
1997. Circuits Syst., vol. 18, no. 2, pp. 163–171, Feb. 1999.
[4] Y. Han, “Deterministic sorting in O(n log log n) time and linear space,” [30] L. Skliarova, D. Mihhailov, V. Sklyarov, and A. Sudnitson, “Implemen-
J. Algorithms, vol. 50, no. 1, pp. 96–105, 2004. tation of sorting algorithms in reconfigurable hardware,” in Proc. 16th
[5] C. Canaan, M. S. Garai, and M. Daya, “Popular sorting algorithms,” IEEE Medit. Electrotech. Conf. (MELECON), Mar. 2012, pp. 107–110.
World Appl. Programm., vol. 1, no. 1, pp. 62–71, Apr. 2011. [31] N. Tabrizi and N. Bagherzadeh, “An ASIC design of a novel pipelined
[6] L. M. Busse, M. H. Chehreghani, and J. M. Buhmann, “The infor- and parallel sorting accelerator for a multiprocessor-on-a-chip,” in Proc.
mation content in sorting algorithms,” in Proc. IEEE Int. Symp. Inf. IEEE 6th Int. Conf. ASIC (ASICON), Oct. 2005, pp. 46–49.
Theory (ISIT), Jul. 2012, pp. 2746–2750. [32] H. Schröder, “VLSI-sorting evaluated under the linear model,” J. Com-
[7] R. Zhang, X. Wei, and T. Watanabe, “A sorting-based IO connec- plex., vol. 4, no. 4, pp. 330–355, Dec. 1988.
tion assignment for flip-chip designs,” in Proc. IEEE 10th Int. Conf. [33] H.-S. Yu, J.-Y. Lee, and J.-D. Cho, “A fast VLSI implementation of
ASIC (ASICON), Oct. 2013, pp. 1–4. sorting algorithm for standard median filters,” in Proc. 12th Annu. IEEE
[8] D. Fuguo, “Several incomplete sort algorithms for getting the median Int. ASIC/SOC Conf., Sep. 1999, pp. 387–390.
value,” Int. J. Digital Content Technol. Appl., vol. 4, no. 8, pp. 193–198, [34] G. Campobello and M. Russo, “A scalable VLSI speed/area tunable
Nov. 2010. sorting network,” J. Syst. Archit., vol. 52, no. 10, pp. 589–602, Oct. 2006.
[35] W. Zhou, Z. Cai, R. Ding, C. Gong, and D. Liu, “Efficient sorting
[9] W. Jianping, Y. Yutang, L. Lin, H. Bingquan, and G. Tao, “High-
design on a novel embedded parallel computing architecture with unique
speed FPGA-based SOPC application for currency sorting system,” in
memory access,” Comput. Elect. Eng., vol. 39, no. 7, pp. 2100–2111,
Proc. 10th Int. Conf. Electron. Meas. Instrum. (ICEMI), Aug. 2011,
Oct. 2013.
pp. 85–89.
[36] V. Sklyarov, “FPGA-based implementation of recursive algorithms,”
[10] R. Meolic, “Demonstration of sorting algorithms on mobile platforms,” Microprocess. Microsyst., vol. 28, nos. 5–6, pp. 197–211, Aug. 2004.
in Proc. CSEDU, 2013, pp. 136–141. [37] R. Lin and S. Olariu, “Efficient VLSI architectures for Columnsort,”
[11] F.-C. Leu, Y.-T. Tsai, and C. Y. Tang, “An efficient external sorting IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 1,
algorithm,” Inf. Process. Lett., vol. 75, pp. 159–163, Sep. 2000. pp. 135–138, Mar. 1999.
[12] J. L. Bentley and R. Sedgewick, “Fast algorithms for sorting and [38] S. W. Moore and B. T. Graham, “Tagged up/down sorter—A hardware
searching strings,” in Proc. 8th Annu. ACM-SIAM Symp. Discrete priority queue,” Comput. J., vol. 38, no. 9, pp. 695–703, Sep. 1995.
Algorithms (SODA), Jan. 1997, pp. 360–369. [39] G. V. Russo and M. Russo, “A novel class of sorting networks,”
[13] L. Xiao, X. Zhang, and S. A. Kubricht, “Improving memory perfor- IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 43, no. 7,
mance of sorting algorithms,” J. Experim. Algorithmic, vol. 5, no. 3, pp. 544–552, Jul. 1996.
pp. 1–20, 2000. [40] S. Dong, X. Wang, and X. Wang, “A novel high-speed parallel scheme
[14] P. Sareen, “Comparison of sorting algorithms (on the basis of average for data sorting algorithm based on FPGA,” in Proc. IEEE 2nd Int.
case),” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 3, no. 3, Congr. Image Signal Process. (CISP), Oct. 2009, pp. 1–4.
pp. 522–532, Mar. 2013. [41] A. Széll and B. Fehér, “Efficient sorting architectures in FPGA,” in Proc.
[15] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “AA-SORT: Int. Carpathian Control Conf. (ICCC), May 2006, pp. 1–4.
A new parallel sorting algorithm for multi-core SIMD processors,” in [42] A. A. Colavita, A. Cicuttin, F. Fratnik, and G. Capello, “SORTCHIP:
Proc. 16th Int. Conf. Parallel Archit. Compil. Techn. (PACT), 2007, A VLSI implementation of a hardware algorithm for continuous data
pp. 189–198. sorting,” IEEE J. Solid-State Circuits, vol. 38, no. 6, pp. 1076–1079,
[16] V. Kundeti and S. Rajasekaran, “Efficient out-of-core sorting algorithms Jun. 2003.
for the parallel disks model,” J. Parallel Distrib. Comput., vol. 71, no. 11, [43] T. Demirci, I. Hatirnaz, and Y. Leblebici, “Full-custom CMOS realiza-
pp. 1427–1433, 2011. tion of a high-performance binary sorting engine with linear area-time
[17] G. Capannini, F. Silvestri, and R. Baraglia, “Sorting on GPUs for large complexity,” in Proc. Int. Symp. Circuits Syst. (ISCAS), vol. 5. May 2003,
scale datasets: A thorough comparison,” Int. Process. Manage., vol. 48, pp. V453–V456.
no. 5, pp. 903–917, 2012. [44] K. Ratnayake and A. Amer, “An FPGA architecture of stable-sorting on
[18] D. Cederman and P. Tsigas, “GPU-Quicksort: A practical quicksort algo- a large data volume: Application to video signals,” in Proc. 41st Annu.
rithm for graphics processors,” ACM J. Experim. Algorithmics (JEA), Conf. Inf. Sci. Syst., Mar. 2007, pp. 431–436.
vol. 14, Dec. 2009, Art. no. 4. [45] S. Alaparthi, K. Gulati, and S. P. Khatri, “Sorting binary numbers in
[19] B. Jan, B. Montrucchio, C. Ragusa, F. G. Ghan, and O. Khan, “Fast hardware—A novel algorithm and its implementation,” in Proc. IEEE
parallel sorting algorithms on GPUs,” Int. J. Distrib. Parallel Syst., Int. Symp. Circuits Syst. (ISCAS), May 2009, pp. 2225–2228.
vol. 3, no. 6, pp. 107–118, Nov. 2012. [46] J. F. Hughes et al., Computer Graphics: Principles and Practice, 3rd ed.
[20] N. Satish, M. Harris, and M. Garland, “Designing efficient sorting Reading, MA, USA: Addison-Wesley, 2014.
algorithms for manycore GPUs,” in Proc. 23rd IEEE Int. Symp. Parallel [47] R. Chen, S. Siriyal, and V. Prasanna, “Energy and memory efficient
Distrib. Process., May 2009, pp. 1–10. mapping of bitonic sorting on FPGA,” in Proc. ACM/SIGDA Int.
[21] C. Bunse, H. Höpfner, S. Roychoudhury, and E. Mansour, “Choosing Symp. Field Program. Gate (FPGA), Monterey, CA, USA, Feb. 2015,
the ‘best’ sorting algorithm from optimal energy consumption,” in Proc. pp. 240–249.
ICSOFT, vol. 2. 2009, pp. 199–206. [48] M. Zuluaga, P. Milder, and M. Püschel, “Streaming sorting networks,”
ACM Trans. Design Autom. Electron. Syst., vol. 21, no. 4, May 2016,
[22] A. D. Mishra and D. Garg, “Selection of best sorting algorithm,” Int.
Art. no. 55.
J. Intell. Inf. Process., vol. 2, no. 2, pp. 363–368, Jul./Dec. 2008.
[49] J. Matai et al., “Resolve: Generation of high-performance sorting
[23] T.-C. Lin, C.-C. Kuo, Y.-H. Hsieh, and B.-F. Wang, “Efficient algorithms architectures from high-level synthesis,” in Proc. ACM/SIGDA Int.
for the inverse sorting problem with bound constraints under the Symp. Field Program. Gate (FPGA), Monterey, CA, USA, Feb. 2016,
l∞-norm and the Hamming distance,” J. Comput. Syst. Sci., vol. 75, pp. 195–204.
no. 8, pp. 451–464, 2009. [50] Sorting Algorithms Animations, accessed on 2017.
[24] F. Henglein, “What is a sorting function?” J. Logic Algebraic Pro- [Online]. Available: https://www.toptal.com/developers/sorting-
gramm., vol. 78, no. 7, pp. 552–572, Aug./Sep. 2009. algorithms
[25] E. Mumolo, G. Capello, and M. Nolich, “VHDL design of a scalable [51] (2010). Cadence Online Documentation. [Online]. Available: http://
VLSI sorting device based on pipelined computation,” J. Comput. Inf. www.cadence.com
Technol., vol. 12, no. 1, pp. 1–14, 2004. [52] (2015). Synopsys Online Documentation. [Online]. [Online]. Available:
[26] E. Herruzo, G. Ruiz, J. I. Benavides, and O. Plata, “A new paral- http://www.synopsys.com
lel sorting algorithm based on odd-even mergesort,” in Proc. 15th [53] J. P. Uyemura, CMOS Logic Circuit Design. Norwell, MA, USA:
EUROMICRO Int. Conf. Parallel, Distrib. Netw.-Based Process. (PDP), Kluwer, 1999.
Feb. 2007, pp. 18–22. [54] J. P. Hayes, Computer Architecture and Organization, 2rd ed. New York,
[27] M. Thorup, “Randomized sorting in O(n log log n) time and linear space NY, USA: McGraw-Hill, 1994.
using addition, shift, and bit-wise Boolean operations,” J. Algorithms, [55] S. Lee, Advanced Digital Logic Design Using VHDL, State Machines,
vol. 42, no. 2, pp. 205–230, Feb. 2002. and Synthesis for FPGA’s. Luton, U.K.: Thomson Holidays, 2006.
[28] M. Afghahi, “A 512 16-b bit-serial sorter chip,” IEEE J. Solid-State [56] Taiwan Semiconductor Manufacturing Corporation. 90 nm CMOS ASIC
Circuits, vol. 26, no. 10, pp. 1452–1457, Oct. 1991. Process Digests, 2005.

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
1942 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017

[57] Synopsys. (2010). HSPICE. [Online]. Available: http://www.synopsys. Ann Gordon-Ross (M’00) received the B.S. and
com Ph.D. degrees in computer science and engineering
[58] S. Abdel-Hafeez and A. Gordon-Ross, “A gigahertz digital CMOS from the University of California, Riverside, CA,
divide-by-N frequency divider based on a state look-ahead structure,” USA, in 2000 and 2007, respectively.
J. Circuits, Syst. Signal Process., vol. 30, no. 6, pp. 1549–1572, 2011. She is currently an Associate Professor of Electri-
[59] V. Stojanovic and V. G. Oklobdzija, “Comparative analysis of master- cal and Computer Engineering with the University
slave latches and flip-flops for high-performance and low-power sys- of Florida, Gainesville, FL, USA, where she is a
tems,” IEEE J. Solid-State Circuits, vol. 34, no. 4, pp. 536–548, member of the NSF Center for High Performance
Apr. 1999. Reconfigurable Computing (CHREC). She is very
active in promoting diversity in STEM fields. Her
current research interests include embedded systems,
Saleh Abdel-Hafeez (M’01) received the B.S.E.E.,
M.S.E.E., and Ph.D. degrees in computer engineer- computer architecture, low-power design, reconfigurable computing, dynamic
ing from the USA with a specialization of very large optimizations, hardware design, real-time systems, and multicore platforms.
scale integration (VLSI) design. Dr. Gordon-Ross is the Faculty Advisor for the Women in Electrical
In 1997, he joined S3 Inc., Huntsville, AL, USA, and Computer Engineering and the Phi Sigma Rho National Society for
as a member of their technical staff, where he Women in Engineering and Engineering Technology, and she is an active
was involved in the IC circuit design related to member of the Women in Engineering ProActive Network. She received the
CAREER award from the National Science Foundation in 2010, the Best
cache memory, digital I/O, and ADCs. He was
the Chairman of Computer Engineering Department. Paper Awards at the Great Lakes Symposium on VLSI in 2010 and the
He is currently an Associate Professor with the IARIA International Conference on Mobile Ubiquitous Computing, Systems,
College of Computer and Information Technology, Services and Technologies in 2010, and the Best Ph.D. Forum Award at the
IEEE Computer Society Annual Symposium on VLSI in 2014. She has been a
Jordan University of Science and Technology, Irbid, Jordan. He holds three
patents (6, 265, 509; 6, 356, 509; 20040211982A1) in the field of IC design. Guest Speaker and has organized several international workshops/conferences
His current research interests include circuits and architectures for low-power on this topic, and participates in outreach programs at local K-12 schools.
and high-performance VLSI.

Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.

You might also like