An Efficient O(N) Comparison-Free Sorting Algorithm
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017
Abstract: In this paper, we propose a novel sorting algorithm that sorts input integer data elements on-the-fly without any comparison operations between the data (comparison-free sorting). We present a complete hardware structure, associated timing diagrams, and a formal mathematical proof, which show an overall sorting time, in terms of clock cycles, that is linearly proportional to the number of inputs, giving a speed complexity on the order of O(N). Our hardware-based sorting algorithm precludes the need for SRAM-based memory or complex circuitry, such as pipelining structures, but rather uses simple registers to hold the binary elements and the elements’ associated number of occurrences in the input set, and uses matrix-mapping operations to perform the sorting process. Thus, the total transistor count complexity is on the order of O(N). We evaluate an application-specific integrated circuit design of our sorting algorithm for a sample sorting of N = 1024 elements of size K = 10-bit using 90-nm Taiwan Semiconductor Manufacturing Company (TSMC) technology with a 1 V power supply. Results verify that our sorting requires approximately 4–6 µs to sort the 1024 elements at a clock frequency of 0.5 GHz, consumes 1.6 mW of power, and has a total transistor count of less than 750 000.

Index Terms: 90-nm TSMC, comparison free, Gigahertz clock cycle, one-hot weight representation, sorting algorithms, SRAM, speed complexity O(N).

Manuscript received July 6, 2016; revised October 22, 2016 and January 16, 2017; accepted January 16, 2017. Date of publication February 22, 2017; date of current version May 22, 2017. This work was supported in part by the National Science Foundation under Grant CNS-0953447 and in part by Nvidia and Synopsys. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
S. Abdel-Hafeez is with Jordan University of Science and Technology, Irbid 22110, Jordan (e-mail: [email protected]).
A. Gordon-Ross is with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611 USA and also with the National Science Foundation Center for High-Performance Reconfigurable Computing, University of Florida, Gainesville, FL 32611 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2017.2661746
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION, MOTIVATION, AND RELATED WORK

Sorting algorithms have been widely researched for decades [1]–[6] due to the ubiquitous need for sorting in many application domains [7]–[10]. Sorting algorithms have been specialized for particular sorting requirements/situations, such as large computations for processing data [11], high-speed sorting [12], improving memory performance [13], sorting using a single CPU [14], exploiting the parallelism of multiple CPUs [15], and parallel processing for grid computing in order to leverage the CPU’s powerful computing resources for big data processing [16].

Due to the ever-increasing computational power of parallel processing on many-core CPU- and GPU-based processing systems, much research has focused on harnessing the computational power of these resources for efficient sorting [17]–[20]. However, since not all computing domains and sorting applications can leverage the high throughput of these systems, there is still a great need for novel and transformative sorting methods. Additionally, there is no clearly dominant sorting algorithm due to many factors [21]–[24], including the algorithm’s percentage utilization of the available CPU/GPU resources, the specific data type being sorted, and the amount of data being sorted.

To address these challenges, much research has focused on architecting customized hardware designs for sorting algorithms in order to fully utilize the hardware resources and provide custom, cost-effective hardware processing [2]–[27]. However, due to the inherent complexity of the sorting algorithms, efficient hardware implementation is challenging. To realize fast and power-efficient hardware sorting, a significant amount of hardware resources is required, including, but not limited to, comparators, memory elements, large global memories, and complex pipelining, in addition to complicated local and global control units.

Most prior hardware sorting designs are implemented based on some modification of traditional mathematical algorithms [28]–[31], or are based on some modified network of switching structures [32]–[34] with partially parallel computing processing and pipelining stages. In these sorting architectures, comparison units are essential components that are characterized by high power consumption and feedback control logic delays. These sorting methods iteratively move data between comparison units and local memories, requiring wide, high-speed data buses, involving numerous shift, swap, comparison, and store/fetch operations, and have complicated control logic, all of which do not scale well and may need specialization for certain data-type particulars. Due to the inherent mixture of data processing and control logic within the sorting structure’s processing elements, designing these structures can be cumbersome, imposing large design costs in terms of area, power, and processing time. Furthermore, these structures are not inherently scalable due to the complexity of integrating and combining the data path and control logic within the processing units, thus potentially requiring a full redesign for different data sizes, as well as complex connective wiring with high fan-out and fan-in in addition to coupling effects, which makes circuit timing issues challenging to address. Additionally, if multiple processors are used along with pipelining stages
Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 07:57:06 UTC from IEEE Xplore. Restrictions apply.
ABDEL-HAFEEZ AND GORDON-ROSS: EFFICIENT O( N ) COMPARISON-FREE SORTING ALGORITHM 1931
and global memories, the data must be globally merged from these stages to output the complete final sorted data set [35], [36].

To address these challenges, in this paper, we propose a new sorting algorithm targeted for custom, IC-designed applications that sort small- to moderate-sized input sets, such as graphics accelerators, network routers, and video processing DSP chips [12], [33], [44], [46]. For example, graphics processing uses a painter unit that renders objects according to each object’s depth value such that objects are displayed in the correct order on the screen. In video processing, fast computation is required for small matrices in a frame in order to increase the resolution using digital filters that leverage sorting algorithms. Even though we present our design based on these scenarios, our design also supports processing large input sets by processing the data as multiple, smaller input sets (i.e., in sets of N < 100 000) using fast computations, and then merging these sets. However, since applications with larger input sets (on the order of millions) are usually embedded into systems with large computational resources, such as data mining and database visualization applications running on high-performance grid computing and GPU accelerators [17]–[20], these applications can harness those powerful resources for sorting.

Our sorting algorithm’s main features and contributions are as follows.
1) Our design affords continuous sorting of input element sets, where each set can hold any type and distribution (ordering) of data elements. Sorting is triggered with a start-sort signal and sorting ends when a done-sorting signal is asserted by the design, which subsequently begins sorting the next input set, thus affording continuous, end-to-end sorting.
2) Our sorting design does not require any ALU-comparisons/shifting-swapping, complex circuitry, or SRAM-based memory, and processes data in a forward moving direction through the circuit. Our design’s simplicity results in a highly linearized sorting method with a CMOS transistor count that grows on the order of O(N). Hence, the design offers low power consumption, with regularity and scalability as key structural features, which enable easy and quick migration to embedded micro-controllers and field-programmable gate arrays (FPGAs).
3) The sorting delay time is always linearly proportional to the number of input data elements N, with upper and lower bounds of 3N and 2N clock cycles, respectively, giving a linear sorting delay time of O(N). This sorting time is independent of the input elements’ ordering or repetition since the design always performs the same operations within these bounds, as opposed to Quicksort and other sorting algorithms, which have large, nonlinear margins between their bounds.

The remainder of this paper is organized as follows. Section II summarizes related works and the works’ cost-performance bottleneck tendencies. Section III discusses our proposed comparison-free sorting algorithm with illustrative examples and Section IV provides a mathematical analysis. Section V details the hardware data path and control logic implementations along with timing diagrams. Section VI presents our simulation results, and Section VII discusses our conclusions, which elaborate on the overall results and our design’s hardware advantages.

II. RELATED WORK

In order to provide high scalability, it is critical to design a sorting method with timing and circuit complexity that scales linearly with the number of input elements N [i.e., the circuit timing delay and circuit complexity are on the complexity order of O(N)]. Although some recent works showed linear scalability, these works’ O(N) notations hide a large scalar value [4], [27], [32], [34] and these methods have expensive circuit complexity with respect to multiprocessing, local and global memories, pipelining, and control units with special instruction sets, in addition to high-cost technology power factors.

Other recent works [2], [25], [37]–[42] divide the sorting algorithm design into smaller computation partitions, where each partition integrates control logic and the partition’s comparison operations with feedback decisions from neighboring partitions. A global control unit coordinates this control to streamline the data flow between the partitions and the partitions’ associated memories, which store temporary data that is transferred between partitions. In addition to the complex circuitry required to maintain inter-partition connectivity and redundant intra-partition control circuitry, a complex global memory organization is required.

Alternative methods [43]–[45] attempt to eliminate comparators by introducing a rank (sorted) ordering approach. In [43], a bit-serial sorter architecture was implemented based on a rank-order filter (ROF), but comparators were still used to transform the programmable capacitive threshold logic (CTL) to a majority voting decision. That design used large array cells of ROF and CTL decisions with a pipelined architecture. The design in [44] counted the number of occurrences of every element in the unsorted input array, where the rank of each element was determined by counting the number of elements less than or equal to the element being considered. Thus, the comparison units were replaced by counting units with bit comparison. However, the design required a complicated hardware structure with pipelining and a histogram counting sequence. Alternatively, the design in [45] used a rank matrix that assigned relative ranks to the input elements, where the highest element had the maximum rank and the lowest element had the lowest rank of 1. The rank matrix was updated based on the value of a particular bit in each of the N input elements, starting with the most-significant bit. This bit-wise inspection required inspecting a complete column of the rank matrix in order for the lower ranks to update the higher ranks. However, that design could not be used when the number of elements was less than the elements’ bit-width.

Some recent works [47]–[49] leverage previous works and integrate several different sorting architectures for different
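As a software illustration of the counting-based ranking idea attributed to [44] in the related work above (rank determined by the number of elements less than or equal to a given element), the following minimal sketch may help. It is not the hardware design of [44]; the function name `rank_by_counting` and the `bit_width` parameter are hypothetical names chosen for this example, and the inputs are assumed to be unsigned integers that fit in `bit_width` bits.

```python
def rank_by_counting(values, bit_width):
    """Sort unsigned integers by counting, without comparing elements to each other.

    Illustrative sketch only: a histogram is built, turned into cumulative
    counts (number of elements <= v), and each element is placed by its rank.
    """
    size = 1 << bit_width              # number of distinct representable values
    histogram = [0] * size
    for v in values:
        histogram[v] += 1              # occurrences of each value

    # cumulative[v] = number of input elements less than or equal to v
    cumulative = []
    running = 0
    for count in histogram:
        running += count
        cumulative.append(running)

    output = [None] * len(values)
    # Walking the input backwards keeps equal elements in input order (stable).
    for v in reversed(values):
        cumulative[v] -= 1             # rank of this occurrence, zero-based
        output[cumulative[v]] = v
    return output
```

For instance, `rank_by_counting([3, 1, 2, 3, 0], 2)` places each element directly at the position given by its rank, so no element is ever compared against another element.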
1932 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 6, JUNE 2017
in the sorted register array. If the flag register records a value equal to or greater than “1,” the associated element should be stored in the sorted register array a number of times equal to the flag register’s value. The case is simple when the flag register value is “1,” which is detected by the one-detector. To avoid complex comparison units (i.e., equal to or greater than “1”), values greater than “1” can be easily detected using the decrementor’s carry-out signal. Thus, if the one-detector’s evaluation is false (i.e., “0” is the one-detector’s decision output), but when decrementing the flag register’s value, the resulting carry-out flag is “0,” this means that the flag register’s value was greater than “1.” In both cases, the input element should be stored into the sorted register array. Indexing to the next input element is inhibited by disabling the PC’s increment, which allows the replicated element to be stored in the sorted register array until the flag register value reaches “0.” Otherwise, the flag register’s value is “0,” the element is not in the input set, and thus is not stored into the sorted register array, and the PC is incremented.

The read-sort cycle time can be divided into three cases based on the flag register’s value. For clarity, these cases will be described with references to the example in Fig. 1 and the discussion of the structure in Section III. In case one, the flag register’s value is “0” (i.e., the element is not in the binary matrix), and thus, this element is not stored in the sorted register array, and the PC is incremented (i.e., proceed to the next row in the transpose matrix). The timing of the read-sort cycle ($T_{\text{read-cycle}}$) in case one is the sum of the PC’s increment ($T_{PC}$), the one-hot decoder’s ($T_{OH}$), and the one-detector’s ($T_{OD}$) delays

$$T_{\text{read-cycle}} = T_{PC} + T_{OH} + T_{OD}. \tag{10}$$

We can see that the one-detector and decrementor both operate concurrently with the flag register value’s evaluation.

In case two, the flag register’s value is “1,” meaning that the element is in the input set once, and thus this element is read from the order register using the one-hot decoder and a tri-state buffer at the register’s output, the element is stored in the sorted register array, and the PC is incremented. Flag register values of “0” and “1” (cases one and two) both require one clock cycle. The timing of the read-sort cycle ($T_{\text{read-cycle}}$) in this case is the sum of the PC’s increment ($T_{PC}$), the one-hot decoder’s ($T_{OH}$), the one-detector’s ($T_{OD}$), and the sorted register array’s ($T_{SR}$) delays

$$T_{\text{read-cycle}} = T_{PC} + T_{OH} + T_{OD} + T_{SR}. \tag{11}$$

In case three, the flag register’s value is greater than “1” (i.e., the element’s corresponding row in the transpose matrix contains more than one “1”). Similar to case two, this element is stored into the sorted register array, but in this case, the flag register is also decremented. The PC’s increment is disabled until the element’s flag register reaches “1,” signaling that all occurrences of the element have been stored into the sorted output array. The timing of the read-sort cycle ($T_{\text{read-cycle}}$) in this case is the sum of the PC’s increment ($T_{PC}$), the one-hot decoder’s ($T_{OH}$), the decrementor’s ($T_{DA}$), and the flag register array’s ($T_{FR}$) delays

$$T_{\text{read-cycle}} = T_{PC} + T_{OH} + T_{DA} + T_{FR}. \tag{12}$$

Fig. 10 shows the timing diagram for the read-sort phase for all three cases, where the circled area shows the clock cycle operations for cases two and three. Case three is assumed to be the worst case due to the decrementor’s delay, which is longer than the one-detector delay ($T_{OD}$) in case two. The additional required logic gates’ delays, such as the XOR gate, tri-state buffer, and AND gates, are not included in the above delay equations since these gates require only fractions of nanoseconds. Additionally, delay buffer #3 (Fig. 9) has no effect on the read-sort cycle time since this delay element is only used for maintaining the setup-hold time between the clock (CLK) and the element being stored in the sorted register array.

Case three represents the worst case, upper bound sorting time when the input element set contains N occurrences of the same element (i.e., one row in the transpose matrix has all “1” values, while all other rows have all “0” values). The corresponding flag register’s value for this element is “N,” while all other flag registers’ values are “0.” Our algorithm requires N − 1 cycles to check all other flag register values (i.e., all other transpose matrix rows), even though all of these values are “0,” and N cycles to output the single replicated element N times into the sorted register array. Therefore, the total number of clock cycles is 2N − 1 plus one cycle for reset, resulting in a total worst case, upper bound of 2N.

The best case, lower bound occurs when all elements in the input set are distinct (i.e., every transpose matrix row contains either a single “1” or no “1s,” case two and case one, respectively). During the read-sort phase, each cycle either stores one element or nothing, respectively, to the sorted register array, which requires N clock cycles to sort N elements.

On average and in most general cases, the input set will contain a mixture of distinct and repeated elements, and the actual sorting time will fall between the upper and lower bounds. Considering both the write-evaluate and read-sort phases, the required number of clock cycles ranges from 2N to 3N to sort the input elements, with the addition of one clock cycle for reset and one clock cycle for the control switch between the write-evaluate and read-sort phases.

B. Control Unit Operation

The control unit receives input signals from the data path and outputs the appropriate control signals back to the data path. The control unit also receives the external and handshaking components’ signals in order to interface with the external components that are using the sorting hardware, and synchronizes the complete sorting operation. There are several methods for designing the control unit [54], [55], and prior work on sorting hardware typically found it sufficient to present only the data path design and no detail on the control logic [2], [34]–[45]. However, in our work, we present the complete control unit design in order to provide a holistic sorting implementation with all signals, which alleviates any discrepancy between the control and data path units. Additionally, our inclusion of the control unit’s design shows the simplicity of our sorting hardware, with the control unit using a small number of gates and is scalable and easily
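The three read-sort cases and the resulting cycle bounds can be checked with a small behavioral model. The sketch below is a simplified software illustration under stated assumptions, not the authors' hardware: it assumes one flag register per possible K-bit value (so the number of rows is 2^K), reconstructs each sorted element from its row index instead of modeling the order register, and does not count the reset or phase-switch cycles. The names `comparison_free_sort` and `cycles_to_microseconds` are hypothetical.

```python
def comparison_free_sort(elements, bit_width):
    """Behavioral sketch of the write-evaluate and read-sort phases.

    Returns (sorted_elements, write_cycles, read_cycles). Elements are never
    compared with each other; only a flag value is tested against 0 and 1.
    """
    rows = 1 << bit_width                  # assumed: one flag register per value
    flags = [0] * rows

    # Write-evaluate phase: one cycle per input element.
    write_cycles = 0
    for value in elements:
        if not 0 <= value < rows:
            raise ValueError("element does not fit in bit_width bits")
        flags[value] += 1                  # the value selects its own row
        write_cycles += 1

    # Read-sort phase: the PC walks the rows, one decision per cycle.
    sorted_out = []
    read_cycles = 0
    pc = 0
    while pc < rows:
        read_cycles += 1
        if flags[pc] == 0:                 # case one: absent, advance the PC
            pc += 1
        elif flags[pc] == 1:               # case two: store once, advance the PC
            sorted_out.append(pc)
            pc += 1
        else:                              # case three: store, decrement, hold the PC
            sorted_out.append(pc)
            flags[pc] -= 1
    return sorted_out, write_cycles, read_cycles


def cycles_to_microseconds(cycles, clock_hz=0.5e9):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_hz * 1e6
```

With N = 1024 and K = 10, this model gives 2N − 1 = 2047 read-sort cycles when all elements are identical and N = 1024 read-sort cycles when all elements are distinct, in line with the bounds derived above. At a 0.5 GHz clock (2 ns per cycle), the overall 2N and 3N totals correspond to roughly 4.1 µs and 6.1 µs.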
TABLE I
COMPONENT TIME DELAYS AND TRANSISTOR COUNTS ASSUMING 90-nm TECHNOLOGY
Fig. 13. Transistor counts for the order, flag, and sorted register arrays versus the number of elements.
TABLE II
SORTING COMPUTATION TIME FOR AN INPUT SET OF 1024 ELEMENTS
TABLE III
COMPARISON BETWEEN PRIOR WORK AND OUR PROPOSED SORTING DESIGN
TABLE IV
COMPARISON WITH RECENT FPGA SORTING ALGORITHMS: SPIRAL [47] AND RESOLVE [48]
to obtain the final sorted output. We evaluated the designs based on the number of clock cycles required to sort an input set of size N. This evaluation illustrates the complexity scaling of our simple forward data flowing design for increasing bit-widths as compared to the prior methods that merge the datapath and control units’ functionalities within the parallel computing cells, memory, and comparison circuitry, all of which usually dictate the circuit’s design complexity (number of transistors), runtime complexity (number of cycles to sort N elements), and power. Dividing the design into computing cells that integrate the datapath with the control unit usually requires two operations: element evaluation and result updating, which requires repeating evaluation decisions. Furthermore, prior rank-based designs required repeated ALU computations within the SRAM or memory array, which is usually characterized as being time consuming.

For additional comparison, we evaluate the data reported in [49], which presents recent work on hardware sorting algorithms implemented on the Xilinx FPGA xc7vx690tffg1761-2 using 32-bit fixed point operations and running at a frequency of 125 MHz. Table IV shows the overall transistor counts, required number of BRAMs, and sorting time in microseconds. These compared designs show a linear increase in the FF/LUT count with respect to the number of elements; however, the BRAM requirements do not scale linearly. Since memory devices introduce performance bottlenecks, this results in non-linear execution time and non-linear transistor count.

With respect to all evaluated results, our comparison-free sorting design provides an efficient linear scalability of O(N). Our design uses simple registers (flag, order, and sorted registers) that are accessed on both the rising and falling clock edges, and simple standard CMOS components with a forward flowing data movement architecture. Even though our design shows a linear performance cost of O(N), our hardware design is recommended for data element set sizes of less than $2^{16}$ due to practical integration into large computing IC devices (e.g., graphics engines, routers, grid controllers), where the sorting hardware accounts for no more than 10% of the IC’s characteristics (power and area).

VII. CONCLUSION

In this paper, we proposed a novel mathematical comparison-free sorting algorithm and associated hardware implementation. Our sorting design exhibits linear complexity O(N) with respect to the sorting speed, transistor count, and power consumption. This linear growth is with respect to the number of elements N for $N = 2^K$, where K is the bit width of the input data. The slope of the linear growth rate is small, with a growth rate of approximately 6 for the transistor count and power consumption, and 1.5 for the sorting speed.

The order complexity and growth rates are due to simple basic circuit components that alleviate the need for SRAM-based memory and pipelining complexity. Our mathematically simple algorithm streamlines the sorting operation in one forward flowing direction rather than using compare operations and frequent data movement between the storage and computational units, as with other sorting algorithms. Our design uses simple standard library components including registers, a one-hot decoder, a one-detector, an incrementer/decrementer, and a PC, combined with a simple control unit that contains a small amount of delay logic.

Our design is at least 6× faster than software parallel algorithms that harness powerful computing resources for input data set sizes in the small-to-moderate range up to $2^{16}$. Additionally, our hardware design’s performance is approximately 1.5× better as compared to other optimized hardware-based hybrid sorting designs in terms of transistor count and design scalability, number of clock cycles and critical path delay, and power consumption. Thus, our design is suitable for most IC systems that require sorting algorithms as part of their computational operations.

Our results show that our comparison-free sorting CMOS hardware can sort N unsigned integer elements end-to-end with any input data set distribution within 2N to 3N clock cycles (lower and upper bounds, respectively) at a clock frequency of 0.5 GHz using a 90-nm TSMC technology with a 1 V power supply and a power consumption of 1.6 mW for N = 1024 elements.

Future work includes leveraging our sorting algorithm for commercial parallel processing computing power, such as GPUs and parallel processing machines, in order to further improve large-scale sorting, and thus, further enhance embedded sorting for big data applications.

REFERENCES

[1] D. E. Knuth, The Art of Computer Programming. Reading, MA, USA: Addison-Wesley, Mar. 2011.
[2] Y. Bang and S. Q. Zheng, “A simple and efficient VLSI sorting architecture,” in Proc. 37th Midwest Symp. Circuits Syst., vol. 1. 1994, pp. 70–73.
[3] T. Leighton, Y. Ma, and C. G. Plaxton, “Breaking the Θ(n log² n) barrier for sorting with faults,” J. Comput. Syst. Sci., vol. 54, no. 2, pp. 265–304, 1997.
[4] Y. Han, “Deterministic sorting in O(n log log n) time and linear space,” J. Algorithms, vol. 50, no. 1, pp. 96–105, 2004.
[5] C. Canaan, M. S. Garai, and M. Daya, “Popular sorting algorithms,” World Appl. Programm., vol. 1, no. 1, pp. 62–71, Apr. 2011.
[6] L. M. Busse, M. H. Chehreghani, and J. M. Buhmann, “The information content in sorting algorithms,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2012, pp. 2746–2750.
[7] R. Zhang, X. Wei, and T. Watanabe, “A sorting-based IO connection assignment for flip-chip designs,” in Proc. IEEE 10th Int. Conf. ASIC (ASICON), Oct. 2013, pp. 1–4.
[8] D. Fuguo, “Several incomplete sort algorithms for getting the median value,” Int. J. Digital Content Technol. Appl., vol. 4, no. 8, pp. 193–198, Nov. 2010.
[9] W. Jianping, Y. Yutang, L. Lin, H. Bingquan, and G. Tao, “High-speed FPGA-based SOPC application for currency sorting system,” in Proc. 10th Int. Conf. Electron. Meas. Instrum. (ICEMI), Aug. 2011, pp. 85–89.
[10] R. Meolic, “Demonstration of sorting algorithms on mobile platforms,” in Proc. CSEDU, 2013, pp. 136–141.
[11] F.-C. Leu, Y.-T. Tsai, and C. Y. Tang, “An efficient external sorting algorithm,” Inf. Process. Lett., vol. 75, pp. 159–163, Sep. 2000.
[12] J. L. Bentley and R. Sedgewick, “Fast algorithms for sorting and searching strings,” in Proc. 8th Annu. ACM-SIAM Symp. Discrete Algorithms (SODA), Jan. 1997, pp. 360–369.
[13] L. Xiao, X. Zhang, and S. A. Kubricht, “Improving memory performance of sorting algorithms,” J. Experim. Algorithmic, vol. 5, no. 3, pp. 1–20, 2000.
[14] P. Sareen, “Comparison of sorting algorithms (on the basis of average case),” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 3, no. 3, pp. 522–532, Mar. 2013.
[15] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “AA-SORT: A new parallel sorting algorithm for multi-core SIMD processors,” in Proc. 16th Int. Conf. Parallel Archit. Compil. Techn. (PACT), 2007, pp. 189–198.
[16] V. Kundeti and S. Rajasekaran, “Efficient out-of-core sorting algorithms for the parallel disks model,” J. Parallel Distrib. Comput., vol. 71, no. 11, pp. 1427–1433, 2011.
[17] G. Capannini, F. Silvestri, and R. Baraglia, “Sorting on GPUs for large scale datasets: A thorough comparison,” Int. Process. Manage., vol. 48, no. 5, pp. 903–917, 2012.
[18] D. Cederman and P. Tsigas, “GPU-Quicksort: A practical quicksort algorithm for graphics processors,” ACM J. Experim. Algorithmics (JEA), vol. 14, Dec. 2009, Art. no. 4.
[19] B. Jan, B. Montrucchio, C. Ragusa, F. G. Ghan, and O. Khan, “Fast parallel sorting algorithms on GPUs,” Int. J. Distrib. Parallel Syst., vol. 3, no. 6, pp. 107–118, Nov. 2012.
[20] N. Satish, M. Harris, and M. Garland, “Designing efficient sorting algorithms for manycore GPUs,” in Proc. 23rd IEEE Int. Symp. Parallel Distrib. Process., May 2009, pp. 1–10.
[21] C. Bunse, H. Höpfner, S. Roychoudhury, and E. Mansour, “Choosing the ‘best’ sorting algorithm from optimal energy consumption,” in Proc. ICSOFT, vol. 2. 2009, pp. 199–206.
[22] A. D. Mishra and D. Garg, “Selection of best sorting algorithm,” Int.
[29] J.-T. Yan, “An improved optimal algorithm for bubble-sorting-based non-Manhattan channel routing,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 18, no. 2, pp. 163–171, Feb. 1999.
[30] L. Skliarova, D. Mihhailov, V. Sklyarov, and A. Sudnitson, “Implementation of sorting algorithms in reconfigurable hardware,” in Proc. 16th IEEE Medit. Electrotech. Conf. (MELECON), Mar. 2012, pp. 107–110.
[31] N. Tabrizi and N. Bagherzadeh, “An ASIC design of a novel pipelined and parallel sorting accelerator for a multiprocessor-on-a-chip,” in Proc. IEEE 6th Int. Conf. ASIC (ASICON), Oct. 2005, pp. 46–49.
[32] H. Schröder, “VLSI-sorting evaluated under the linear model,” J. Complex., vol. 4, no. 4, pp. 330–355, Dec. 1988.
[33] H.-S. Yu, J.-Y. Lee, and J.-D. Cho, “A fast VLSI implementation of sorting algorithm for standard median filters,” in Proc. 12th Annu. IEEE Int. ASIC/SOC Conf., Sep. 1999, pp. 387–390.
[34] G. Campobello and M. Russo, “A scalable VLSI speed/area tunable sorting network,” J. Syst. Archit., vol. 52, no. 10, pp. 589–602, Oct. 2006.
[35] W. Zhou, Z. Cai, R. Ding, C. Gong, and D. Liu, “Efficient sorting design on a novel embedded parallel computing architecture with unique memory access,” Comput. Elect. Eng., vol. 39, no. 7, pp. 2100–2111, Oct. 2013.
[36] V. Sklyarov, “FPGA-based implementation of recursive algorithms,” Microprocess. Microsyst., vol. 28, nos. 5–6, pp. 197–211, Aug. 2004.
[37] R. Lin and S. Olariu, “Efficient VLSI architectures for Columnsort,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 1, pp. 135–138, Mar. 1999.
[38] S. W. Moore and B. T. Graham, “Tagged up/down sorter: A hardware priority queue,” Comput. J., vol. 38, no. 9, pp. 695–703, Sep. 1995.
[39] G. V. Russo and M. Russo, “A novel class of sorting networks,” IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 43, no. 7, pp. 544–552, Jul. 1996.
[40] S. Dong, X. Wang, and X. Wang, “A novel high-speed parallel scheme for data sorting algorithm based on FPGA,” in Proc. IEEE 2nd Int. Congr. Image Signal Process. (CISP), Oct. 2009, pp. 1–4.
[41] A. Széll and B. Fehér, “Efficient sorting architectures in FPGA,” in Proc. Int. Carpathian Control Conf. (ICCC), May 2006, pp. 1–4.
[42] A. A. Colavita, A. Cicuttin, F. Fratnik, and G. Capello, “SORTCHIP: A VLSI implementation of a hardware algorithm for continuous data sorting,” IEEE J. Solid-State Circuits, vol. 38, no. 6, pp. 1076–1079, Jun. 2003.
[43] T. Demirci, I. Hatirnaz, and Y. Leblebici, “Full-custom CMOS realization of a high-performance binary sorting engine with linear area-time complexity,” in Proc. Int. Symp. Circuits Syst. (ISCAS), vol. 5. May 2003, pp. V453–V456.
[44] K. Ratnayake and A. Amer, “An FPGA architecture of stable-sorting on a large data volume: Application to video signals,” in Proc. 41st Annu. Conf. Inf. Sci. Syst., Mar. 2007, pp. 431–436.
[45] S. Alaparthi, K. Gulati, and S. P. Khatri, “Sorting binary numbers in hardware: A novel algorithm and its implementation,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2009, pp. 2225–2228.
[46] J. F. Hughes et al., Computer Graphics: Principles and Practice, 3rd ed. Reading, MA, USA: Addison-Wesley, 2014.
[47] R. Chen, S. Siriyal, and V. Prasanna, “Energy and memory efficient mapping of bitonic sorting on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field Program. Gate (FPGA), Monterey, CA, USA, Feb. 2015, pp. 240–249.
[48] M. Zuluaga, P. Milder, and M. Püschel, “Streaming sorting networks,” ACM Trans. Design Autom. Electron. Syst., vol. 21, no. 4, May 2016,
Art. no. 55.
J. Intell. Inf. Process., vol. 2, no. 2, pp. 363–368, Jul./Dec. 2008.
[49] J. Matai et al., “Resolve: Generation of high-performance sorting
[23] T.-C. Lin, C.-C. Kuo, Y.-H. Hsieh, and B.-F. Wang, “Efficient algorithms architectures from high-level synthesis,” in Proc. ACM/SIGDA Int.
for the inverse sorting problem with bound constraints under the Symp. Field Program. Gate (FPGA), Monterey, CA, USA, Feb. 2016,
l∞-norm and the Hamming distance,” J. Comput. Syst. Sci., vol. 75, pp. 195–204.
no. 8, pp. 451–464, 2009. [50] Sorting Algorithms Animations, accessed on 2017.
[24] F. Henglein, “What is a sorting function?” J. Logic Algebraic Pro- [Online]. Available: https://www.toptal.com/developers/sorting-
gramm., vol. 78, no. 7, pp. 552–572, Aug./Sep. 2009. algorithms
[25] E. Mumolo, G. Capello, and M. Nolich, “VHDL design of a scalable [51] (2010). Cadence Online Documentation. [Online]. Available: http://
VLSI sorting device based on pipelined computation,” J. Comput. Inf. www.cadence.com
Technol., vol. 12, no. 1, pp. 1–14, 2004. [52] (2015). Synopsys Online Documentation. [Online]. [Online]. Available:
[26] E. Herruzo, G. Ruiz, J. I. Benavides, and O. Plata, “A new paral- http://www.synopsys.com
lel sorting algorithm based on odd-even mergesort,” in Proc. 15th [53] J. P. Uyemura, CMOS Logic Circuit Design. Norwell, MA, USA:
EUROMICRO Int. Conf. Parallel, Distrib. Netw.-Based Process. (PDP), Kluwer, 1999.
Feb. 2007, pp. 18–22. [54] J. P. Hayes, Computer Architecture and Organization, 2rd ed. New York,
[27] M. Thorup, “Randomized sorting in O(n log log n) time and linear space NY, USA: McGraw-Hill, 1994.
using addition, shift, and bit-wise Boolean operations,” J. Algorithms, [55] S. Lee, Advanced Digital Logic Design Using VHDL, State Machines,
vol. 42, no. 2, pp. 205–230, Feb. 2002. and Synthesis for FPGA’s. Luton, U.K.: Thomson Holidays, 2006.
[28] M. Afghahi, “A 512 16-b bit-serial sorter chip,” IEEE J. Solid-State [56] Taiwan Semiconductor Manufacturing Corporation. 90 nm CMOS ASIC
Circuits, vol. 26, no. 10, pp. 1452–1457, Oct. 1991. Process Digests, 2005.
[57] Synopsys. (2010). HSPICE. [Online]. Available: http://www.synopsys.com
[58] S. Abdel-Hafeez and A. Gordon-Ross, “A gigahertz digital CMOS divide-by-N frequency divider based on a state look-ahead structure,” J. Circuits, Syst. Signal Process., vol. 30, no. 6, pp. 1549–1572, 2011.
[59] V. Stojanovic and V. G. Oklobdzija, “Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems,” IEEE J. Solid-State Circuits, vol. 34, no. 4, pp. 536–548, Apr. 1999.

Saleh Abdel-Hafeez (M’01) received the B.S.E.E., M.S.E.E., and Ph.D. degrees in computer engineering from the USA with a specialization in very large scale integration (VLSI) design.
In 1997, he joined S3 Inc., Huntsville, AL, USA, as a member of their technical staff, where he was involved in IC circuit design related to cache memory, digital I/O, and ADCs. He was the Chairman of the Computer Engineering Department. He is currently an Associate Professor with the College of Computer and Information Technology, Jordan University of Science and Technology, Irbid, Jordan. He holds three patents (6,265,509; 6,356,509; 20040211982A1) in the field of IC design. His current research interests include circuits and architectures for low-power and high-performance VLSI.

Ann Gordon-Ross (M’00) received the B.S. and Ph.D. degrees in computer science and engineering from the University of California, Riverside, CA, USA, in 2000 and 2007, respectively.
She is currently an Associate Professor of Electrical and Computer Engineering with the University of Florida, Gainesville, FL, USA, where she is a member of the NSF Center for High Performance Reconfigurable Computing (CHREC). She is very active in promoting diversity in STEM fields. Her current research interests include embedded systems, computer architecture, low-power design, reconfigurable computing, dynamic optimizations, hardware design, real-time systems, and multicore platforms.
Dr. Gordon-Ross is the Faculty Advisor for the Women in Electrical and Computer Engineering and the Phi Sigma Rho National Society for Women in Engineering and Engineering Technology, and she is an active member of the Women in Engineering ProActive Network. She received the CAREER award from the National Science Foundation in 2010, the Best Paper Awards at the Great Lakes Symposium on VLSI in 2010 and the IARIA International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies in 2010, and the Best Ph.D. Forum Award at the IEEE Computer Society Annual Symposium on VLSI in 2014. She has been a Guest Speaker and has organized several international workshops/conferences on this topic, and participates in outreach programs at local K-12 schools.