Iterative parallel shift sort: Optimization and design for area constrained applications
Abstract: Sorting is an important computational task needed in almost all modern data processing applications. Insertion sort is one of the simplest sorting algorithms. However, a sequential implementation of insertion sort has time complexity O(n^2), which makes it inefficient and often rules it out for many applications. This paper explores an insertion sort implementation in VHDL using the parallel shift sort technique, which results in linear time complexity O(n). The designed model is further optimized for operation at higher data rates. An iterative design using the optimized model is then implemented on a Xilinx Spartan-6 FPGA; it uses in-place computations and allows large data sets to be processed with fewer hardware resources. This makes the iterative design well suited to area constrained applications that operate in a dynamic input environment with fixed hardware, such as real time sensor data processing.

I. INTRODUCTION

The hardware adaptation of this algorithm is called Parallel Shift Sort (PSS) [2]. PSS inserts the incoming elements at their proper positions to get the sorted array, as shown in Fig. 1. We have implemented a ‘Fill and Flush’ version of parallel shift sort.
For hardware implementation, the inner loop is realized using a processing element that consists of a comparator, which generates the swapping trigger, and two storage registers, which store the swapped or retained results. The outer loop is emulated by a cascade of such processing elements. The sorting operation is thus mapped into concurrent hardware processes.

them as input to the next hardware unit to perform further tasks.

7. For N bit inputs, in place of 0(hex) use the N bit smallest number (N 0s) and in place of F use the N bit largest number (N 1s).
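The cascade of processing elements and the ‘Fill and Flush’ idea can be illustrated with a small behavioral model in Python. This is only a sketch under stated assumptions, not the VHDL design: each processing element is reduced to a single register, an incoming value ripples through the whole chain in one step, each element keeps the larger value and passes the smaller one on, and the names `pe_chain_step`, `fill_and_flush_sort`, and `width` are hypothetical. The registers start at the smallest word (all 0s) and the chain is flushed by feeding the largest word (all 1s).

```python
def pe_chain_step(regs, x):
    """Push one value through the chain of processing elements (PEs).

    Assumed behavior: each PE keeps the larger of its stored value and the
    incoming value and passes the smaller one to the next PE.
    Returns the value that leaves the last PE.
    """
    for i in range(len(regs)):
        if x > regs[i]:              # comparator raises the swap trigger
            regs[i], x = x, regs[i]  # store the larger, forward the smaller
    return x


def fill_and_flush_sort(data, width=4):
    """Sort unsigned `width`-bit integers in ascending order (N = P case)."""
    smallest = 0                     # word of all 0s
    largest = (1 << width) - 1       # word of all 1s
    regs = [smallest] * len(data)    # one PE per input
    cycles = 0

    # Fill phase: only the initial 0s leave the chain, so outputs are ignored.
    for x in data:
        pe_chain_step(regs, x)
        cycles += 1

    # Flush phase: feeding the largest word pushes stored values out in order.
    out = []
    for _ in data:
        out.append(pe_chain_step(regs, largest))
        cycles += 1
    return out, cycles


result, cycles = fill_and_flush_sort([7, 3, 9, 1])
print(result, cycles)  # prints [1, 3, 7, 9] 8
```

In this model the fill and flush phases each take one step per input, so sorting N values takes 2N steps.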
Authorized licensed use limited to: NATIONAL TAIWAN NORMAL UNIVERSITY. Downloaded on September 22,2024 at 09:33:21 UTC from IEEE Xplore. Restrictions apply.
2017 6th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), Sep. 20-22, 2017,
AIIT, Amity University Uttar Pradesh, Noida, India
1. The internal storage register works on the opposite clock edge to that of the memory. Throughout our design, the memory is activated on the positive clock edge and the internal storage register is negative edge triggered.

2. The output storage register is removed. It is needed only if incoming data could be lost before the current clock cycle is complete (e.g. data coming over a serial link with no buffer storage). Since we read data from the memory, and the memory location does not change until the next clock edge, the output register is redundant in our case. However, if data comes from a source that may not hold it steady for one cycle, the output register is required for intermediate storage.

Fig. 6. ‘Fill and Flush’ PSS sorting with modified processing elements

Assume PE1 and PE2 are similar processing elements connected in the chain in Fig. 7 with the same clock input. Let Tpc be the propagation delay of the combinational comparison and selection block and Tps be the propagation delay of the sequential internal storage block.

In order for both blocks to keep the comparison data ready before the data input arrives, the following constraint is imposed on the clock period Tclk:

Tclk ≥ Tpc + Tps
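As a worked example of the clock period constraint, the following Python sketch assumes the constraint has the form Tclk ≥ Tpc + Tps and computes the resulting upper bound on the clock frequency. The function names and the delay values are hypothetical and are not measurements from the design.

```python
def min_clock_period_ns(t_pc_ns, t_ps_ns):
    """Smallest clock period allowed by the assumed form Tclk >= Tpc + Tps."""
    return t_pc_ns + t_ps_ns


def max_clock_mhz(t_pc_ns, t_ps_ns):
    """Upper bound on clock frequency in MHz for delays given in ns."""
    return 1000.0 / min_clock_period_ns(t_pc_ns, t_ps_ns)


# Hypothetical delays: 3 ns for comparison and selection, 1 ns for storage.
print(max_clock_mhz(3.0, 1.0))  # prints 250.0
```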
2. N=P : Sorter will give correct sorted output. The area and
timing efficiency are optimal in this case.
Consider the following block diagram of the iterative sorting technique:

1. The sorting module contains P processing elements and reads the numbers (READ DATA) from the dual port RAM addresses provided by the control circuit via the READ ADDR signal.

2. In every iteration, it sorts P numbers out of the unsorted elements (out of a total of N), giving partially sorted data.

3. After each iteration of the sorting process, the sorter output (partially sorted data for the iteration) is stored back (WRITE DATA) in the dual port RAM, starting from the first address from which the numbers were read. The WRITE ADDR signal provides this write back address.

4. The WRITE ENABLE signal avoids conflicts that may arise from an attempt to simultaneously read and write the same RAM address. It also provides synchronization, since we initially have to wait a few cycles until useful data is flushed out of the sorting module: the values initialized in the processing elements flush out first and carry no useful information.

The flag ITERRESET is used to control the state diagram. It monitors the number of iterations that have occurred in the system, and thereby keeps track of whether the sorting process is complete.

ITERRESET = 1 means N/P iterations are over and the sorting is complete. The system goes to state 2 if it was in state 1; if it was already in state 2, it remains in state 2, displaying the sorted numbers.

ITERRESET = 0 means N/P iterations are not over and the sorting continues. The system goes to state 1 if it was in state 2, thus starting a fresh sorting process; if it was already in state 1, it remains in state 1, thereby advancing to the next sorting iteration.

Internal details of the control circuit for the iterative sorting process are described with the help of the following block diagram (clock inputs are implicitly assumed for all sequential blocks):
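As a software illustration of the iteration steps and the ITERRESET behavior described in this section (not a rendering of the block diagrams), the following Python sketch models one plausible reading of the scheme. The assumptions are ours: each of the P processing elements is reduced to one register that keeps the larger value, the registers are re-initialized to the smallest word before every iteration, the chain is flushed with the largest word, and the write address trails the read address by P so that the two ports never touch the same location in one cycle. The names `pe_chain_step`, `next_state`, and `iterative_sort` are hypothetical.

```python
def pe_chain_step(regs, x):
    """Each PE keeps the larger value and forwards the smaller one."""
    for i in range(len(regs)):
        if x > regs[i]:
            regs[i], x = x, regs[i]
    return x


def next_state(iterreset):
    """State 1 = sorting, state 2 = displaying the sorted numbers."""
    return 2 if iterreset else 1


def iterative_sort(ram, p, width=8):
    """Sort `ram` in place with a fixed chain of `p` PEs; return iterations."""
    smallest, largest = 0, (1 << width) - 1
    unsorted = len(ram)              # length of the still unsorted prefix
    iterations = 0
    state = 1
    while state == 1:
        regs = [smallest] * p        # initial values carry no information
        for cycle in range(unsorted + p):
            # READ ADDR = cycle; after the data, feed flush words.
            word = ram[cycle] if cycle < unsorted else largest
            out = pe_chain_step(regs, word)
            write_enable = cycle >= p    # skip the initial values
            if write_enable:
                ram[cycle - p] = out     # WRITE ADDR trails READ ADDR by p
        unsorted = max(unsorted - p, 0)  # p largest values are now in place
        iterations += 1
        iterreset = 1 if unsorted == 0 else 0
        state = next_state(iterreset)
    return iterations


ram = [5, 2, 9, 1, 7, 3, 8, 6]
iterations = iterative_sort(ram, p=3)
print(ram, iterations)  # prints [1, 2, 3, 5, 6, 7, 8, 9] 3
```

In this model each iteration moves the P largest remaining values to the end of the unsorted region, so the number of iterations is N/P rounded up. The cycle count of the sketch is not meant to reproduce the timing of the hardware.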
TABLE II shows that FPGA resource utilization is almost independent of the number of inputs to be sorted. This confirms that fixed sorter hardware is able to handle a varying number of inputs.

TABLE II
FPGA Resource Utilization (P = 10)

Number of inputs to be sorted (N)    Registers    LUTs
100                                  126          1102
1000                                 127          1106
10000                                129          1108
100000                               132          1109

VI. CONCLUSIONS

In this paper we have implemented the ‘Fill and Flush’ version of the parallel shift sort technique. A generic design was successfully synthesized in Xilinx ISE and the prototype was tested on a Xilinx Spartan-6 (XC6SLX45) FPGA. The initial (N=P) implementation with optimized hardware resulted in linear time complexity, needing 2N clock cycles for sorting. Since the sorter has a limited area allocated on the chip, P is constant. In a dynamic input environment, where N is not constrained, we cannot increase P to make P=N in the N>P case because of this area constraint. This issue was resolved by extending the basic N=P sorter to an iterative implementation, which allows fixed hardware to handle dynamic input scenarios. The iterative sorting operation resulted in an execution time of N^2/P cycles, thus still giving P times faster performance than the sequential implementation while using N/P times fewer processing elements than the basic hardware implementation.

REFERENCES

Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2011, pp. 45–54.
[4] H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger, “A novel sorting algorithm for many-core architectures based on adaptive bitonic sort,” in Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012, vol. 1, pp. 227–237.
[5] R. Kobayashi and K. Kise, “FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems,” in Proceedings of the 9th Annual IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015, pp. 49–56.
[6] Z. Navabi, VHDL: Analysis and Modeling of Digital Systems, 2nd ed. New York: McGraw-Hill, 1998.
[7] W. Song, D. Koch, M. Lujan, and J. Garside, “Parallel Hardware Merge Sorter,” in Proceedings of the 24th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines, 2016, pp. 95–102.
[8] A. Srivastava, R. Chen, V. K. Prasanna, and C. Chelmis, “A Hybrid Design for High Performance Large-Scale Sorting on FPGA,” in Proceedings of the 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2015, pp. 1–6.
[9] N. Matsumoto, K. Nakano, and Y. Ito, “Optimal Parallel Hardware K-Sorter and Top K-Sorter, with FPGA Implementations,” in Proceedings of the 14th Annual International Symposium on Parallel and Distributed Computing, 2015, pp. 138–147.