
Iterative Parallel Shift Sort: Optimization and Design for Area Constrained Applications


Sumit Diware (1), Sharath B. Krishna (2)
(1,2) VLSI Design Tools and Technology, Indian Institute of Technology Delhi,
Hauz Khas, New Delhi 110016, India
(1) [email protected]
(2) [email protected]

Abstract: Sorting is an important computational task needed in almost all modern data processing applications. Insertion sort is one of the simplest sorting algorithms. However, a sequential implementation of insertion sort has time complexity $O(n^2)$, which makes it inefficient and often leads to it being passed over for many applications. This paper explores an insertion sort implementation in VHDL using the parallel shift sort technique, which results in linear time complexity $O(n)$. The designed model is further optimized for operation at higher data rates. An iterative design using the optimized model is then implemented on a Xilinx Spartan-6 FPGA; it uses in-place computations and allows large data sets to be processed with fewer hardware resources. This makes the iterative design ideal for area constrained applications which operate in a dynamic input environment with fixed hardware, such as real time sensor data processing.

Keywords: Insertion sort; VHDL; iterative; FPGA; linear time.

I. INTRODUCTION

Consider five unsorted numbers 21,19,24,15,16. These numbers are to be sorted in ascending order using insertion sort[1]. The execution of insertion sort can be demonstrated with these inputs as follows :

Iteration 1: Compare the first two elements. Since 21>19, swap their positions. We get 19,21,24,15,16.

Iteration 2: Compare the third element with all elements before it (first and second, which are already sorted in ascending order) and insert it at the appropriate place among the three elements to achieve an ascending order array. As 19<24 and 21<24, 24 stays at the same place. We get 19,21,24,15,16.

Iteration 3: Compare the next element with all elements before it (first, second and third, which are already sorted in ascending order) and insert it at the appropriate place among these elements to achieve an ascending order array. Since 19>15, 21>15 and 24>15, we get 15,19,21,24,16.

Iteration 4: Compare the next element with all elements before it (first, second, third and fourth, which are already sorted in ascending order) and insert it at the appropriate place among these elements to achieve an ascending order array. Since 19>16, 21>16 and 24>16 but 15<16, we get 15,16,19,21,24. This is the final sorted array.

The hardware adaptation of this algorithm is called Parallel Shift Sort (PSS)[2]. PSS inserts the incoming elements at their proper positions to get the sorted array, as shown in Fig. 1. We have implemented a ‘Fill and Flush’ version of parallel shift sort.

Fig. 1. Parallel shift sort (PSS) algorithm

Recent works regarding merge sort were explored to get better design insights[3][4][5]. The entire design is implemented in VHDL[6]. The sequential domain to parallel domain transition is achieved by designing a hardware processing element described in the next section.

II. THE PROCESSING ELEMENT

Pseudo-code for a sequential execution loop for ascending order insertion sort is (L: total number of inputs to be sorted, contained in an array S indexed from 0) :

    for m = 1 to L-1
        n = m
        while n > 0 and S(n-1) > S(n)
            swap S(n-1) and S(n)
            n = n-1
        end while loop
    end for loop
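As a cross-check of the pseudo-code, the following is a minimal Python sketch of the same swap-based insertion sort. The function name `insertion_sort_trace` and the list of per-iteration snapshots are illustrative choices and not part of the paper's design; the array is assumed to be indexed from 0.

```python
def insertion_sort_trace(values):
    """Ascending insertion sort by adjacent swaps.

    Returns one snapshot of the array after each outer-loop iteration,
    so the result can be compared with Iterations 1 to 4 in the text.
    """
    s = list(values)                  # work on a copy of the input
    snapshots = []
    for m in range(1, len(s)):        # m = 1 .. L-1
        n = m
        # Move element m leftwards until its left neighbour is not larger.
        while n > 0 and s[n - 1] > s[n]:
            s[n - 1], s[n] = s[n], s[n - 1]
            n -= 1
        snapshots.append(list(s))
    return snapshots


if __name__ == "__main__":
    for step, state in enumerate(insertion_sort_trace([21, 19, 24, 15, 16]), 1):
        print("Iteration", step, state)
```

The nested loops are what give the sequential version its quadratic worst case: the inner `while` can run up to m times for each m.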

978-1-5090-3012-5/17/$31.00 ©2017 IEEE


Authorized licensed use limited to: NATIONAL TAIWAN NORMAL UNIVERSITY. Downloaded on September 22,2024 at 09:33:21 UTC from IEEE Xplore. Restrictions apply.
2017 6th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), Sep. 20-22, 2017,
AIIT, Amity University Uttar Pradesh, Noida, India

For hardware implementation, the inner loop is realized using a processing element that consists of a comparator, which generates the swapping trigger, and two storage registers storing the swapped or retained results. The outer loop is emulated by a cascade of such processing elements. The sorting operation is thus mapped onto concurrent hardware processes.

Fig. 2. Basic processing element (PE)

Fig. 3. Chain of processing elements for sorting

The generalised ‘Fill and Flush’ PSS algorithm using the processing element (PE) for 4 bit numbers is :

1. Initialize all the PEs in the cascade chain with ‘0’ (smallest hexadecimal number) for ascending sorting and with ‘F’ (largest hexadecimal number) for descending sorting. (‘Fill’ stage)

2. Each PE retains the larger number on comparison for ascending sorting, and the smaller number on comparison for descending sorting, in its internal storage.

3. The other number is propagated along the PE chain by means of the output storage.

4. The PE chain and the internal registers are successively updated by the consecutive comparison and storage operations. This propagation and updating puts each incoming number at its proper position in the chain.

5. After the last number to be sorted has been given to the sorter, the next successive inputs should be ‘F’s (largest hexadecimal number) to flush the sorted numbers out of the sorter for ascending sorting; the same is achieved with successive ‘0’s (smallest hexadecimal number) for descending sorting. The flushed out data is the sorted output. (‘Flush’ stage)

6. The flushing out is needed as we may need to store the sorted elements in some memory locations or to give them as input to the next hardware unit to perform further tasks.

7. For N bit inputs, in place of 0 (hex) use the smallest N bit number (N 0s) and in place of F use the largest N bit number (N 1s).

This algorithm is demonstrated in Fig. 4 to sort three numbers 5,3,4 in ascending order. Each square represents a processing element. The input to a processing element is represented by an arrow entering the square from its left side, and the output to the next stage from the output storage of the processing element is represented by an arrow leaving the square on its right side. Cascade connections are made to form the chain.

Fig. 4. Sorting using ‘Fill and Flush’ PSS technique

III. MODIFIED PROCESSING ELEMENT

The input to the processing element comes from a memory in which the unsorted array is stored. The previous demonstration used a memory which reads or writes at the positive clock edge, with the internal and output storage registers also positive edge triggered. This reduces the complexity from $N^2$ cycles to 3N cycles. Hardware implementations of merge sort have presented various optimization and acceleration methodologies[7]–[9]. In a similar manner, we have made an attempt to improve this timing efficiency further with the following two key modifications in the processing element design :


1. The internal storage register works on the opposite clock edge to that of the memory. In our entire design, the memory is activated on the positive clock edge and the internal storage register is negative edge triggered.

2. The output storage register is removed. It is needed only if there is a chance that incoming data would be lost before the current clock cycle is complete (e.g. data coming over a serial link with no buffer storage). Since we are getting data from the memory, and the memory location does not change until the next clock edge, the output register is redundant in our case. However, if data comes from a source that may not hold it steady for one cycle, the output register would be required for intermediate storage.

Fig. 5. Modified processing element

Removal of the output storage register makes the result propagation in the chain faster, and negative edge triggering makes the comparison and storage processing faster. This speeds up the multiple dependent comparisons in the chain, reducing the sorting duration to 2N cycles.

This is demonstrated in Fig. 6 to sort three numbers 5,3,4 in ascending order, with arrows and squares having the same meaning as in the previous section:

Fig. 6. ‘Fill and Flush’ PSS sorting with modified processing elements

Assume PE1 and PE2 are similar processing elements connected in the chain in Fig. 7 with the same clock input. Let $T_{pc}$ be the propagation delay of the combinational comparison and selection block and $T_{ps}$ be the propagation delay of the sequential internal storage block.

In order for both blocks to keep the comparison data ready before the data input arrives, the following constraint is imposed on the clock period $T_{clk}$ :

$T_{clk} \geq T_{pc} + T_{ps}$

Fig. 7. Sorter topology analysis

IV. ITERATIVE SORTING

Assume the sorter to be working in a dynamic input environment where the number of inputs to be sorted is not always constant. The sorter has the number of processing elements (P) equal to the number of inputs to be sorted (N). Different scenarios arising in this system can be :

1. N<P : Sorter will give correct sorted output. This would however result in reduced area and timing efficiency, wasting (P-N) processing elements and (2P-2N) clock cycles.

2. N=P : Sorter will give correct sorted output. The area and timing efficiency are optimal in this case.

3. N>P : Sorter cannot operate properly in this case. Out of the N inputs it can partially sort only P inputs, and the remaining N-P inputs are left unsorted.

The normal sorter cannot directly process the N>P scenario. However, the problem can be solved by coupling this sorter with a control circuit such that the resulting system partially sorts P inputs out of the total N and repeats the partial sorting for N/P iterations. When N/P is not an integer, the minimum number of iterations is the ceiling of N/P.
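The repeated partial sorting can be sketched with a small functional model in Python. This is an illustration under stated assumptions, not the paper's VHDL: each pass lets every input ripple through all P processing elements in one step (no timing is modelled), the helper names `chain_pass` and `iterative_sort` are hypothetical, and the 5 bit word width is chosen only so that the twelve-number, four-PE descending example from this paper fits.

```python
def chain_pass(data, p, descending=True, width=5):
    """One 'Fill and Flush' pass of data through a chain of p PEs."""
    lo, hi = 0, (1 << width) - 1
    # Fill value and flush value are opposite extremes of the word range.
    fill, flush = (hi, lo) if descending else (lo, hi)
    kept = [fill] * p                          # 'Fill' stage
    out = []
    for x in list(data) + [flush] * p:         # N inputs, then P flush words
        for i in range(p):
            a, b = kept[i], x
            # Each PE retains one value and forwards the other.
            kept[i] = min(a, b) if descending else max(a, b)
            x = max(a, b) if descending else min(a, b)
        out.append(x)
    return out[p:]                             # drop the P initialisation words


def iterative_sort(data, p, descending=True, width=5):
    """Repeat chain_pass ceil(N/P) times; return the array after each pass."""
    n = len(data)
    passes = (n + p - 1) // p                  # ceiling of N/P
    history = []
    for _ in range(passes):
        data = chain_pass(data, p, descending, width)
        history.append(list(data))
    return history


if __name__ == "__main__":
    inputs = [0x2, 0x4, 0x6, 0x8, 0xA, 0xC, 0xE, 0x10, 0x12, 0x14, 0x16, 0x18]
    for k, state in enumerate(iterative_sort(inputs, p=4), 1):
        print("Iteration", k, [format(v, "X") for v in state])
```

In this model each pass leaves P more elements in their final positions at the end of the array, which is why ceil(N/P) passes suffice.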


Consider the following block diagram of the iterative sorting technique :

Fig. 8. Iterative sorting technique

1. The sorting module contains P processing elements and reads the numbers (READ DATA) from dual port RAM addresses provided by the control circuit via signal READ ADDR.

2. In every iteration, it sorts P numbers out of the unsorted elements (out of total N), giving partially sorted data.

3. After each iteration of the sorting process, the sorter output (partially sorted data for the iteration) is stored back (WRITE DATA) in the dual port RAM starting from the first address from which the numbers are read. The WRITE ADDR signal provides this write back address.

4. The WRITE ENABLE signal avoids conflicts which may arise due to an attempt to simultaneously read and write to the same RAM address. It also provides synchronization, as we initially have to wait a few cycles till useful data gets flushed out of the sorting module. This is because the values initialized in the processing elements, which do not carry useful information, flush out first.

5. Once N numbers have been processed by the sorter, one iteration has ended and the address is reset to the initial (starting) storage address of the RAM. This also triggers the CLEAR signal, which reinitializes the processing elements for the next sorting iteration.

6. The partially sorted data is fed to the sorter for the second iteration, and the process is repeated for N/P iterations.

7. The RESET signal is the hard-reset for the entire control circuit. DISPLAY RESET is the hard-reset for RAM addresses while displaying sorted numbers after the sorting process is complete.

State diagram for control circuit is as shown :

Fig. 9. Iterative sorting state diagram

Flag ITERRESET is used to control the state diagram. It monitors the number of iterations that have occurred in the system, and thereby keeps track of whether the sorting process is complete or not.

ITERRESET = 1 means N/P iterations are over and the sorting is complete. The system goes to state 2 if it was in state 1; if it was already in state 2, it remains in state 2, displaying the sorted numbers.

ITERRESET = 0 means N/P iterations are not over and the sorting continues. The system goes to state 1 if it was in state 2, thus starting a fresh sorting process; if it was already in state 1, it remains in state 1, thereby advancing to the next sorting iteration.

Internal details of the control circuit for the iterative sorting process are described with the help of the following block diagram (clock inputs are implicitly assumed for all sequential blocks):


Fig. 10. Iterative sorting control circuit

1. During sorting, we need the N numbers to be sorted and P flushing numbers (the largest/smallest n bit number) for flushing the sorted data out of the processing elements. Hence, for the sorting process, the read address is provided by a counter C1 that counts up to N+P addresses.

2. The sorting module flushes the initialization data out of the processing elements during the first P cycles. Hence, writing useful data at the initial address should begin P cycles after the reading has begun, i.e. the (P+1)th data is the first useful data from the sorter. Hence, for the initial P cycles, the write address is held at the initial read address (in our case ‘00000’). After P cycles, the write address is obtained by subtracting (P+1) from the read address C1 provided by the counter, mapping the first useful write operation to the initial read address (‘0000’).

3. Write back to the RAM is allowed by activating the WRITE ENABLE signal only if a read cycle is going on for N+P numbers (counter C1 has not overflowed, indicated by OVERFLOW = 0) and ITERRESET = 0, meaning that the N/P iterations are not over. In any other case it is prohibited. This provides synchronization and protection from read/write conflicts when the address pointed to by counter C1 is reset to the initial one.

4. When counter C1 overflows, OVERFLOW is set to 1, meaning one read cycle has ended and the next sorting iteration is about to begin. To make the sorter ready, it has to be properly initialized in the ‘Fill’ stage. This is done by the CLEAR signal. The OVERFLOW signal selects the proper value for CLEAR.

5. When the sorting is over, we don’t need the flushing numbers and simply have to display the N sorted numbers. Hence, when the sorting iterations are over, as indicated by ITERRESET = 1, the read address is given by counter C2, which simply counts up to N instead of N+P.

6. DISPLAY RESET can be used to reset the display address after sorting is complete, and RESET can be used to reset the read addresses during the sorting process.

E.g. Consider twelve hexadecimal numbers 2,4,6,8,A,C,E,10,12,14,16,18 to be sorted in descending order with 4 processing elements. (Worst case input test for a descending sorter = all the inputs in ascending order)

Iteration 1 Output : A,C,E,10,12,14,16,18,8,6,4,2
It sorted smallest 4 numbers out of 12 in descending order 8,6,4,2. Rest 8 unsorted.

Iteration 2 Output : 12,14,16,18,10,E,C,A,8,6,4,2
It sorted smallest 4 numbers out of 8 unsorted ones in descending order 10,E,C,A. Rest 4 unsorted.

Iteration 3 Output : 18,16,14,12,10,E,C,A,8,6,4,2
It sorted smallest 4 numbers out of 4 unsorted ones in descending order 18,16,14,12. Nothing left to sort.

The iterative solution is implemented with dual port RAM, where simultaneous read and write can be done provided they operate on different memory locations. The same assembly can be implemented using single port RAM, which allows only a single read or a single write to be performed at a time. However, it is not resource efficient, as it needs two RAMs instead of only one RAM in the dual port case. In every iteration, one RAM provides the input numbers to the sorter and the other RAM stores the partially sorted data coming out of the sorter. Their roles are then reversed for the next iteration. This needs a bidirectional linkage of the sorting module to the two RAMs, making the control circuit more complicated, thereby increasing area, increasing the length of the critical path in the design and reducing the execution speed. Hence, the faster and more area efficient system with dual port RAM has been chosen for implementation.

V. SIMULATION RESULTS AND ANALYSIS

Fig.11 shows simulation for ascending order sorting of twelve unsorted hexadecimal input numbers by a chain of twelve processing elements.

Inputs to be sorted are coming from the signal temp1[3:0] in the sequence 8,2,3,4,7,9,6,5,C,B,E,A and the correct sorted output is given by signal dout[3:0] as the sequence 2,3,4,5,6,7,8,9,A,B,C,E.
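The twelve-number ascending case can be cross-checked with a step-by-step software model of the chain, sketched below in Python. It is a toy model and not the VHDL design: the function name `run_chain` is hypothetical, clock edges are abstracted into loop iterations, and the consumer is assumed to sample the last PE's output register one step after it is written. Under these assumptions the chain with output registers takes 3N steps and the chain without them takes 2N steps for N = P, which lines up with the cycle counts quoted in Section III but is not a substitute for timing analysis of the real circuit.

```python
def run_chain(inputs, p, registered, width=4):
    """Ascending 'Fill and Flush' sort of len(inputs) <= p values.

    registered=True models PEs with an output register (one step of
    latency per PE); registered=False models PEs without it.
    Returns (sorted_values, number_of_steps).
    """
    lo, hi = 0, (1 << width) - 1
    n = len(inputs)
    latency = p if registered else 0           # steps spent in output registers
    stream = list(inputs) + [hi] * (p + latency)   # data, then flush words
    kept = [lo] * p                            # internal registers ('Fill')
    outreg = [lo] * p                          # output registers
    seen = []
    for x in stream:
        if registered:
            seen.append(outreg[-1])            # sampled before this step's update
            pe_in = [x] + outreg[:-1]          # neighbours' previous outputs
            for i in range(p):
                a, b = kept[i], pe_in[i]
                kept[i], outreg[i] = max(a, b), min(a, b)
        else:
            for i in range(p):
                a, b = kept[i], x
                kept[i], x = max(a, b), min(a, b)
            seen.append(x)
    return seen[-n:], len(stream)


if __name__ == "__main__":
    data = [0x8, 0x2, 0x3, 0x4, 0x7, 0x9, 0x6, 0x5, 0xC, 0xB, 0xE, 0xA]
    for mode in (True, False):
        values, steps = run_chain(data, p=12, registered=mode)
        print(mode, steps, [format(v, "X") for v in values])
```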


Fig. 11. Sorting of 12 numbers by a chain of 12 processing elements

The next set of simulations illustrates descending order sorting of 12 hexadecimal input numbers by a chain of 4 processing elements. Input numbers are 2,4,6,8,A,C,E,10,12,14,16,18.

Fig.12 shows the output of the first sorting iteration being carried out, and it matches the expected correct iteration output A,C,E,10,12,14,16,18,8,6,4,2.

Fig. 12. Sorting 12 numbers with 4 processing elements (First iteration)

Fig.13 shows the output of the second sorting iteration being carried out, and it matches the expected correct iteration output 12,14,16,18,10,E,C,A,8,6,4,2.

Fig. 13. Sorting 12 numbers with 4 processing elements (Second iteration)

Fig.14 shows the output of the third sorting iteration being carried out, and it matches the expected correct final output 18,16,14,12,10,E,C,A,8,6,4,2.

Fig. 14. Sorting 12 numbers with 4 processing elements (Third iteration)

The three iterative simulations showed that sorting 12 numbers with 4 processing elements needed 51 clock cycles. This can be explained as follows : Every iteration of sorting takes P+N cycles to partially sort the numbers and flush them out to the memory. However, in the hardware implementation, considering the various combinational blocks working in the control circuit and to avoid dual port RAM read-write conflicts, an extra clock cycle is needed for synchronization. Thus, one iteration needs N+P+1 cycles in total, and the total number of cycles needed will be $(N+P+1) \times (N/P)$. Substituting N=12 and P=4 in our case yields $(12+4+1) \times (12/4) = 51$ cycles, as verified by simulation.

Thus, irrespective of ascending or descending order sorting, and irrespective of best case or worst case input, the number of cycles required by the iterative sorter is given by $(N+P+1) \times (N/P) = (N^2+N+NP)/P$. For large N and small P, $N^2 \gg N$ and $N^2 \gg NP$. Hence, the number of cycles needed would be approximately $N^2/P$.

Sequential implementation of insertion sort would be equivalent to using P = 1 in the iterative system; it would require $(N+2) \times N = N^2+2N$ cycles, which can be approximated to $N^2$ cycles for large N.

This shows that the iterative sorter is approximately P times faster than the sequential implementation. Ideally, sorting of N numbers would need N processing elements. However, even with (N/P) times less hardware resources, as shown in TABLE I, for $N = 10^5$ the iterative sorter gives P times better timing performance (takes P times fewer execution cycles) than the sequential execution, which would need $10^{10}$ clock cycles for sorting.

TABLE I
Execution Time Performance ($N = 10^5$)

Number of processing elements (P)    Iterative sorter execution time (Clock cycles)
10                                   $1.00011 \times 10^9$
100                                  $1.00101 \times 10^8$
1000                                 $1.01001 \times 10^7$
10000                                $1.10001 \times 10^6$
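The cycle-count expressions above are easy to evaluate directly. The short Python sketch below reproduces the 51-cycle figure and the TABLE I entries from $(N+P+1) \times (N/P)$; the function names are illustrative, and rounding N/P up to the next integer for non-divisible cases follows the ceiling rule stated in Section IV.

```python
def iterative_cycles(n, p):
    """Clock cycles for the iterative sorter: (N + P + 1) per iteration."""
    iterations = (n + p - 1) // p      # ceiling of N/P, integer arithmetic
    return (n + p + 1) * iterations


def sequential_cycles(n):
    """Sequential insertion sort treated as the P = 1 case: (N + 2) * N."""
    return iterative_cycles(n, 1)


if __name__ == "__main__":
    print(iterative_cycles(12, 4))               # 51
    n = 10**5
    for p in (10, 100, 1000, 10000):
        cycles = iterative_cycles(n, p)
        speedup = sequential_cycles(n) / cycles
        print(p, cycles, round(speedup, 1))
```

The printed speed-up stays close to P while P is much smaller than N and falls below P as P grows, since the NP term in the numerator is then no longer negligible.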


TABLE II shows that FPGA resource utilization is almost independent of the number of inputs to be sorted. This confirms that fixed sorter hardware is able to handle a varying number of inputs.

TABLE II
FPGA Resource Utilization (P = 10)

Number of inputs to be sorted (N)    Registers    LUTs
100                                  126          1102
1000                                 127          1106
10000                                129          1108
100000                               132          1109

VI. CONCLUSIONS

In this paper we have implemented the ‘Fill and Flush’ version of the parallel shift sort technique. A generic design was successfully synthesized in Xilinx ISE and the prototype was tested on a Xilinx Spartan-6 (XC6SLX45) FPGA. The initial (N=P) implementation with optimized hardware resulted in linear time complexity, needing 2N clock cycles for sorting. Since the sorter will have a limited area allocated on the chip, P will be constant. However, in a dynamic input environment where N is not constrained, for the N>P case we can’t increase P to make P=N due to this area constraint. This issue was resolved by extending the basic N=P sorter to an iterative implementation which allows fixed hardware to handle the dynamic input scenarios. The iterative sorting operation resulted in an execution time of approximately $N^2/P$ cycles, thus still giving P times faster performance than the sequential implementation while using N/P times fewer processing elements than the basic hardware implementation.

Time (execution cycles) and space (resource utilization) analysis of the iterative implementation confirms its faster performance and its ability to handle dynamically changing input scenarios. These key advantages make the iterative sorter a strong option for applications which have strict area and resource constraints but are required to operate in a dynamic input environment, e.g. real time processing of data sets obtained from a sensor module.

REFERENCES

[1] D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed. Redwood City, CA, USA: Addison Wesley Longman Publishing Co., 1998.
[2] K. Ø. Arisland, A. C. Aasbø, and A. Nundal, “VLSI parallel shift sort algorithm and design,” Integration the VLSI Journal, vol. 2, no. 4, pp. 331–347, 1984.
[3] D. Koch and J. Torresen, “FPGASort: A High Performance Sorting Architecture Exploiting Run-time Reconfiguration on FPGAs for Large Problem Sorting,” in Proceedings of the 19th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2011, pp. 45–54.
[4] H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger, “A novel sorting algorithm for many-core architectures based on adaptive bitonic sort,” in Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012, vol. 1, pp. 227–237.
[5] R. Kobayashi and K. Kise, “FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems,” in Proceedings of the 9th Annual IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015, pp. 49–56.
[6] Z. Navabi, VHDL: Analysis and Modeling of Digital Systems, 2nd ed. New York: McGraw-Hill, 1998.
[7] W. Song, D. Koch, M. Lujan, and J. Garside, “Parallel Hardware Merge Sorter,” in Proceedings of the 24th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines, 2016, pp. 95–102.
[8] A. Srivastava, R. Chen, V. K. Prasanna, and C. Chelmis, “A Hybrid Design for High Performance Large-Scale Sorting on FPGA,” in Proceedings of the 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2015, pp. 1–6.
[9] N. Matsumoto, K. Nakano, and Y. Ito, “Optimal Parallel Hardware K-Sorter and Top K-Sorter, with FPGA Implementations,” in Proceedings of the 14th Annual International Symposium on Parallel and Distributed Computing, 2015, pp. 138–147.
