
Iterative Parallel Shift Sort: Optimization and Design for Area Constrained Applications

Sumit Diware¹, Sharath B. Krishna²
¹,² VLSI Design Tools and Technology, Indian Institute of Technology Delhi,
Hauz Khas, New Delhi 110016, India
¹ [email protected]
² [email protected]

Abstract: Sorting is an important computational task needed in almost all modern data processing applications. Insertion sort is one of the simplest sorting algorithms. However, a sequential implementation of insertion sort has time complexity $O(n^2)$, which makes it inefficient, so it is often passed over for many applications. This paper explores an insertion sort implementation in VHDL using the parallel shift sort technique, which results in linear time complexity $O(n)$. The designed model is further optimized for operation at higher data rates. An iterative design using the optimized model is then implemented on a Xilinx Spartan-6 FPGA; it uses in-place computations and allows processing of large data sets with fewer hardware resources. This makes the iterative design well suited to area constrained applications which operate in a dynamic input environment with fixed hardware, such as real time sensor data processing.

Keywords: Insertion sort; VHDL; iterative; FPGA; linear time.

I. INTRODUCTION

Consider five unsorted numbers 21,19,24,15,16. These numbers are to be sorted in ascending order using insertion sort[1]. The execution of insertion sort can be demonstrated with these inputs as follows:

Iteration 1: Compare the first two elements. Since 21>19, swap their positions. We get 19,21,24,15,16.

Iteration 2: Compare the third element with all elements before it (first and second, which are already sorted in ascending order) and insert it at the appropriate place among the three elements to achieve an ascending order array. As 19<24 and 21<24, 24 stays at the same place. We get 19,21,24,15,16.

Iteration 3: Compare the next element with all elements before it (first, second and third, which are already sorted in ascending order) and insert it at the appropriate place among these elements to achieve an ascending order array. Since 19>15, 21>15 and 24>15, we get 15,19,21,24,16.

Iteration 4: Compare the next element with all elements before it (first, second, third and fourth, which are already sorted in ascending order) and insert it at the appropriate place among these elements to achieve an ascending order array. Since 19>16, 21>16 and 24>16 but 15<16, we get 15,16,19,21,24. This is the final sorted array.

The hardware adaptation of this algorithm is called Parallel Shift Sort (PSS)[2]. PSS inserts the incoming elements at their proper positions to get the sorted array, as shown in Fig. 1. We have implemented a ‘Fill and Flush’ version of parallel shift sort.

Fig. 1. Parallel shift sort (PSS) algorithm

Recent works regarding merge sort were explored to get better design insights[3][4][5]. The entire design is implemented in VHDL[6]. The transition from the sequential domain to the parallel domain is achieved by designing a hardware processing element, described in the next section.

II. THE PROCESSING ELEMENT

Pseudo-code for a sequential execution loop for ascending order insertion sort is (L: total number of inputs to be sorted, contained in an array S indexed from 0 to L-1):

for m=1 to L-1
    n=m
    while n > 0 and S(n-1) > S(n)
        swap S(n-1) and S(n)
        n = n-1
    end while loop
end for loop
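The sequential insertion sort traced in the Introduction can be checked with a short software sketch. The Python below is only an illustration of the swap-based loop, not part of the VHDL design; the function name `insertion_sort_trace` is hypothetical, and the array is assumed to be indexed from 0.

```python
def insertion_sort_trace(values):
    """Ascending insertion sort; returns the array state after each outer iteration."""
    s = list(values)                      # work on a copy, indexed 0..L-1
    states = []
    for m in range(1, len(s)):            # outer loop: element m is inserted
        n = m
        while n > 0 and s[n - 1] > s[n]:  # inner loop: move left by swapping
            s[n - 1], s[n] = s[n], s[n - 1]
            n -= 1
        states.append(list(s))            # snapshot after this iteration
    return states


if __name__ == "__main__":
    for i, state in enumerate(insertion_sort_trace([21, 19, 24, 15, 16]), start=1):
        print(f"Iteration {i}: {state}")
```

Run on 21,19,24,15,16, the four snapshots correspond to Iterations 1 to 4 above. The nested loops are what give the sequential version its quadratic worst case.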

978-1-5090-3012-5/17/$31.00 ©2017 IEEE

2017 6th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), Sep. 20-22, 2017,
AIIT, Amity University Uttar Pradesh, Noida, India

For hardware implementation, the inner loop is realized using a processing element that consists of a comparator, which generates the swapping trigger, and two storage registers storing the swapped or retained results. The outer loop is emulated by a cascade of such processing elements. The sorting operation is thus mapped onto concurrent hardware processes.

Fig. 2. Basic processing element (PE)

Fig. 3. Chain of processing elements for sorting

The generalised ‘Fill and Flush’ PSS algorithm using the processing element (PE) for 4 bit numbers is:

1. Initialize all the PEs in the cascade chain with ‘0’ (smallest hexadecimal number) for ascending sorting and with ‘F’ (largest hexadecimal number) for descending sorting. (‘Fill’ stage)

2. The PE retains the larger number on comparison for ascending sorting, and the smaller number on comparison for descending sorting, in its internal storage.

3. The other number is propagated across the PE chain by means of the output storage.

4. The PE chain and the internal registers are successively updated by the consecutive comparison and storage operations. This propagation and updating puts each incoming number at its proper position in the chain.

5. After the last number to be sorted has been given to the sorter, the next successive inputs should be ‘F’s (largest hexadecimal number) to flush the sorted numbers out of the sorter for ascending sorting; the same is achieved with successive ‘0’s (smallest hexadecimal number) for descending sorting. The flushed out data is the sorted output. (‘Flush’ stage)

6. The flushing out is needed as we may need to store the sorted elements in some memory locations or to give them as input to the next hardware unit to perform further tasks.

7. For N bit inputs, in place of 0 (hex) use the N bit smallest number (N 0s) and in place of F use the N bit largest number (N 1s).

This algorithm is demonstrated in Fig. 4 to sort three numbers 5,3,4 in ascending order. The squares represent processing elements. The input to a processing element is represented by an arrow entering the square from its left side, and the output to the next stage, taken from the output storage of the processing element, is represented by an arrow coming out of the square on its right side. Cascade connections are made to form the chain.

Fig. 4. Sorting using ‘Fill and Flush’ PSS technique

III. MODIFIED PROCESSING ELEMENT

Input to the processing element comes from a memory in which the unsorted array is stored. The previous demonstration was given using a memory which reads or writes at the positive clock edge, with internal and output storage registers that are also positive edge triggered. This reduces the complexity from $N^2$ cycles to 3N cycles. Hardware implementations of merge sort have presented various optimization and acceleration methodologies[7]–[9]. In a similar manner, we have attempted to improve this timing efficiency further with the following two key modifications in the processing element design:

1. The internal storage register works on the opposite clock edge to that of the memory. In our entire design, the memory is activated on the positive clock edge and the internal storage register is negative edge triggered.

2. The output storage register is removed. It is needed only if there is a chance that incoming data would be lost before the current clock cycle is complete (e.g. data coming over a serial link with no buffer storage). Since we are getting data from the memory, and the memory location does not change until the next clock edge, the output register is redundant in our case. However, if data comes from a source that may not keep it steady for one cycle, the output register would be required for intermediate storage.

Fig. 5. Modified processing element

Removal of the output storage register makes result propagation in the chain faster, and negative edge triggering makes the comparison and storage processing faster. This speeds up the multiple dependent comparisons in the chain, reducing the sorting duration to 2N cycles.

This is demonstrated in Fig. 6 to sort three numbers 5,3,4 in ascending order, with arrows and squares having the same meaning as in the previous section:

Fig. 6. ‘Fill and Flush’ PSS sorting with modified processing elements

Assume PE1 and PE2 are similar processing elements connected in the chain in Fig. 7 with the same clock input. Let $T_{pc}$ be the propagation delay of the combinational comparison and selection block and $T_{ps}$ be the propagation delay of the sequential internal storage block.

In order for both blocks to keep the comparison data ready before the data input arrives, the following constraint is imposed on the clock period $T_{clk}$:

$T_{clk} \geq T_{pc} + T_{ps}$

Fig. 7. Sorter topology analysis

IV. ITERATIVE SORTING

Assume the sorter to be working in a dynamic input environment where the number of inputs to be sorted is not always constant. The sorter has a number of processing elements (P) equal to the number of inputs to be sorted (N). Different scenarios arising in this system can be:

1. N<P : The sorter will give correct sorted output. This would, however, result in reduced area and timing efficiency, wasting (P-N) processing elements and (2P-2N) clock cycles.

2. N=P : The sorter will give correct sorted output. The area and timing efficiency are optimal in this case.

3. N>P : The sorter cannot operate properly in this case. Out of the N inputs it can partially sort only P inputs, and the remaining N-P inputs are left unsorted.
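The three scenarios can be observed with a small functional model of the ascending ‘Fill and Flush’ chain. The Python below is a sketch under stated assumptions: it models only the compare-and-retain behaviour of each PE (every input ripples through the whole chain in one step), it does not model clock edges or cycle counts, and the names `ProcessingElement` and `fill_and_flush` are hypothetical.

```python
class ProcessingElement:
    """Functional model of one ascending-sort PE: a comparator plus one stored value."""

    def __init__(self, initial):
        self.stored = initial

    def step(self, incoming):
        """Keep the larger value internally, forward the smaller one to the next PE."""
        if incoming > self.stored:
            self.stored, incoming = incoming, self.stored
        return incoming


def fill_and_flush(values, num_pes, smallest=0x0, largest=0xF):
    """One ascending pass of 4 bit values through a chain of num_pes PEs."""
    chain = [ProcessingElement(smallest) for _ in range(num_pes)]  # 'Fill' stage
    outputs = []
    for x in list(values) + [largest] * num_pes:  # data, then 'Flush' values
        for pe in chain:
            x = pe.step(x)
        outputs.append(x)
    return outputs[num_pes:]  # the first P outputs are only the initial values


if __name__ == "__main__":
    print(fill_and_flush([5, 3], 3))           # N < P
    print(fill_and_flush([5, 3, 4], 3))        # N = P
    print(fill_and_flush([5, 3, 8, 2, 1], 3))  # N > P
```

In the N>P call, the three largest inputs leave the chain in sorted order at the end of the output, while the two inputs that never found a place in the chain come out first in their arrival order.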

The normal sorter cannot directly process the N>P scenario. However, the problem can be solved by coupling this sorter with a control circuit such that the resulting system partially sorts P inputs out of the total N and repeats the partial sorting for N/P iterations. When N/P is not an integer, the minimum number of iterations is the ceiling of N/P, $\lceil N/P \rceil$.
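The repeated partial sorting can be sketched in software as a loop that feeds the output of one pass back in as the input of the next. This is a functional illustration only, assuming an ascending sorter for 4 bit values; `pss_pass` and `iterative_sort` are hypothetical names, and the RAM and control circuit are replaced by a plain Python list.

```python
import math


def pss_pass(values, num_pes, smallest, largest):
    """One ascending 'Fill and Flush' pass (functional model, no clock timing)."""
    pes = [smallest] * num_pes                    # 'Fill' stage
    out = []
    for x in list(values) + [largest] * num_pes:  # data, then 'Flush' values
        for i in range(num_pes):
            if x > pes[i]:                        # PE keeps the larger value
                pes[i], x = x, pes[i]
        out.append(x)
    return out[num_pes:]                          # drop the initial values


def iterative_sort(values, num_pes, smallest=0x0, largest=0xF):
    """Repeat the partial sort ceil(N/P) times, writing each result back in place."""
    data = list(values)
    iterations = math.ceil(len(data) / num_pes)
    for _ in range(iterations):
        data = pss_pass(data, num_pes, smallest, largest)
    return data, iterations


if __name__ == "__main__":
    print(iterative_sort([9, 7, 4, 2, 1], num_pes=2))
```

With N=5 and P=2 the ratio is not an integer, so the sketch runs three iterations. Each pass moves at least P more elements into their final positions at the sorted end of the list.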


Consider the following block diagram of the iterative sorting technique:

Fig. 8. Iterative sorting technique

1. The sorting module contains P processing elements and reads the numbers (READ DATA) from the dual port RAM addresses provided by the control circuit via the signal READ ADDR.

2. In every iteration, it sorts P numbers out of the unsorted elements (out of total N), giving partially sorted data.

3. After each iteration of the sorting process, the sorter output (partially sorted data for the iteration) is stored back (WRITE DATA) in the dual port RAM, starting from the first address from which the numbers are read. The WRITE ADDR signal provides this write back address.

4. The WRITE ENABLE signal avoids conflicts which may arise due to an attempt to simultaneously read and write to the same RAM address. It also provides synchronization, as we initially have to wait a few cycles until useful data gets flushed out of the sorting module. This is because the values initialized in the processing elements, which do not carry useful information, are flushed out first.

5. After N numbers have been processed once by the sorter, one iteration ends and the address is reset to the initial (starting) storage address of the RAM. This also triggers the CLEAR signal, which reinitializes the processing elements for the next sorting iteration.

6. The partially sorted data is fed to the sorter for the second iteration and the process is repeated for N/P iterations.

7. The RESET signal is the hard-reset for the entire control circuit. DISPLAY RESET is the hard-reset for RAM addresses while displaying sorted numbers after the sorting process is complete.

State diagram for control circuit is as shown:

The flag ITERRESET is used to control the state diagram. It monitors the number of iterations that have occurred in the system, and thereby also keeps track of whether the sorting process is complete.

Fig. 9. Iterative sorting state diagram

ITERRESET = 1 means N/P iterations are over and the sorting is complete. The system goes to state 2 if it was initially in state 1; if it was already in state 2, it remains in state 2, displaying the sorted numbers.

ITERRESET = 0 means N/P iterations are not over and the sorting continues. The system goes to state 1 if it was initially in state 2, thus starting a fresh sorting process; if it was already in state 1, it remains in state 1, thereby advancing to the next sorting iteration.

Internal details of the control circuit for the iterative sorting process are described with the help of the following block diagram (clock inputs are implicitly assumed for all sequential blocks):

Fig. 10. Iterative sorting control circuit

1. During sorting, we need the N numbers to be sorted and P flushing numbers (the largest/smallest n bit number) for flushing the sorted data out of the processing elements. Hence, for the sorting process, the read address is provided by a counter C1 that counts up to N+P addresses.

2. The sorting module flushes the initialization data out of the processing elements for the first P cycles. Hence, writing useful data at the initial address should begin P cycles after the reading has begun, i.e. the (P+1)th data is the first useful data from the sorter. Hence, for the initial P cycles, the write address is held at the initial read address (in our case ‘00000’). After P cycles, the write address is obtained by subtracting (P+1) from the read address C1 provided by the counter, mapping the first useful write operation to the initial read address (‘0000’).

3. Write back to the RAM is allowed by activating the WRITE ENABLE signal only if a read cycle is going on for N+P numbers (counter C1 has not overflowed, indicated by OVERFLOW = 0) and ITERRESET = 0, meaning that the N/P iterations are not over. In any other case it is prohibited. This provides synchronization and protection from read/write conflicts when the address pointed to by counter C1 is reset to the initial one.

4. When counter C1 overflows, OVERFLOW is set to 1, meaning one read cycle has ended and the next sorting iteration is about to begin. To make the sorter ready, it has to be properly initialized in the ‘Fill’ stage. This is done by the CLEAR signal. The OVERFLOW signal selects the proper value for CLEAR.

5. When the sorting is over, we don’t need the flushing numbers and simply have to display the N sorted numbers. Hence, when the sorting iterations are over, as indicated by ITERRESET = 1, the read address is given by counter C2, which simply counts up to N instead of N+P.

6. DISPLAY RESET can be used to reset the display address after sorting is complete, and RESET can be used to reset the read addresses during the sorting process.

E.g. consider twelve hexadecimal numbers to be sorted in descending order, 2,4,6,8,A,C,E,10,12,14,16,18, with 4 processing elements. (Worst case input test for a descending sorter = all the inputs in ascending order)

Iteration 1 Output : A,C,E,10,12,14,16,18,8,6,4,2
It sorted the smallest 4 numbers out of 12 in descending order: 8,6,4,2. The remaining 8 are unsorted.

Iteration 2 Output : 12,14,16,18,10,E,C,A,8,6,4,2
It sorted the smallest 4 numbers out of the 8 unsorted ones in descending order: 10,E,C,A. The remaining 4 are unsorted.

Iteration 3 Output : 18,16,14,12,10,E,C,A,8,6,4,2
It sorted the smallest 4 numbers out of the 4 unsorted ones in descending order: 18,16,14,12. Nothing is left to sort.

The iterative solution is implemented with dual port RAM, where simultaneous read and write can be done provided they operate on different memory locations. The same assembly can be implemented using single port RAM, which allows only a single read or a single write at a time. However, it is not resource efficient, as it needs two RAMs instead of the single RAM of the dual port case. For every iteration, one RAM provides the input numbers to the sorter and the other RAM stores the partially sorted data coming out of the sorter. Their roles are then reversed for the next iteration. This needs a bidirectional linkage of the sorting module to the two RAMs, making the control circuit more complicated, thereby increasing area, increasing the length of the critical path in the design and reducing the execution speed. Hence, the faster and more area efficient system with dual port RAM has been chosen for implementation.

V. SIMULATION RESULTS AND ANALYSIS

Fig. 11 shows the simulation for ascending order sorting of twelve unsorted hexadecimal input numbers by a chain of twelve processing elements.

Inputs to be sorted come from the signal temp1[3:0] in the sequence 8,2,3,4,7,9,6,5,C,B,E,A, and the correct sorted output is given by the signal dout[3:0] as the sequence 2,3,4,5,6,7,8,9,A,B,C,E.
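Both the N=P ascending case and the twelve-number descending example with 4 processing elements can be reproduced with a functional software model. The sketch below is not the VHDL design: it ignores clock timing, RAM addressing and the control signals, the name `pss_pass` is hypothetical, and the 8 bit data width is an assumption chosen only so that values up to hexadecimal 18 fit.

```python
def pss_pass(values, num_pes, descending=False, width_bits=8):
    """One 'Fill and Flush' pass through num_pes PEs (functional model only)."""
    smallest, largest = 0, (1 << width_bits) - 1
    # Ascending: fill with the smallest value, flush with the largest.
    # Descending: fill with the largest value, flush with the smallest.
    init, flush = (largest, smallest) if descending else (smallest, largest)
    pes = [init] * num_pes
    out = []
    for x in list(values) + [flush] * num_pes:
        for i in range(num_pes):
            keep_incoming = x < pes[i] if descending else x > pes[i]
            if keep_incoming:                 # PE retains x, forwards its old value
                pes[i], x = x, pes[i]
        out.append(x)
    return out[num_pes:]                      # skip the flushed-out initial values


if __name__ == "__main__":
    asc_in = [0x8, 0x2, 0x3, 0x4, 0x7, 0x9, 0x6, 0x5, 0xC, 0xB, 0xE, 0xA]
    print([format(v, "X") for v in pss_pass(asc_in, 12)])

    data = [0x2, 0x4, 0x6, 0x8, 0xA, 0xC, 0xE, 0x10, 0x12, 0x14, 0x16, 0x18]
    for it in range(1, 4):                    # N/P = 12/4 = 3 iterations
        data = pss_pass(data, 4, descending=True)
        print(f"Iteration {it}:", ",".join(format(v, "X") for v in data))
```

The three printed iterations can be compared line by line with the Iteration 1 to 3 outputs listed in Section IV.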

Fig. 11. Sorting of 12 numbers by a chain of 12 processing elements

The next set of simulations illustrates descending order sorting of 12 hexadecimal input numbers by a chain of 4 processing elements. The input numbers are 2,4,6,8,A,C,E,10,12,14,16,18.

Fig. 12 shows the output of the first sorting iteration being carried out; it matches the expected iteration output A,C,E,10,12,14,16,18,8,6,4,2.

Fig. 12. Sorting 12 numbers with 4 processing elements (First iteration)

Fig. 13 shows the output of the second sorting iteration being carried out; it matches the expected iteration output 12,14,16,18,10,E,C,A,8,6,4,2.

Fig. 13. Sorting 12 numbers with 4 processing elements (Second iteration)

Fig. 14 shows the output of the third sorting iteration being carried out; it matches the expected final output 18,16,14,12,10,E,C,A,8,6,4,2.

Fig. 14. Sorting 12 numbers with 4 processing elements (Third iteration)

The three iterative simulations showed that sorting 12 numbers with 4 processing elements needed 51 clock cycles. This can be explained as follows: every iteration of sorting takes P+N cycles to partially sort the numbers and flush them out to the memory. However, in the hardware implementation, considering the various combinational blocks working in the control circuit and to avoid dual port RAM read-write conflicts, an extra clock cycle is needed for synchronization. Thus, one iteration needs N+P+1 cycles in total, and the total number of cycles needed is $(N+P+1) \times (N/P)$. Substituting N=12 and P=4 in our case yields $(12+4+1) \times (12/4) = 51$ cycles, as verified by simulation.

Thus, irrespective of ascending or descending order sorting, and irrespective of best case or worst case input, the number of cycles required by the iterative sorter is given by $(N+P+1) \times (N/P) = (N^2+N+NP)/P$. For large N and small P, $N^2 \gg N$ and $N^2 \gg NP$. Hence, the number of cycles needed would be approximately $N^2/P$.

Sequential implementation of insertion sort would be equivalent to using P = 1 in the iterative system; it would require $(N+2) \times N = N^2+2N$ cycles, which can be approximated by $N^2$ cycles for large N.

This shows that, for N much larger than P, the iterative sorter is approximately P times faster than the sequential implementation. Ideally, sorting of N numbers would need N processing elements. However, even with (N/P) times fewer hardware resources, as shown in TABLE I, for $N = 10^5$ the iterative sorter gives roughly P times better timing performance (takes about P times fewer execution cycles) than the sequential execution, which would need about $10^{10}$ clock cycles for sorting.

TABLE I
Execution Time Performance ($N = 10^5$)

Number of processing elements (P)    Iterative sorter execution time (clock cycles)
10                                   1.00011 × 10^9
100                                  1.00101 × 10^8
1000                                 1.01001 × 10^7
10000                                1.10001 × 10^6
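The cycle-count expression can be evaluated directly to check the 51-cycle simulation result and the entries of TABLE I. The short Python sketch below uses hypothetical function names; taking the ceiling of N/P for non-integer ratios follows the iteration count stated in Section IV and is otherwise an assumption.

```python
def iterative_cycles(n, p):
    """(N + P + 1) cycles per iteration, repeated for ceil(N / P) iterations."""
    iterations = -(-n // p)          # integer ceiling of N / P
    return (n + p + 1) * iterations


def sequential_cycles(n):
    """Sequential insertion sort modelled as the P = 1 case: (N + 2) * N cycles."""
    return iterative_cycles(n, 1)


if __name__ == "__main__":
    print("N=12, P=4:", iterative_cycles(12, 4))
    n = 10**5
    for p in (10, 100, 1000, 10000):
        cycles = iterative_cycles(n, p)
        speedup = sequential_cycles(n) / cycles
        print(f"P={p}: {cycles:.5e} cycles, speed-up {speedup:.1f}")
```

The printed speed-up stays close to P while P is small compared with N and falls below P as P grows, since the N+P+1 term per iteration is then no longer dominated by N.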


TABLE II shows that FPGA resource utilization is almost independent of the number of inputs to be sorted. This confirms that fixed sorter hardware is able to handle a varying number of inputs.

TABLE II
FPGA Resource Utilization (P = 10)

Number of inputs to be sorted (N)    Registers    LUTs
100                                  126          1102
1000                                 127          1106
10000                                129          1108
100000                               132          1109

VI. CONCLUSIONS

In this paper we have implemented the ‘Fill and Flush’ version of the parallel shift sort technique. A generic design was successfully synthesized in Xilinx ISE and the prototype was tested on a Xilinx Spartan-6 (XC6SLX45) FPGA. The initial (N=P) implementation with optimized hardware resulted in linear time complexity, needing 2N clock cycles for sorting. Since the sorter will have a limited area allocated on the chip, P will be constant. However, in a dynamic input environment where N is not constrained, for the N>P case we can’t increase P to make P=N due to this area constraint. This issue was resolved by extending the basic N=P sorter to an iterative implementation, which allows fixed hardware to handle the dynamic input scenarios. The iterative sorting operation resulted in an execution time of approximately $N^2/P$ cycles, thus still giving roughly P times faster performance than the sequential implementation while using N/P times fewer processing elements than the basic hardware implementation.

Time (execution cycles) and space (resource utilization) analysis of the iterative implementation confirms its faster performance and its ability to handle dynamically changing input scenarios. These key advantages make the iterative sorter a strong option for applications which have strict area and resource constraints but are required to operate in a dynamic input environment, e.g. real time processing of data sets obtained from a sensor module.

REFERENCES

[1] D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed. Redwood City, CA, USA: Addison Wesley Longman Publishing Co., 1998.
[2] K. Ø. Arisland, A. C. Aasbø, and A. Nundal, “VLSI parallel shift sort algorithm and design,” Integration, the VLSI Journal, vol. 2, no. 4, pp. 331–347, 1984.
[3] D. Koch and J. Torresen, “FPGASort: A High Performance Sorting Architecture Exploiting Run-time Reconfiguration on FPGAs for Large Problem Sorting,” in Proceedings of the 19th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2011, pp. 45–54.
[4] H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger, “A novel sorting algorithm for many-core architectures based on adaptive bitonic sort,” in Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012, vol. 1, pp. 227–237.
[5] R. Kobayashi and K. Kise, “FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems,” in Proceedings of the 9th Annual IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015, pp. 49–56.
[6] Z. Navabi, VHDL: Analysis and Modeling of Digital Systems, 2nd ed. New York: McGraw-Hill, 1998.
[7] W. Song, D. Koch, M. Lujan, and J. Garside, “Parallel Hardware Merge Sorter,” in Proceedings of the 24th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines, 2016, pp. 95–102.
[8] A. Srivastava, R. Chen, V. K. Prasanna, and C. Chelmis, “A Hybrid Design for High Performance Large-Scale Sorting on FPGA,” in Proceedings of the 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2015, pp. 1–6.
[9] N. Matsumoto, K. Nakano, and Y. Ito, “Optimal Parallel Hardware K-Sorter and Top K-Sorter, with FPGA Implementations,” in Proceedings of the 14th Annual International Symposium on Parallel and Distributed Computing, 2015, pp. 138–147.
