
Resolve: Generation of High-Performance Sorting Architectures from High-Level Synthesis

Janarbek Matai* , Dustin Richmond* , Dajung Lee† , Zac Blair* , Qiongzhi Wu* , Amin Abazari* ,
and Ryan Kastner*
*Computer Science and Engineering, †Electrical and Computer Engineering
University of California, San Diego, La Jolla, CA 92093, United States
{jmatai, drichmond, dal064, zblair, qiw035, maabazari, kastner}@ucsd.edu

ABSTRACT
Field Programmable Gate Array (FPGA) implementations of sorting algorithms have proven to be efficient, but existing implementations lack portability and maintainability because they are written in low-level hardware description languages that require substantial domain expertise to develop and maintain. To address this problem, we develop a framework that generates sorting architectures for different requirements (speed, area, power, etc.). Our framework provides ten highly optimized basic sorting architectures, easily composes basic architectures to generate hybrid sorting architectures, enables non-hardware experts to quickly design efficient hardware sorters, and facilitates the development of customized heterogeneous FPGA/CPU sorting systems. Experimental results show that our framework generates architectures that perform at least as well as existing RTL implementations for arrays smaller than 16K elements, and are comparable to RTL implementations for sorting larger arrays. We demonstrate a prototype of an end-to-end system using our sorting architectures for large arrays (16K-130K) on a heterogeneous FPGA/CPU system.

1. INTRODUCTION
Sorting is an important, widely studied algorithmic problem [13] that is applicable to nearly every field of computation: data processing and databases [6, 11, 20], data compression [5], distributed computing [9], image processing, and computer graphics [4, 15]. Each application domain has unique requirements. For example, text data compression applications require sorting arrays with a few hundred elements. MapReduce sorts millions of elements. Database applications sort both large and small arrays.
The importance of sorting has led to the development and study of parallel sorting algorithms [1] on CPUs [8], GPUs [22], and FPGAs [14]. Each platform has its advantages. CPUs are relatively easy to program, but often lack performance compared to their GPU and FPGA counterparts. GPUs are more difficult to program than CPUs, but they provide high performance. FPGAs typically provide the best performance per Watt compared to CPUs and GPUs, but they are the most difficult to develop.
Designing efficient sorting applications using FPGAs is difficult because it requires substantial domain specific knowledge about hardware, the underlying FPGA architecture, and the compiler tools. High-level synthesis (HLS) tools aim to improve the accessibility of FPGAs by minimizing the required domain specific knowledge, raising the level of programming abstraction and thereby increasing productivity. Unfortunately, HLS is not a panacea. As reported in previous works, HLS generates efficient hardware only when the input code is written in a specific coding style [10, 17], which we call restructured code. Therefore, creating optimized hardware using HLS still requires an intimate understanding of the underlying hardware architecture and knowledge of how to effectively utilize the HLS tools.

Figure 1: The Resolve sorting framework. User constraints (size, performance, area, ...) drive a sorting architecture generator that composes optimized sorting algorithms and optimized sorting elements (e.g., insertion cell, FIFO merge unit, prefix sum, smart cell, radix) into a customized sorting architecture.

In this paper, we develop a framework that generates high performance sorting architectures by composing basic sorting architectures implemented with optimized HLS primitives. This concept is shown in Figure 1. We note that this is similar to the std::sort routine found in the standard template library (STL), which selects a specific sorting algorithm from a pool of sorting algorithms. For example, STL uses insertion sort for small lists (fewer than 15 elements), and then switches to merge sort for larger lists. We believe a routine like std::sort for HLS is important to facilitate FPGA designs for non-hardware experts. Our framework uses RIFFA [12] to integrate sorting cores into a fully functional heterogeneous CPU/FPGA sorting system. The result is a system that minimizes the knowledge required to design high performance sorting architectures for an FPGA.
The specific contributions of this paper are:
1. The design and implementation of highly optimized sorting primitives and basic sorting algorithms.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
FPGA '16, February 21-23, 2016, Monterey, CA, USA.
© 2016 ACM. ISBN 978-1-4503-3856-1/16/02...$15.00
DOI: http://dx.doi.org/10.1145/2847263.2847268
2. A framework to generate hybrid sorting architectures by composing these basic primitives.
3. A comparison of these generated sorting architectures with other sorting architectures implemented on an FPGA.
4. Integration with RIFFA [12] to demonstrate a full end-to-end sorting system.
This paper is organized as follows: Section 2 provides a case study of insertion sort to demonstrate HLS optimizations. Section 3 discusses related work. Section 4 describes the optimization of standard sorting primitives, and how to use them to create efficient architectures for ten basic sorting algorithms. Section 5 presents our Resolve framework. Section 6 provides experimental results. We conclude in Section 7.

2. CASE STUDY: INSERTION SORT
Listing 1 shows a common implementation of the insertion sort algorithm. Implementing this directly using a high-level synthesis (HLS) tool would not produce an efficient architecture. We must optimize it specifically for a hardware implementation.

    void InsertionSort(int array[n])
    {
    L1:
      int i, j, index;
      for (i = 1; i < n; i++)
      {
    L2:
        index = array[i];
        j = i;
        while ((j > 0) && (array[j-1] > index))
        {
    L3:
          array[j] = array[j-1];
          j--;
        }
        array[j] = index;
      }
    }

Listing 1: Typical source code for insertion sort. This does not create an optimized architecture using HLS tools.

HLS tools typically provide optimization directives that are embedded in the input source code as pragmas. Throughout this work, we use semantics specific to the Xilinx Vivado HLS tool; however, these ideas are generally applicable to other HLS tools. Some common optimization directives are pipeline, which exploits instruction level parallelism, unroll, which vectorizes loops, and partition, which divides arrays into multiple memories. We denote three potential locations for these directives: L1, L2, and L3. For example, we can direct the HLS tool to exploit instruction level parallelism by applying the pipeline pragma to the body of the inner loop at point L3; similarly, we can apply other HLS optimizations at L1, L2, and L3. Unfortunately, as we will shortly see, designers cannot rely on these directives alone, and must often write special code, which we call restructured code, to generate the best results. This restructured code requires substantial hardware design expertise [10, 17].
Table 1 presents the initiation interval (II), achieved clock period, and utilization (slices) for five different directive-based optimizations at the locations L1, L2, and L3. Design 6 is a restructured implementation, i.e., it completely refactors the code with an eye towards an HLS style of coding. We discuss Design 6 in more detail in Section 4.

Table 1: Case study for insertion sort optimization in HLS

      Optimizations                               II     Period  Slices  Category
    1 L3: pipeline II=1                           661    3.75    29      slow/small
    2 L3: unroll factor=2, cyclic partition       730    3.84    112     slow/small
      array by factor=2
    3 L2: pipeline II=1                           1194   3.06    47      slow/small
    4 L2: unroll factor=2 and cyclic partition    1193   3.50    144     slow/small
      array by factor=2
    5 L1: pipeline II=1 and complete partition    1      440.85  27291   faster/huge
      array
    6 Code restructuring                          64     2.90    374     fastest/small

We categorize the performance and area results from Table 1 into three groups: 1) slow/small, 2) faster/huge, and 3) fastest/small. The ideal design from HLS would be fast with small area. The first four designs are very slow and have small area. Design 5 achieves higher performance (II=1, but a very large clock period) with unrealistically large area due to aggressive HLS optimizations. Design 6 is hand written by an expert HLS designer to create an optimal architecture; it achieves the highest performance with small area.
This case study demonstrates several concepts. First, writing efficient HLS code requires that the designer understand hardware concepts like unrolling and partitioning. Second, the HLS designer must be able to diagnose any throughput problems, which requires substantial HLS tool knowledge. Third, and most importantly, achieving the best results – high performance and low area – typically requires rewriting the software-centric code to create an efficient hardware architecture.
The aim of this work is to make it easy to design optimized sorting algorithms (like that of Design 6) from higher-level languages by providing a framework of optimized sorting algorithms in HLS. This requires several steps: 1) understand the sorting algorithms, 2) study existing hardware implementations (often written in register transfer level Verilog or VHDL), and 3) modify the sorting algorithms to synthesize optimally to the FPGA. In the remainder of this paper, we address each of these issues.

3. RELATED WORK
There are two main bodies of past work related to this paper: hardware sorting architectures and high-level synthesis code generation.
Hardware sorting architectures: The first body of work focuses on implementing hardware sorters (usually a single algorithm) on an FPGA. There are a variety of published works exploring sorting architectures on FPGA platforms. Several works have implemented a single sorting algorithm on an FPGA [3, 7, 16, 18, 21, 23], and some have explored high performance sorting of large inputs [6, 7, 14].
All of the above work focuses on designing a specific hardware architecture for a particular algorithm. Our work enables the user to generate a vast number of different sorting architectures from high-level languages without writing low-level code, and it does this automatically, where previous works have used low-level hardware description languages. Our framework allows full parameterization, the composition of hybrid architectures from multiple algorithms, and the ability to perform quick design space exploration. Finally, the sorting architectures generated from our work can be integrated with RIFFA to provide an end-to-end system.
There are also a few works that study sorting in the context of high-level languages. Arcas-Abella et al. [2] looked at the feasibility of implementing bitonic sort and spatial insertion sorting units using existing HLS tools (BlueSpec, Chisel, LegUP, and OpenCL). This work is similar to ours since it studies the implementation of sorting algorithms using HLS tools. Zuluaga et al. [23] presented a method for generating sorting network architectures from a domain-specific language. At a high level, the use of a domain-specific language seems similar to our architecture-generation approach. There are several main differences between the aforementioned work and ours. First, we study multiple algorithms instead of focusing on a single algorithm. Second, we generate optimized sorting architectures by composing one or more algorithms. Finally, we can address much larger input sizes, and the architectures generated from our work are orders of magnitude better than [23]. Section 6 provides a more detailed comparison of these works and the results generated from our Resolve framework.
HLS code generation: The work by George et al. [10] proposed a domain-specific language based FPGA design flow using existing high-level synthesis tools. This is similar to our approach in allowing non-hardware designers to write code (in their case using Scala) to generate optimized HLS code. Their work targets specific computational patterns. Our work targets a specific domain (sorting) and creates a framework for the user to explore a vast number of different sorting architectures using sorting primitives and basic sorting algorithms.

4. HARDWARE SORTING
Figure 1 shows the structure of our framework. It has three components. Block 1 is a library of optimized, parameterizable sorting primitives; these sorting primitives are the building blocks of our framework. Block 2 represents our basic sorting algorithms; the algorithms use the sorting primitives to implement all of the basic sorting algorithms on an FPGA using high-level synthesis. Block 3 is the sorting architecture generator; here we use the sorting primitives and basic algorithms to generate optimized hybrid sorting architectures that meet different system constraints. The following describes each of these components in more detail.

4.1 Sorting Primitives
This section presents optimized HLS implementations of sorting primitives. Previous works presented a list of several common sorting primitives, e.g., compare-swap, select-value, and a merge unit [14]. After analyzing more common sorting algorithms, we added three more primitives to this list: prefix-sum, histogram, and insertion-cell. Our basic sorting algorithms (presented in Section 4.2) are implemented efficiently in hardware using these six sorting primitives. Figure 2 shows the initial hardware architectures generated from HLS code for our sorting primitives. Section 2 described how restructured HLS code is necessary to generate efficient hardware from HLS. We now present the optimization of prefix sum, merge, and insertion-cell.

Figure 2: Initial hardware architectures of the sorting primitives generated from HLS: a) compare-swap, b) select-value element, c) merge, d) prefix-sum, e) histogram, f) insertion cell.

Prefix Sum: Listing 2 shows "software-style" C code for prefix sum. Even for this simple primitive, we have to restructure the code in non-intuitive ways to produce optimized hardware. First, we apply the unroll and pipeline optimizations to expose data and instruction level parallelism. We also perform cyclic partitioning on the arrays in and out to match the memory access patterns required by unrolling. By pipelining the loop, we expect to get II = 1, and by unrolling, we expect a speedup by a factor of 4. However, the data dependence between out[i-1] and out[i] prevents us from achieving the expected results. Figure 2 (d) shows the hardware architecture for the code in Listing 2. The optimized architecture with an II = 1 is shown in Figure 3 (a), and Listing 3 shows the HLS code for this optimal hardware architecture.

    #pragma PARTITION out cyclic factor=4
    #pragma PARTITION in cyclic factor=4
    for (i = 0; i < SIZE; i++) {
    #pragma UNROLL factor=4
    #pragma PIPELINE
      out[i] = out[i-1] + in[i];
    }

Listing 2: Prefix sum (SW)

    A = 0;
    #pragma PARTITION out cyclic factor=4
    #pragma PARTITION in cyclic factor=4
    for (i = 0; i < SIZE; i++) {
    #pragma UNROLL factor=4
    #pragma PIPELINE
      A = A + in[i];
      out[i] = A;
    }

Listing 3: Prefix sum (HW)

    #pragma HLS DATAFLOW
    // omitted partition
    // pragmas
    stage1(IN, TEMP);
    ...
    stage(TEMP, OUT);

Listing 4: Prefix sum dataflow

    stage1(in, t) {
      for (i = 0; i < SIZE; ++i) {
    #pragma HLS UNROLL factor=4
    #pragma HLS PIPELINE
        t[i] = in[i-1] + in[i];
      }
    }

Listing 5: Prefix sum stages

Figure 3: Optimal hardware architectures for prefix sum and histogram that give II = 1.
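The dependence argument above can be checked outside of HLS. The following is a plain-C++ behavioral model (the function names here are ours, not part of the Resolve code base) of the software-style loop in Listing 2 and the accumulator rewrite in Listing 3; both compute the same result, but only the rewrite confines the loop-carried dependence to a single scalar register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Software-style prefix sum (as in Listing 2): out[i] reads out[i-1],
// a memory read-after-write that blocks II=1 once the loop is unrolled.
std::vector<int> prefix_sum_naive(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = (i == 0 ? 0 : out[i - 1]) + in[i];
    return out;
}

// Restructured prefix sum (as in Listing 3): the running sum lives in
// the scalar A, so the only loop-carried dependence is a register update.
std::vector<int> prefix_sum_restructured(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int A = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        A = A + in[i];
        out[i] = A;
    }
    return out;
}
```

For the input 1, 2, 3, 4 both functions produce the running sums 1, 3, 6, 10.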
As an additional example, we present another optimized HLS block for prefix sum, which implements the reduction pattern. The reduction pattern uses log(n) parallel stages to compute a prefix sum of size n in parallel. The individual stages do not have the data dependence seen in the previous example. Listing 4 shows a high-level prefix sum implementation using the reduction pattern; the stage functions are implementations of the parallel stages without the data dependence. Listing 5 shows the code for the first stage function. Since there is no data dependence, it is straightforward to get a speedup of 4x or more by unrolling and cyclically partitioning as in Listing 5. Multiple optimized versions of a sorting primitive, such as those in Listing 3 and Listing 4, facilitate easy design space exploration with these primitives. For example, the prefix sum in Listing 3 achieves the desired unrolling factor at a reduced frequency, while the prefix sum in Listing 4 with the same unrolling factor achieves a higher frequency.
Merge: The merge primitive combines two sorted arrays of size n/2 into a sorted array of size n. Figure 2 (c) shows the hardware architecture, and Listing 6 shows the HLS implementation of the streaming, FIFO-based merge unit. (Implementing the merge unit with C arrays is straightforward.) Here IN1 and IN2 are the two sorted inputs and OUT is the merged output. The for loop in Line 5 runs n times, where n/2 is the size of IN1 and IN2. It reads one element from either IN1 or IN2 on each iteration and writes it to the output until the end of the FIFO is reached. We pipelined this loop to get an II = 1, so it does one read operation every cycle.

     1 void MergeUnit(hls::stream<int> &IN1,
         hls::stream<int> &IN2, hls::stream<int> &OUT,
         int n) {
     2   int a, b;
     3   int subIndex1 = 1, subIndex2 = 1;
     4   IN1.read(a); IN2.read(b);
     5   for (int i = 0; i < n; i++) {
     6 #pragma HLS PIPELINE
     7     if (subIndex1 == n/2+1) {
     8       OUT.write(b);
     9       IN2.read(b);
    10       subIndex2++;
    11     } else if (subIndex2 == n/2+1) {
    12       OUT.write(a);
    13       IN1.read(a);
    14       subIndex1++;
    15     } else if (a < b) {
    16       OUT.write(a);
    17       IN1.read(a);
    18       subIndex1++;
    19     } else {
    20       OUT.write(b);
    21       IN2.read(b);
    22       subIndex2++;
    23     }
    24   }
    25 }

Listing 6: FIFO based streaming merge primitive.

Insertion Cell: The insertion cell is a hardware sorting primitive for insertion sort algorithms. The hardware architecture has an input, an output, a comparator, and a register; see Figure 2 (f). The insertion-cell compares the current input with the value in the current register; the smaller (or larger, depending on the sort direction) of the two is given as the output.

    T InsertionCell(hls::stream<int> &IN,
        hls::stream<int> &OUT) {
      static int CURR_REG = 0;
      int IN_A = IN.read();
      if (IN_A > CURR_REG) {
        OUT.write(CURR_REG);
        CURR_REG = IN_A;
      } else
        OUT.write(IN_A);
      return CURR_REG;
    }

Listing 7: The code for the sorting primitive insertion-cell.

The code for the insertion-cell is shown in Listing 7. The function takes one input argument IN and one output argument OUT. It uses the hls::stream<> type to indicate that the input and output can use a FIFO interface. The cell holds the previous value in the CURR_REG variable; it must save this value across function calls, and thus declares it as a static variable. The architecture compares the input value to the stored value and outputs the smaller of the two, keeping the larger. The next section shows how to use this primitive to create a linear insertion sort algorithm.

4.2 Sorting Algorithms
In this section, we elaborate on the HLS implementations of four kinds of sorting algorithms: nested loop, recursive, non-comparison, and sorting network. Table 2 summarizes the results of our HLS implementations.

4.2.1 Nested Loop Sorting Algorithms
The selection sort algorithm iteratively finds the minimum element in an array and swaps it with the first element until the list is sorted. This algorithm runs in O(n^2) time, where n is the number of array elements. In HLS, we can pipeline the inner loop to get II = 1, which still gives us O(n^2) time. We can create a better design by sorting from both "sides", i.e., finding the minimum and maximum elements in parallel, which reduces the number of iterations in the outer loop by 2x. This gives us O(n^2/2) time. In general, selection sort does not translate into high performance hardware using HLS; however, it can be used to produce an area-efficient sorting implementation.
The rank sort algorithm sorts by computing the rank of each element in an array, and then inserting each element at its rank index. The rank is the number of elements greater than (or less than) the element to be sorted. Sequential rank sort has a complexity of O(n^2). Rank sort can be fully parallelized in HLS: sorting an array of size n uses n units operating in parallel, each computing the rank of one element. However, this process uses 2 x n^2 storage to sort an array of size n. Rank sort can be useful when designing sorting hardware in HLS because it is a good algorithm for exploring area and performance trade-offs.

Figure 4: Hardware architecture of linear insertion sort: a chain of cells (Cell 1, Cell 2, ..., Cell n), each with comparison logic and enable/flush signals, turns an unsorted input stream into a sorted output stream.

Insertion sort iterates through an input array, maintaining sorted order for every element that it has seen.
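The linear insertion-cell architecture of Figure 4 can be sanity-checked with a plain C++ model (the class and function names are ours, and the sentinel-based flush is an assumption of the model, which requires inputs strictly between INT_MIN and INT_MAX): each cell keeps the largest value it has seen and passes everything smaller downstream, so after streaming n inputs followed by n maximal flush tokens, the last n outputs emerge in sorted order.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Software stand-in for one insertion-cell (cf. Listing 7): the register
// keeps the largest value seen so far; everything smaller is passed on.
// INT_MIN plays the role of an "empty" register.
struct InsertionCellModel {
    int reg = INT_MIN;  // models CURR_REG
    int push(int in) {
        if (in > reg) { int out = reg; reg = in; return out; }
        return in;
    }
};

// Stream n values through a chain of n cells (Figure 4), then flush with
// n maximal sentinels; the last n outputs are the input in sorted order.
std::vector<int> linear_insertion_sort(const std::vector<int>& in) {
    std::vector<InsertionCellModel> cells(in.size());
    std::vector<int> stream;
    auto feed = [&](int v) {
        for (auto& c : cells) v = c.push(v);  // value ripples down the chain
        stream.push_back(v);                  // whatever falls off the end
    };
    for (int v : in) feed(v);
    for (std::size_t i = 0; i < in.size(); ++i) feed(INT_MAX);  // flush
    return std::vector<int>(stream.end() - static_cast<std::ptrdiff_t>(in.size()),
                            stream.end());
}
```

This mirrors why the hardware needs n cells and why the registers must be flushed before the sorted result can be read out.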
Table 2: Sorting algorithm evaluations when implementing them using HLS. n = number of elements to sort; n* = number of insertion sort cells; t* = number of compare-swap elements.

                                                   Parallel HLS Implementation
    Algorithm name          SW Complexity          Parallel tasks  Complexity (II)  Storage       Main Sorting Primitives
    Selection sort          O(n^2)                 2               O(n^2/2)         O(2 x n)      Compare-swap
    Rank sort               O(n^2)                 n               O(n)             O(n^2)        Histogram, Compare-swap
    Bubble sort             O(n^2)                 2               O(2 x n^2)       O(2 x n)      Compare-swap
    Insertion sort          O(n^2)                 -               O(n)             n*            Compare-swap, insertion-cell
    Merge sort              O(n log n)             -               O(n)             O(2 x log n)  Merge unit
    Quick (Sample) sort     O(n log n) or O(n^2)   t               O(n/t log n/t)   O(n x t)      Prefix sum
    Counting sort           O(n x k) (k=3)         3               O(n)             O((k-1)n)     Prefix sum, Histogram
    Radix sort              O(n x k) (k=4)         4               O(n)             O((k-1)n)     Prefix sum, Histogram, Counting sort
    Bitonic sort            -                      t               O(log^2 n)       O(n x t)      Compare-swap
    Odd-even transp. sort   O(n^2)                 t*              O(n^2/t*)        O(t*)         Compare-swap

Insertion sort has a complexity of O(n^2). Listing 1 shows a software-centric HLS implementation of insertion sort, and Section 2 discussed several naive HLS optimizations for it. These used different optimization directives (pragmas) in an attempt to create a better hardware implementation, but the resulting designs (Designs 1-5 in Table 1) were not optimal. Design 6 gives the best result; here we describe its code restructuring optimizations.
An efficient hardware implementation of insertion sort uses a linear array of insertion-cells [2, 3, 16, 21] or a sorting network [19]. Here we focus on a linear insertion sort implementation; we discuss the sorting network implementation later. Figure 4 shows the architecture from Arcas-Abella et al. [2]. In this architecture, a series of cells (insertion-cell primitives) operates in parallel to sort a given array. Each cell compares the current input (IN) with the value in its current register (CURR_REG); the smaller of the two is given as an output to OUT.
Listing 8 shows the source code that represents the hardware architecture in Figure 4. A cascade of insertion-cells is implemented in a pipelined manner using the dataflow pragma and a series of calls to the InsertionCell function from Listing 7. Note that we have four different versions of the function – InsertionCell1, InsertionCell2, etc. It is necessary to replicate the function due to its use of a static variable; each of these functions has the same code as in Listing 7. This implementation achieves O(n) time complexity to sort an array of size n.

    void InsertionSort(hls::stream<T> &IN,
        hls::stream<T> &OUT) {
    #pragma HLS DATAFLOW
      hls::stream<T> out1, out2, out3;
      // Function calls;
      InsertionCell1(IN, out1);
      InsertionCell2(out1, out2);
      InsertionCell3(out2, out3);
      InsertionCell4(out3, OUT);
    }

Listing 8: Insertion sort code for the HLS design based on the hardware architecture in Figure 4. The InsertionCell functions use the code from Listing 7.

4.2.2 Recursive Algorithms
Pure software implementations of merge sort and quick sort are not possible in HLS due to their use of recursion; HLS tools (including Vivado HLS) typically do not allow recursive function calls. Changing from a recursive implementation to one that is synthesizable requires modifying the software implementation to remove the recursive function calls.
Merge sort has two primary tasks. The first task partitions the array into individual elements, and the second merges them. The majority of the work is performed in the merging unit, which is implemented with the merge primitive described in Section 4.1.
Merge sort is implemented in hardware using a merge sorter tree [14] or using odd-even merge sort. Listing 9 provides an outline of the code for a streaming merge sorter tree. In this code, IN1, IN2, IN3, and IN4 are inputs of size n/4, and OUT is an output of size n. MergePrimitive1 and MergePrimitive2 merge two sorted lists of size n/4 and n/2, respectively. Using the dataflow pragma, we can form a functional pipeline across these three functions. Merge sort based on odd-even merge also uses the merge primitive to sort an n-size array with an II of n. Merge sort can be optimized in hardware by running n log n tasks in parallel.

    void CascadeMergeSort(hls::stream<int> &IN1,
        hls::stream<int> &IN2, hls::stream<int> &IN3,
        hls::stream<int> &IN4, hls::stream<int> &OUT) {
    #pragma HLS DATAFLOW
    #pragma HLS stream depth=4 variable=IN1
      for (int i = 0; i < SIZE/4; i++) {
        // read input data
      }
      MergePrimitive1(IN1, IN2, TEMP1);
      MergePrimitive1(IN3, IN4, TEMP2);
      MergePrimitive2(TEMP1, TEMP2, OUT);
    }

Listing 9: FIFO based streaming merge sorter tree.

Quick sort uses a randomly selected pivot to recursively split an array into elements that are larger and smaller than the pivot. After selecting a pivot, all elements smaller than the pivot are moved to its left, i.e., to lower indices in the array. This process is repeated for the left and right sides separately. The software complexity of this algorithm is O(n^2) in the worst case and O(n log n) in the best case. A non-recursive (iterative) version of quick sort can be implemented in HLS, but its performance is slow. Instead, we chose to implement a parallel version of quick sort known as sample sort. In sample sort, we run t tasks that divide the work of pivot_function so that each task sorts n/t elements; the t partial results are then integrated using the prefix sum primitive. Essentially, this implementation sorts an n-size array in O(n) time with higher BRAM usage.
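The merge primitive and the two-level tree of Listing 9 can be cross-checked with a plain C++ behavioral model (std::deque stands in for hls::stream, and the function names are ours, not the framework's): four sorted quarter-size runs are merged into two half-size runs and then into one full sorted output.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Behavioral stand-in for the MergeUnit of Listing 6: pop the smaller
// head of the two sorted FIFOs each "cycle" until both are drained.
std::deque<int> merge_fifos(std::deque<int> a, std::deque<int> b) {
    std::deque<int> out;
    while (!a.empty() || !b.empty()) {
        bool take_a = !a.empty() && (b.empty() || a.front() < b.front());
        if (take_a) { out.push_back(a.front()); a.pop_front(); }
        else        { out.push_back(b.front()); b.pop_front(); }
    }
    return out;
}

// Two-level merge tree in the shape of CascadeMergeSort (Listing 9):
// four sorted n/4-size runs -> two n/2-size runs -> one n-size run.
std::vector<int> merge_tree4(const std::deque<int>& in1, const std::deque<int>& in2,
                             const std::deque<int>& in3, const std::deque<int>& in4) {
    std::deque<int> t1 = merge_fifos(in1, in2);   // first-level merge
    std::deque<int> t2 = merge_fifos(in3, in4);   // first-level merge
    std::deque<int> o  = merge_fifos(t1, t2);     // second-level merge
    return std::vector<int>(o.begin(), o.end());
}
```

In hardware, the dataflow pragma lets the three merge units run as a functional pipeline, so the second-level merge starts consuming elements as soon as the first-level merges begin producing them.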
4.2.3 Non-comparison Based Algorithms
Counting sort has three stages. The first stage computes a histogram of the elements of the unsorted input array. The second stage performs a prefix sum on the histogram from the previous stage. The final stage sorts the array: it reads a value from the unsorted input array, finds that value's next output index from the prefix sum stage, writes the value to the output array, and then increments that index in the prefix sum by one. Figure 5 (a) shows an example of the counting sort algorithm on an 8-element input array. The first stage performs a histogram on the input data; there are only three values (2, 3, 4), and they occur 3, 2, and 3 times in the unsorted input array, respectively. The second stage does a prefix sum across the histogram frequencies, which tells us the starting index for each of the three values: the value 2 starts at index 0, the value 3 starts at index 3, and the value 4 starts at index 5. The final stage uses these prefix sum indices to fill in the sorted array. Parallel counting sort can be designed using function pipelining of the three stages. It runs in O(n) time using O(n x k) (k constant) memory storage.

Figure 5: Example hardware architectures for counting sort and radix sort. a) Counting sort on an 8-element input (4 2 4 4 2 3 3 2): the histogram of the values 2, 3, 4 is (3, 2, 3) and the prefix sum is (0, 3, 5). b) Radix sort: four counting sort stages connected by Mem(n) buffers.

Radix sort works by applying counting sort to each digit of the input data. For example, to sort 32-bit integers, we can apply counting sort four times, once to each of the four 8-bit digits. We can implement a fully parallel radix sort in HLS using functional pipelining of the counting sorts. An individual counting sort operation has a throughput of n, thus the fully parallel radix sort will also have a throughput of n. To store the outputs of the intermediate stages, we need n x k storage; here k is usually 4 for 32-bit numbers or 8 for 64-bit numbers. Thus, to sort 32-bit numbers in parallel, we use 3 x n storage (3 intermediate memory buffers), as shown in Figure 5 (b).

... due to required IO throughput. This requires balancing the parallelism and area in HLS and will be discussed later. For example, using n parallel compare-swap elements, odd-even transposition sort can sort an n-size array in O(n).

5. SORTING ARCHITECTURE GENERATOR
In this section, we describe our framework for generating sorting architectures. A user can perform design space exploration over a range of different application parameters. Once she has decided on a particular architecture, the framework generates a customized sorting architecture that can run out of the box on a heterogeneous CPU/FPGA system; it also creates the RTL code if the user wishes to integrate it into a system in another manner.
The flow of our sorting framework is shown in Figure 8. We define a user constraint as a tuple UC(T, S, B, F, N), where T, S, B, F, and N are the throughput, number of slices, number of block RAMs, frequency, and number of elements to sort. We define V as the set of sorting designs that can perform sorting on an input array of size N. Sorting architecture generation is the problem of finding a design D of the form D(T, S, B, F, N) that satisfies the UC.

    a) RD  ::= RD v1 | RD v2 | RD v3 | RD v4 | RD v5
       BtS ::= BtS v1 | BtS v2 | BtS v3 | BtS v4 | BtS v5

    b) Sort ::= SS n | RS n | BS n | IS n | MS n
              | QS n | RD n | BtS n | OET n | OEM n
              | Merge (Sort, Sort)

    c) match Sort (n, v) ::=
         | SS n   emit SS(v)
         | RS n   emit RS(v)
         | BS n   emit BS(v)
         | IS n   emit IS(v)
         | MS n   emit MS(v)
         | QS n   emit QS(v)
         | RD n   emit RD(v)
         | BtS n  emit BtS(v)
         | OET n  emit OET(v)
         | OEM n  emit OEM(v)

Figure 7: Grammar of the domain-specific language. SS = Selection sort, RS = Rank sort, BS = Bubble sort, IS = Insertion sort, MS = Merge sort, QS = Quick sort, RD = Radix sort, BtS = Bitonic sort, OET = Odd-even transposition sort, OEM = Odd-even merge sort. a) Sorting architectural variants for a particular algorithm, b) Sort function grammar, c) Code generator.
from components import InsertionSort
(a) (b) 1 from components import MergeSort
x0 y0 x0 y0 from components import RadixSort
x1 y1 x1 y1
x2 y2 x2 y2
x3 y3 x3 y3 .... Sorting
x4 y4 x4 y4
x5 y5 x5 y5 Architecture
x6 y6 x6 y6
2 conf = Configuration.Configuration(…) Selection
x7 y7 x7 y7 #sort = RadixSort(10, “RadixSort”, 32, 4)

Figure 6: a) Bitonic sort, b) Odd-even transposition sort Input_array = [1,2,3,..] HLS/


@TopLevel synthesize/
def sort(input_array_a, BW, options=[]): Simulate/P &R
3 //Write python sorting
4.2.4 Sorting networks … ISE/Vivado
Sorting networks [19] is a set of compare-swap primitives #Call
connected by wires. Bubble sort is an instance of a sorting sort(input_array, 32, fastest) bitstream
network. Two examples of sorting networks (bitonic and
odd-even transposition) are shown in Figure 6. For each Figure 8: Design flow of Resolve.
vertical connection, the minimum of two inputs is assigned
to the upper wire and the maximum goes to the lower wire. Our framework is implemented as a small domain-specific
Due to parallel nature of sorting networks, they are easier language. Figure 7 shows simplified grammar of the lan-
to implement in HLS than other sorting algorithms. How- guage. The sorting architectures defined in previous sec-
ever, sorting networks does not scale well in hardware [14] tions are defined by types for instance, RD and IS. Each
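To make the grammar concrete, the Sort terms and the emit step can be modeled in a few lines of Python. The names below (the Leaf class in particular) are our own illustration, not the framework's actual implementation:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    alg: str   # one of SS, RS, BS, IS, MS, QS, RD, BtS, OET, OEM
    n: int     # number of elements this sorter handles

@dataclass
class Merge:   # Merge(Sort, Sort)
    left: "Sort"
    right: "Sort"

Sort = Union[Leaf, Merge]

def emit(s) -> str:
    # Lower a Sort term to a module-instantiation string
    # (a stand-in for the real emit, which generates HLS code).
    if isinstance(s, Leaf):
        return f"{s.alg}_{s.n}"
    return f"merge({emit(s.left)}, {emit(s.right)})"

# Merge(Merge(RD n/4, RD n/4), Merge(RD n/4, RD n/4)) with n = 16384:
quarter = Leaf("RD", 4096)
tree = Merge(Merge(quarter, quarter), Merge(quarter, quarter))
print(emit(tree))  # merge(merge(RD_4096, RD_4096), merge(RD_4096, RD_4096))
```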
Each sorting algorithm has a number of different implementations, called variants. For example, radix sort (RD) has five variants: RD v1, RD v2, RD v3, RD v4, and RD v5. The Sort function can use any sorting algorithm or a composition of one or more algorithms. If we wanted to create an implementation that sorts n elements, we could define it as any of the basic sorting algorithms from Figure 7. For example, SS n creates a selection sort implementation, and BS n uses the bubble sort algorithm for the implementation. If we wish to create a hybrid sorting architecture, we could write Merge(QS n/2, QS n/2), which uses quick sort on the two halves of the input data and merges the results together. The expression Merge(Merge(RD n/4, RD n/4), Merge(RD n/4, RD n/4)) splits the input data into quarters, sorts each quarter, and then merges the results twice. The quarter arrays can be sorted using different sorting algorithms in our framework; in this example, radix sort is used. Based on the Sort expression, the emit function generates a specific variant of the sorting architecture. Thus our framework completely abstracts the underlying architectural details from the user, and allows the user to generate an optimized architecture in a matter of minutes.

To use the framework, the user writes Python code as described in Figure 8. It has three components. Part 1 is a library of template generator classes for the existing sorting algorithms (e.g., InsertionSort, MergeSort). There are currently eleven classes, some with multiple architecture variants. All of these classes inherit from a base class called Sorting, which provides common class methods and members (e.g., size, bit width) for all the sorting algorithms. Each class provides parameterizable functions tailored to its specific sorting algorithm. For example, RadixSort.optimized_II1(size, bit_width) generates an optimized radix sort with II = 1, while functional_pipelining(size, bit_width) generates a dataflow pipelined radix sort for the given parameters. Part 2 is the HLS project generator and configuration class. The configuration class accepts several parameters: the FPGA device, frequency, clock period, simulate_true, implement_true, and the name of the module. If simulate_true=1, the generated design is simulated and verified with a selected simulator inside HLS. If implement_true=1, the design is physically evaluated by RTL synthesis.

The user writes the top level function in Part 3; this calls the sorting routine. TopLevel is a Python decorator, which allows us to add additional information to the existing Python function. Once the TopLevel decorator starts executing, it does several things. First, it generates a customized sorting architecture tailored to the user provided parameters using Algorithm 1. Here V is the set of all variants of the existing sorting architectures, and D and R are the returned sorting design and the respective simulation/implementation results. The user provides UC, which must contain at least one element: the size of the array to sort (N). If |UC| is 1, the sorter generates the design with the highest throughput among the existing designs using the SorterGenerator procedure. The emitCode function generates optimized sorting architectures using existing HLS architectures (templates) wrapped in Python code. SorterGenerator includes a CalculateThroughput function that calculates the throughput TS of the current design using the initial II of each variant. We assume the II of each variant is known. For example, we know linear insertion sort (LIS) has II = 1, so TS(LIS) = 1 × N. SorterGenerator then generates a design D and returns a report R. In the case of |UC| > 1, we must satisfy the user constraints. Algorithm 1 presents the case where no design in the current pool satisfies UC (the other case, where there is a D that satisfies UC, is straightforward). We use a heuristic approach that continuously divides N into halves until it finds a design that satisfies UC. For a design D returned from SorterGenerator, we call CheckUserConstraints to check the conditions D(T) ≥ UC(T), D(S) ≤ UC(S), D(B) ≤ UC(B), and D(F) ≥ UC(F). If these conditions are met, then emitMerge generates HLS code from pre-wrapped templates in Python.

Algorithm 1: Customized Sorting Architecture Generation
   Data: UC = {T, S, B, F, N}, V = {V1, V2, .., Vm}, P = {N/2, N/4, ..}
   Result: D = architecture for UC, R = performance/area results
 1 if |UC| is 1 then
 2    [D, R] = SorterGenerator(V, N)
 3 end
 4 else
 5    foreach P do
 6       [D, R] = SorterGenerator(V, P)
 7       if CheckUserConstraints(UC) then
 8          emitMerge(D, P)
 9          if sim/impl is 1 then
10             R = Simulate D
11             R = Implement D
12          end
13       end
14    end
15 end
16 Procedure SorterGenerator(V, N)
      Data: V, N
      Result: D: Design, R: Report
17    TS(1, 2, .., m) = CalculateThroughput(V, N)
18    S = min(V1(t), V2(t), .., Vm(t))
19    [D, R] = emitCode S
20    if sim/impl is 1 then
21       Simulate D; Implement D
22    end

6. EXPERIMENTAL RESULTS

In this section, we present the performance and utilization results for a representative set of architectures generated by our framework, and an end-to-end (CPU/FPGA) implementation of selected sorting architectures. Finally, we compare our designs with existing implementations of sorting hardware architectures.

Basic Sorting Algorithms: We implemented the basic sorting algorithms – selection sort, rank sort, linear insertion sort, merge sort (two variants), sample sort, radix sort (two variants), bitonic sort, and odd-even transposition sort (two variants) – for three different problem sizes (32, 1024, 16384). The results are shown in Table 3, and are obtained after RTL synthesis targeting the Xilinx xc7vx1140tflg1930-1 chip using Vivado HLS 2014.3. The performance results are presented in terms of megabytes per second (MB/s).
Table 3: Implementation results for different sorting architectures. Tasks=number of parallel sorting processes. Entries with ’-’ are
omitted since the sorting architecture is not good for that particular size (e.g., the utilization is too high to fit on the target device).
32 1024 16384
Algorithm name Tasks Slices BRAM Freq MB/s Slices BRAM Freq MB/s Slices BRAM Freq MB/s
Selection sort 2 26 0 266 50 410 12 232 3.5 599 192 171 97
Rank sort 2 119 4 389 508 162 16 419 4 504 256 348 < 10
Linear insertion sort n 374 0 345 1380 12046 0 310 1243 - - - -
Merge sort (P) log n 1526 18 164 954 2035 40 239 482 484 608 155 1244
Merge sort (UP) log n 666 18 180 550 1268 40 281 899 2474 832 177 567
MergeStream (P) log n 529 8 211 794 1425 20 189 756 2487 140 166 666
Sample sort - - - - - 2777 218 228 911 5174 2838 127 510
8-bit Radix sort 4 1420 19 227 42 1500 36 230 202 1743 456 222 220
4-bit Radix sort 8 2146 30 353 223 2470 60 362 356 3352 960 289 289
Bitonic sort - 4391 0 268 1073 3239 56 268 1048 7274 1280 230 922
Odd-even trans 8*2 929 33 342 96 1254 36 301 15 1361 128 225 0.8
Odd-even trans 16*2 1326 0 323 70 2209 68 270 29 2370 128 212 1.64
Merge (Stream) - 221 0 395 1407 231 0 374 1490 255 0 368 1474
Merge4 + Radix - - - - - - - - - 1010 168 244 411
Merge8 + Radix - - - - - - - - - 2584 240 245 782
Merge16 + Radix - - - - - - - - - 4786 320 148 858
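Table 3's slice and MB/s columns define an area-throughput tradeoff space, and picking the non-dominated (Pareto optimal) points from such a table is mechanical. A small illustrative sketch over a few of the 1024-element rows (the helper below is ours, not part of the framework):

```python
def pareto(designs):
    """Keep designs that are not dominated: lower slices and higher MB/s are better."""
    keep = []
    for name, slices, mbps in designs:
        dominated = any(s2 <= slices and m2 >= mbps and (s2 < slices or m2 > mbps)
                        for _, s2, m2 in designs)
        if not dominated:
            keep.append(name)
    return keep

# A few 1024-element rows from Table 3: (name, slices, MB/s)
rows = [("Selection sort", 410, 3.5), ("Linear insertion", 12046, 1243),
        ("Merge sort (P)", 2035, 482), ("MergeStream (P)", 1425, 756),
        ("Bitonic sort", 3239, 1048)]
print(pareto(rows))  # Merge sort (P) is dominated by MergeStream (P)
```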
We show a broad set of implementations to highlight the ability of our framework to create a broad range of Pareto optimal designs rather than to simply show the best results.

Selection sort and rank sort both have small utilization but limited throughput, especially as the input size increases. Linear insertion sort has very high throughput, but it does not scale well: the number of slices grows linearly with the input size (to sort an n size array, n insertion cells are required), since we are directly increasing the number of insertion sort cells. Thus the linear insertion sort architecture should only be used to sort arrays with small sizes (e.g., 512). The designs Merge sort (P) and Merge sort (UP) are pipelined and unpipelined versions of a cascade of odd-even merges [13]. MergeStream (P) is the streaming version of the cascade of odd-even merge sort. The pipelined versions of merge sort achieve better II except for size 1024; this is because the HLS tool performs loop level transformations when we do not pipeline for size 1024. Sample sort tends to achieve higher throughput but uses more BRAMs than the other sorting architectures.

The 8-bit radix sort has four parallel tasks; the 4-bit radix sort has eight parallel tasks. Radix sort provides a good area-throughput tradeoff. In the 4-bit implementation, doubling the area produces a greater than 4× speedup for 32 inputs. This trend does not continue for larger input sizes, though the throughput does increase in all cases. This indicates that radix sort is suitable for medium size arrays. Bitonic sort achieves high throughput, but it tends to use more BRAMs than merge sort. Thus, bitonic sort is also best suited to sorting medium size arrays.

In the second part of Table 3, we present four hybrid sorting architectures. Merge (Stream) is a streaming version of merge sort that operates on pre-sorted inputs. It is designed for heterogeneous CPU/FPGA sorting where the smaller arrays are pre-sorted on the CPU. Merge4+Radix is generated with the user constraints UC(T = H, n = 16384, S < 1500, B < 170). This architecture uses the merge primitive to combine four 4096-element radix sorts, which gives the highest throughput design with fewer than 170 block RAMs (B < 170). The Merge8+Radix and Merge16+Radix architectures divide the input array into 8 and 16 sub arrays, respectively (similar to Merge4+Radix, except that they use more parallelism: 8-way and 16-way), and then use radix sort to sort the sub arrays.

Table 3 presents some of the basic sorting architectures. Once we have these kinds of sorting architectures, it is straightforward to generate even more sorting architectures for different user constraints. For example, we present slice, achieved clock period, and throughput results for streaming merge sort (pipelined (P) and unpipelined (UP)) in Figure 9. These results are obtained for different sizes and different user specified clock periods. We present only one case study here; we can generate a broad number of Pareto optimal designs for the aforementioned sorting algorithms to meet different user constraints.

End-to-end sorting system: To the best of our knowledge, there is no published end-to-end system implementation of large sorting problems using architectures created from HLS. We implemented and tested a number of different sorting algorithms on a hybrid CPU/FPGA system using RIFFA 2.2.1 [12]. The HLS sorting architectures use AXI stream, and the corresponding AXI signals are connected to the signals of RIFFA. We present the area and performance of several prototypes (sizes) in Table 4. In the first row of Table 4, we present the area results for RIFFA using only a loop-back HLS module (i.e., an empty HLS module); this shows the overhead of RIFFA. The remaining results include RIFFA and the sorting algorithm. Results for 16384 and 65536 are obtained using the xc7vx690tffg1761-2 FPGA running at 125 MHz, and a PC with an Intel Core i7 CPU at 3.6 GHz and 16 GB RAM. The CPU is used only to transmit and receive data. The sorting implemented on the FPGA can sort data at a rate of 0.44 - 0.5 GB/s. Our end-to-end system does not overlap communication and sorting times; thus, it has an average throughput of 0.23 GB/s. The last line of Table 4 shows results for a hybrid sorter of size 131072 formed from two 65536 size sorters, where the CPU merges the outputs of the sub sorters. These results can be improved linearly by using more channels on RIFFA or by increasing the clock frequency.

Comparison to previous work: We compare the results from our framework with the sorting networks from the Spiral project [23], the interleaved linear insertion sorter (ILS) [21], and merge sort [14]. We selected these because insertion sort is usually best suited for small size arrays, sorting networks are used for both small and medium size arrays, and merge sort is best for larger size arrays.
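The hybrid pattern used above, where sub-arrays are sorted independently and a merge stage combines the sorted runs (as in Merge (Stream) and the 131072-element sorter of Table 4), can be modeled in a few lines of software. This is an illustrative sketch only; the real designs stream data over AXI/RIFFA rather than calling heapq:

```python
import heapq

def hybrid_sort(data, ways=4):
    # "CPU side": split into `ways` sub-arrays and pre-sort each one
    # (stand-ins for the independently sorted sub-blocks).
    chunk = (len(data) + ways - 1) // ways
    runs = [sorted(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    # Streaming merge stage: combine the sorted runs.
    return list(heapq.merge(*runs))

print(hybrid_sort([7, 3, 9, 1, 8, 2, 6, 5], ways=4))  # [1, 2, 3, 5, 6, 7, 8, 9]
```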
Case study: Merge sort design space exploration

[Figure 9: Design space exploration of the generated architectures. Three panels plot Slices, Achieved Clock Period, and Throughput against input size (2^8 to 2^15) for P_X (pipelined) and UP_X (unpipelined) versions of streaming merge sort, where X is the user specified clock period (X = 3 to 10).]
Table 4: Area and performance of the end-to-end system. (* HLS result of the 131072 size hybrid sorter; + indicates CPU merging time.)

Design            Size    FF/LUT       BRAM  II
RIFFA             N/A     19472/16395  71    N/A
RIFFA+Sorting IP  16384   25118/20368  141   18434
RIFFA+Sorting IP  65536   26353/21707  333   73730
RIFFA+Sorting IP  131072  38436/31816  609   *73730+

Finally, we compare against the sorting architectures implemented in various different high-level languages [2].

First, we compare our results (streaming merge sort) to the sorting architectures from the Spiral project [23]. We used the same parameters in both cases: 32-bit fixed point type for all architectures, the Xilinx xc7vx690tffg1761-2 device, a streaming width of one (one streaming input and one streaming output), and a 125 MHz frequency. Spiral generates five different sorting architectures (SN1, SN2, SN3, SN4, and SN5). SN1 and SN3 are high performance, fully streaming architectures with large area; SN2 and SN4 balance area and throughput; and SN5 is an architecture optimized for area [23]. We compare against SN1, SN2, and SN5 because they provide a good balance between performance and area. For SN2, we generate fully streaming (SN2 S) and iterative (SN2 I) versions. We compared only against the fully streaming version of SN5 because the iterative version of SN5 has very low performance (e.g., the II of the SN5 iterative version for size 1024 is 102621). We implemented these designs (SN1, SN2 I, SN2 S, and SN5) using Vivado 2015.2. All of the results are presented after place-and-route.

Table 5 compares the four architectures from Spiral to our work. The throughput (II) is the number of clock cycles needed to sort an array of n elements. We obtained the Spiral throughput results from the report generated by the online tool (http://www.spiral.net/hardware/sort/sort.html). The throughput of our work is obtained from Vivado HLS co-simulation. In each case, this is the II for sorting one n size array. The best design (fastest, with small area) from the Spiral project for size 1024 is SN2 S; it uses 17.9× more BRAMs, 4.6× more FFs, and 2.1× more LUTs than our merge sort implementation for the 1024 element array. The smallest design from Spiral is SN2 I. For example, to sort a 16384 element array, SN2 I uses 13.7× more BRAMs, and its throughput is 14× worse than that of our merge sort implementation. SN1 and SN5 could not fit on the target device for size 16384 (e.g., SN5 requires 8196 BRAMs while the target device has only 1470).

We also compared our results to the work by Chen et al. [7], which designs an energy efficient bitonic sort on the same target device. Their design uses 19927 LUTs and 2 BRAMs for sorting 1024 elements, and 36656 LUTs and 88 BRAMs for sorting 16384 elements. The LUTs and BRAMs are calculated using the utilization percentages from [7].

Table 6: Streaming insertion sort generated in this paper (Resolve) vs. the interleaved linear insertion sorter (ILS) [21].

                             64     128    256
ILS Throughput (MSPS) [21]   4.6    2.33   1.16
Resolve Throughput (MSPS)    5.3    2.54   1.29
Ratio                        1.13X  1.08X  1.1X
ILS Slices [21]              1113   2227   4445
Resolve Slices               792    1569   3080
Ratio                        0.7X   0.7X   0.69X

Table 6 presents the throughput and utilization results of the interleaved linear insertion sorter (ILS) and our streaming insertion sort for different sizes (64, 128, 256). We calculated the slices of ILS as slices per node × number of elements (size); the slices per node for w = 1 is obtained from [21]. The throughput is the number of MSPS for a given size. Our insertion sorter has on average 1.1X better throughput while using about 0.7X the slices.

Arcas-Abella et al. [2] develop a spatial insertion sort and a bitonic sort using Bluespec, LegUp, Chisel, and Verilog. Table 7 compares our spatial insertion sort and bitonic sort designs to the implementations of that work. Our insertion sort achieves higher throughput and uses less area. Our bitonic sort achieves the same throughput with comparable area results.

Koch et al. [14] use partial reconfiguration to sort large arrays. They achieve a sorting throughput of 667 MB/s to 2 GB/s. We can improve our throughput by increasing the frequency (our HLS cores run at 125 MHz) and using additional RIFFA channels. Our system consumes more BRAMs because they implement a FIFO-based merge sort using shared memory blocks for both input streams; writing to a FIFO from two different processes during functional pipelining is not supported by the HLS tools that we used.
Table 5: Comparison to Spiral [23]. II is the number of clock cycles to produce one sorted array.

              64                     1024                    16384
              FF/LUT     BRAM  II    FF/LUT       BRAM  II    FF/LUT       BRAM  II
Spiral SN1    5866/1775  10    64    34191/28759  162   1024  -            -     -
Spiral SN2 I  2209/880   5     397   4053/2002    45    10261 6790/2547    964   229405
Spiral SN2 S  5912/1803  10    64    16165/5991   125   1024  62875/27448  1395  16384
Spiral SN5    9386/3023  18    64    27130/11104  225   1024  -            -     -
Resolve       1560/1401  2     68    3486/2848    7     1028  6515/4901    70    16388
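The II figures here relate to the MB/s figures of Table 3 through a simple conversion: with 4-byte elements and MB = 10^6 bytes, MB/s = n × 4 × f / II. The formula is inferred from the reported numbers rather than stated explicitly, but it reproduces, for example, the Merge (Stream) entry for 1024 elements in Table 3 from the corresponding Resolve II in Table 5:

```python
def mbps(n, ii, freq_mhz, bytes_per_elem=4):
    # arrays sorted per second = freq / II; bytes per array = n * width
    return freq_mhz * 1e6 / ii * n * bytes_per_elem / 1e6

# Streaming merge sort, 1024 elements: II = 1028 (Table 5) at 374 MHz
# (Table 3) gives ~1490 MB/s, matching the Merge (Stream) row of Table 3.
print(round(mbps(1024, 1028, 374)))  # 1490
```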
Table 7: Comparison of our work to [2]. (* calculated with II=1)

          Spatial Insertion        Bitonic
          FF/LUT     MB/s          LUT/FF      MB/s
Verilog   2081/641   1301          10250/2640  38016
BSV       2012/1701  1310          10250/2640  38326
Chisel    2012/720   1317          10272/2649  38447
LegUp     1115/823   3.13          4210/5180   1034
Resolve   605/661    1415          6404/9827   38016*

7. CONCLUSION

The Resolve framework generates optimized sorting architectures from pre-optimized HLS blocks. Resolve comes with a number of highly optimized sorting primitives and sorting architectures. Both the primitives and the basic sorting algorithms can be combined in countless ways using our domain specific language, which allows for efficient design space exploration and enables a user to meet all of the necessary system design constraints. The user can customize these hardware implementations in terms of sorting element size and data type, throughput, and FPGA device utilization constraints. Resolve integrates these sorting architectures with RIFFA, which enables designers to call these hardware accelerated sorting functions directly from a CPU with a PCIe enabled FPGA card.

References

[1] S. G. Akl. Parallel sorting algorithms. AP, Inc, 1985.
[2] O. Arcas-Abella et al. An empirical evaluation of high-level synthesis languages and tools for database acceleration. In International Conference on Field Programmable Logic and Applications. IEEE, 2014.
[3] M. Bednara et al. Tradeoff analysis and architecture design of a hybrid hardware/software sorter. In International Conference on Application-Specific Systems, Architectures, and Processors, pages 299–308. IEEE, 2000.
[4] V. Brajovic et al. A sorting image sensor: An example of massively parallel intensity-to-time processing for low-latency computational sensors. In International Conference on Robotics and Automation, volume 2, pages 1638–1643. IEEE, 1996.
[5] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. 1994.
[6] J. Casper et al. Hardware acceleration of database operations. In International Symposium on Field-Programmable Gate Arrays, pages 151–160. ACM, 2014.
[7] R. Chen et al. Energy and memory efficient mapping of bitonic sorting on FPGA. In International Symposium on Field-Programmable Gate Arrays, pages 240–249. ACM, 2015.
[8] J. Chhugani et al. Efficient implementation of sorting on multi-core SIMD CPU architecture. Proceedings of the VLDB Endowment, 1(2):1313–1324, 2008.
[9] J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[10] N. George et al. Hardware system synthesis from domain-specific languages. In Field Programmable Logic and Applications, pages 1–8. IEEE, 2014.
[11] G. Graefe. Implementing sorting in database systems. ACM Computing Surveys (CSUR), 38(3):10, 2006.
[12] M. Jacobsen et al. RIFFA 2.1: A reusable integration framework for FPGA accelerators. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2015.
[13] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley Professional, 1998.
[14] D. Koch et al. FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In International Symposium on Field Programmable Gate Arrays, pages 45–54. ACM, 2011.
[15] C. Lauterbach et al. Fast BVH construction on GPUs. In Computer Graphics Forum, volume 28, pages 375–384. Wiley Online Library, 2009.
[16] R. Marcelino et al. Sorting units for FPGA-based embedded systems. In Distributed Embedded Systems: Design, Middleware and Resources, pages 11–22. Springer, 2008.
[17] J. Matai et al. Enabling FPGAs for the masses. In First International Workshop on FPGAs for Software Programmers, 2014.
[18] R. Mueller et al. Data processing on FPGAs. Proceedings of the VLDB Endowment, 2(1):910–921, 2009.
[19] R. Mueller et al. Sorting networks on FPGAs. The VLDB Journal - The International Journal on Very Large Data Bases, 21(1):1–23, 2012.
[20] R. Mueller, J. Teubner, and G. Alonso. Data processing on FPGAs. Proceedings of the VLDB Endowment, 2(1):910–921, 2009.
[21] J. Ortiz et al. A streaming high-throughput linear sorter system with contention buffering. International Journal of Reconfigurable Computing, 2011.
[22] N. Satish et al. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10. IEEE, 2009.
[23] M. Zuluaga et al. Computer generation of streaming sorting networks. In Design Automation Conference, pages 1245–1253. ACM, 2012.