Resolve: Generation of High-Performance Sorting Architectures from High-Level Synthesis
Janarbek Matai*, Dustin Richmond*, Dajung Lee†, Zac Blair*, Qiongzhi Wu*, Amin Abazari*,
and Ryan Kastner*
*Computer Science and Engineering, †Electrical and Computer Engineering
University of California, San Diego, La Jolla, CA 92093, United States
{jmatai, drichmond, dal064, zblair, qiw035, maabazari, kastner}@ucsd.edu
[Figure residue: a framework overview (user constraints on size, performance, area, ... yield customized sorting architectures) and primitive datapath diagrams (compare-select cells and a histogram update unit).]

…text of high-level languages. Arcas-Abella et al. [2] looked at the feasibility of implementing bitonic sort and spatial insertion sort…
Insertion sort has a complexity of O(n²). Listing 1 shows a software-centric HLS implementation of insertion sort. We discussed some naive HLS optimizations for insertion sort in Section 2. These used different optimization directives (pragmas) in an attempt to create better hardware implementations. These designs (Designs 1-5 in Table 1) did not result in the optimal implementation; Design 6 gives the best result. Here we describe the code restructuring optimizations of Design 6.

An efficient hardware implementation of insertion sort uses a linear array of insertion-cells [2, 3, 16, 21] or a sorting network [19]. Here we focus on a linear insertion sort implementation; we discuss the sorting network implementation later. Figure 4 shows the architecture from Arcas-Abella et al. [2]. In this architecture, a series of cells (insertion-cell primitives) operate in parallel to sort a given array. Each cell compares the current input (IN) with the value in its current register (CURR_REG). The smaller of the current register and the current input is given as an output on OUT.

Listing 8 shows the source code that represents the hardware architecture in Figure 4. A cascade of insertion-cells is implemented in a pipelined manner using the dataflow pragma and a series of calls to the InsertionCell functions from Listing 7. Note that we have four different versions of the function: InsertionCell1, InsertionCell2, etc. It is necessary to replicate the functions due to the use of the static variable. Each of these functions has the same code as in Listing 7. This implementation achieves O(n) time complexity to sort an array of size n.

void InsertionSort(hls::stream<T> &IN, hls::stream<T> &OUT) {
#pragma HLS DATAFLOW
  hls::stream<T> out1, out2, out3;
  // Function calls
  InsertionCell1(IN, out1);
  InsertionCell2(out1, out2);
  InsertionCell3(out2, out3);
  InsertionCell4(out3, OUT);
}

Listing 8: Insertion sort code for the HLS design based on the hardware architecture in Figure 4. The InsertionCell functions use the code from Listing 7.
4.2.2 Recursive Algorithms
A pure software implementation of merge sort or quick sort is not possible in HLS due to the use of recursive functions; HLS tools (including Vivado HLS) typically do not allow recursive function calls. Changing from a recursive implementation to one that is synthesizable requires modifying the software implementation to remove the recursive function calls from the code.

Merge sort has two primary tasks. The first task partitions the array into individual elements, and the second merges them. The majority of the work is performed in the merging unit, which is implemented with a merge primitive. This was described in Section 4.1.

Merge sort is implemented in hardware using a merge sorter tree [14] or using odd-even merge sort. Listing 9 provides an outline of the code for a streaming merge sorter tree. In this code, IN1, IN2, IN3 and IN4 are inputs of size n/4, and OUT is an output of size n. MergePrimitive1 and MergePrimitive2 merge two sorted lists of size n/4 and n/2, respectively. Using the dataflow pragma, we can perform a functional pipeline across these three functions. Merge sort based on odd-even merge also uses the merge sorting primitive to sort an array of size n with an II of n. Merge sort can be optimized in hardware by running its n log n tasks in parallel.

void CascadeMergeSort(hls::stream<int> &IN1,
    hls::stream<int> &IN2, hls::stream<int> &IN3,
    hls::stream<int> &IN4, hls::stream<int> &OUT) {
#pragma HLS DATAFLOW
#pragma HLS stream depth=4 variable=IN1
  for (int i = 0; i < SIZE/4; i++) {
    // read input data
  }
  MergePrimitive1(IN1, IN2, TEMP1);
  MergePrimitive1(IN3, IN4, TEMP2);
  MergePrimitive2(TEMP1, TEMP2, OUT);
}

Listing 9: FIFO-based streaming merge sorter tree
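The merge primitive is defined in Section 4.1 and not repeated in this excerpt. A minimal streaming sketch of what MergePrimitive1 could look like follows; the template parameter and the interface are illustrative assumptions:

#include <hls_stream.h>

// Sketch: merge two sorted streams of LEN elements each into one sorted
// stream of 2*LEN elements, consuming one buffered element per iteration.
template <int LEN>
void MergePrimitive(hls::stream<int> &A, hls::stream<int> &B,
                    hls::stream<int> &OUT) {
  int a = 0, b = 0;
  int ia = 0, ib = 0;              // elements read so far from A and B
  bool haveA = false, haveB = false;
  for (int i = 0; i < 2 * LEN; i++) {
#pragma HLS PIPELINE II=1
    if (!haveA && ia < LEN) { a = A.read(); ia++; haveA = true; }
    if (!haveB && ib < LEN) { b = B.read(); ib++; haveB = true; }
    if (haveA && (!haveB || a <= b)) { OUT.write(a); haveA = false; }
    else                             { OUT.write(b); haveB = false; }
  }
}

In this sketch, MergePrimitive<SIZE/4> plays the role of MergePrimitive1 and MergePrimitive<SIZE/2> that of MergePrimitive2 in Listing 9; the paper presumably uses separate named functions so that each instance becomes its own dataflow process.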
Quick sort uses a randomly selected pivot to recursively split an array into elements that are larger and smaller than the pivot. After selecting a pivot, all elements smaller than the pivot are moved to its left, i.e., to lower indices in the array. This process is repeated for the left and right sides separately. The software complexity of this algorithm is O(n²) in the worst case and O(n log n) in the best case. A non-recursive (iterative) version of quick sort can be implemented in HLS, but with slow performance. Instead, we chose to implement a parallel version of quick sort known as sample sort. In sample sort, we can run t tasks that divide the work of pivot_function, so that each task sorts n/t elements of the n-element array. The integration of the results from the t tasks can be done using the prefix sum primitive. Essentially, this implementation sorts an array of size n in O(n) time with higher BRAM usage.
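The prefix sum primitive is likewise described in Section 4.1 rather than here. For reference, a streaming running-sum sketch (interface assumed) is simply:

#include <hls_stream.h>

// Sketch of a streaming prefix sum: output element i is the sum of input
// elements 0..i. Sample sort uses such a primitive to integrate the
// per-task results.
void PrefixSum(hls::stream<int> &IN, hls::stream<int> &OUT, int size) {
  int acc = 0;
  for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
    acc += IN.read();
    OUT.write(acc);
  }
}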
4.2.3 Non-comparison Based
Counting sort has three stages. First, counting sort computes the histogram of the elements in the unsorted input array. The second stage performs a prefix sum on the histogram from the previous stage. The final stage sorts the array: it first reads a value from the unsorted input array, then finds the first index for that element from the prefix sum stage, writes the value to the output array at that index, and increments that index in the prefix sum by one. Figure 5 (a) shows an example of the counting sort algorithm on an 8 element input array. The first stage performs a histogram on the input data. There are only three values (2, 3, 4), and they occur 3, 2, and 3 times in the unsorted input array, respectively. The second stage does a prefix sum across the histogram frequencies. This tells us the starting index for each of the three values: the value 2 starts at index 0; the value 3 starts at index 3; and the value 4 starts at index 5. The final stage uses these prefix sum indices to fill in the sorted array. Parallel counting sort can be designed using function pipelining of the three stages. It runs in O(n) time using O(n × k) (k is a constant) memory storage.
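The paper's counting sort code is not included in this excerpt. A sketch of the three-stage structure under function pipelining could look as follows; N, K, the duplicated input arrays (one reader per array, to stay within the usual dataflow rules), and all names are assumptions:

#define N 1024  // assumed input size
#define K 256   // assumed value range: elements lie in [0, K)

// Stage 1: histogram of the unsorted input.
void Histogram(const int in[N], int hist[K]) {
  for (int v = 0; v < K; v++) hist[v] = 0;
  for (int i = 0; i < N; i++) hist[in[i]]++;
}

// Stage 2: prefix sum over the histogram gives each value's start index.
void PrefixSumStage(const int hist[K], int start[K]) {
  int acc = 0;
  for (int v = 0; v < K; v++) { start[v] = acc; acc += hist[v]; }
}

// Stage 3: place each element at its value's next free output index.
void Place(const int in[N], int start[K], int out[N]) {
  for (int i = 0; i < N; i++) out[start[in[i]]++] = in[i];
}

// Top level: the three stages form a functional pipeline.
void CountingSort(const int in1[N], const int in2[N], int out[N]) {
#pragma HLS DATAFLOW
  int hist[K], start[K];
  Histogram(in1, hist);    // in1 and in2 hold the same data
  PrefixSumStage(hist, start);
  Place(in2, start, out);  // second pass over the input
}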
[Figure 5 residue: (a) the unsorted input 4 2 4 4 2 3 3 2 passes through 1) a histogram (values 2, 3, 4 occur 3, 2, 3 times), 2) a prefix sum (0, 3, 5), and 3) output placement; (b) radix sort as a cascade of counting sorts with Mem(n) buffers between stages.]

Figure 5: Example hardware architectures for counting sort and radix sort

Radix sort works by applying counting sort to each digit of the input data. For example, to sort 32-bit integers, we can apply counting sort four times, once to each of the four 8-bit (radix 2^8) digits. We can implement a fully parallel radix sort in HLS using functional pipelining of each counting sort. An individual counting sort operation has a throughput of n, thus the fully parallel radix sort will also have a throughput of n. To store the outputs of the intermediate stages, we need n × k storage. Here k is usually 4 for a 32-bit number or 8 for a 64-bit number. Thus, to sort 32-bit numbers in parallel, we use 3 × n storage (3 intermediate memory stages) as shown in Figure 5 (b).

…due to the required IO throughput. This requires balancing parallelism and area in HLS, and will be discussed later. For example, using n parallel compare-swap elements, odd-even transposition sort can sort an array of size n in O(n).

5. SORTING ARCHITECTURE GENERATOR

In this section, we describe our framework for generating sorting architectures. A user can perform design space exploration for a range of different application parameters. And once she has decided on a particular architecture, the framework generates a customized sorting architecture that can run out of the box on a heterogeneous CPU/FPGA system. It creates the RTL code if the user wishes to integrate it into the system in another manner.

The flow for our sorting framework is shown in Figure 8. We define a user constraint as a tuple UC(T, S, B, F, N) where T, S, B, F and N are the throughput, number of slices, number of block RAMs, frequency, and the number of elements to sort. We define V as a set of sorting designs that can perform sorting on an input array of size N. The sorting architecture generation problem is to find a design D of the form D(T, S, B, F, N) that satisfies the UC.

a) RD  ::= RD v1 | RD v2 | RD v3 | RD v4 | RD v5
   BtS ::= BtS v1 | BtS v2 | BtS v3 | BtS v4 | BtS v5

b) Sort ::=
   | SS n | RS n | BS n | IS n | MS n | QS n
   | RD n | BtS n | OET n | OEM n
   | Merge (Sort, Sort)

c) match Sort (n, v) ::=
   | SS n  emit SS (v)
   | RS n  emit RS (v)
   | BS n  emit BS (v)
   | IS n  emit IS (v)
   | MS n  emit MS (v)
   | QS n  emit QS (v)
   | RD n  emit RD (v)
   | BtS n emit BtS (v)
   | OET n emit OET (v)
   | OEM n emit OEM (v)

Figure 7: Grammar of the domain-specific language. SS=Selection sort, RS=Rank sort, BS=Bubble sort, IS=Insertion sort, MS=Merge sort, QS=Quick sort, RD=Radix sort, BtS=Bitonic sort, OET=Odd-even transposition sort, OEM=Odd-even merge sort. a) Sorting architectural variants for a particular algorithm, b) Sort function grammar, c) Code generator
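To make the selection problem concrete, the UC tuple and its satisfaction test can be written as below. This is purely illustrative; the framework's internal representation is not given here:

// Illustrative UC(T, S, B, F, N) check: a design satisfies the user
// constraint if it sorts the requested number of elements, meets the
// throughput and frequency targets, and stays within the area budgets.
struct Design {
  double throughput;  // T, e.g., in MB/s
  int    slices;      // S
  int    brams;       // B
  double freq;        // F, in MHz
  int    n;           // N, the number of elements to sort
};

bool satisfies(const Design &d, const Design &uc) {
  return d.n == uc.n &&
         d.throughput >= uc.throughput &&
         d.slices     <= uc.slices &&
         d.brams      <= uc.brams &&
         d.freq       >= uc.freq;
}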
[Figure residue: panels (a) and (b) show two sorting-network structures mapping inputs x0-x7 onto outputs y0-y7.]

[Figure 8 residue: the framework flow, in which a Python configuration (from components import InsertionSort, MergeSort, RadixSort, ...; conf = Configuration.Configuration(...); sort = RadixSort(10, "RadixSort", 32, 4)) drives the sorting architecture selection.]
Table 3: Slices, BRAMs, frequency (MHz), and throughput (MB/s) of the generated sorting architectures for input sizes 32, 1024, and 16384.

                              --------- 32 ---------   -------- 1024 --------   -------- 16384 -------
Algorithm name         Tasks  Slices BRAM Freq  MB/s   Slices BRAM Freq  MB/s   Slices BRAM Freq  MB/s
Selection sort         2      26     0    266   50     410    12   232   3.5    599    192  171   97
Rank sort              2      119    4    389   508    162    16   419   4      504    256  348   <10
Linear insertion sort  n      374    0    345   1380   12046  0    310   1243   -      -    -     -
Merge sort (P)         log n  1526   18   164   954    2035   40   239   482    484    608  155   1244
Merge sort (UP)        log n  666    18   180   550    1268   40   281   899    2474   832  177   567
MergeStream (P)        log n  529    8    211   794    1425   20   189   756    2487   140  166   666
Sample sort            -      -      -    -     -      2777   218  228   911    5174   2838 127   510
8-bit Radix sort       4      1420   19   227   42     1500   36   230   202    1743   456  222   220
4-bit Radix sort       8      2146   30   353   223    2470   60   362   356    3352   960  289   289
Bitonic sort           -      4391   0    268   1073   3239   56   268   1048   7274   1280 230   922
Odd-even trans         8*2    929    33   342   96     1254   36   301   15     1361   128  225   0.8
Odd-even trans         16*2   1326   0    323   70     2209   68   270   29     2370   128  212   1.64
Merge (Stream)         -      221    0    395   1407   231    0    374   1490   255    0    368   1474
Merge4 + Radix         -      -      -    -     -      -      -    -     -      1010   168  244   411
Merge8 + Radix         -      -      -    -     -      -      -    -     -      2584   240  245   782
Merge16 + Radix        -      -      -    -     -      -      -    -     -      4786   320  148   858
We show a broad set of implementations to highlight the ability of our framework to create a broad number of Pareto optimal designs rather than simply show the best results.

Selection sort and rank sort both have small utilization but limited throughput, especially as the input size increases. Linear insertion sort has very high throughput, but it does not scale well: the number of slices has a linear relationship with the input size (to sort an array of size n, n insertion-cells are required), since we are directly increasing the number of insertion sort cells. Thus, the linear insertion sort architecture should only be used to sort arrays with small sizes (e.g., 512).

The designs Merge sort (P) and Merge sort (UP) are pipelined and unpipelined versions of a cascade of odd-even merges [13]. MergeStream (P) is the streaming version of the cascade of odd-even merge sort. The pipelined version of merge sort achieves a better II except for size 1024; this is because the HLS tool performs loop level transformations when we do not pipeline for size 1024. Sample sort tends to achieve higher throughput but uses more BRAMs than the other sorting architectures.

The 8-bit radix sort has four parallel tasks; the 4-bit radix sort has eight parallel tasks. Radix sort provides a good area-throughput tradeoff. In the 4-bit implementation, doubling the area produces a greater than 4× speedup for 32 inputs. This trend does not continue for larger input sizes, though the throughput does increase in all cases. This indicates that radix sort is suitable for medium size arrays. Bitonic sort achieves high throughput, but it tends to use more BRAMs than merge sort. Thus, bitonic sort is also suitable for sorting medium size arrays.

In the second part of Table 3, we present four hybrid sorting architectures. Merge (Stream) is a streaming version of merge sort that operates on pre-sorted inputs. It is designed for heterogeneous CPU/FPGA sorting where the smaller arrays are pre-sorted on the CPU. Merge4+Radix is generated with the user constraints UC(T = H, n = 16384, S < 1500, B < 170). This architecture uses the merge primitive to combine four 4096-element radix sorts, which gives the highest throughput design with less than 170 block RAMs (B < 170). The Merge8+Radix and Merge16+Radix architectures divide the input array as Merge4+Radix does, except with more parallelism (8-way and 16-way), splitting it into 8 and 16 subarrays, respectively, and then use radix sort to sort the subarrays.

Case study: Merge sort design space exploration: Table 3 presents some of the basic sorting architectures. Once we have these kinds of sorting architectures, it is straightforward to generate even more sorting architectures for different user constraints. For example, we present slices, achieved clock period, and throughput results for streaming merge sort (pipelined (P) and unpipelined (UP)) in Figure 9. These results are obtained for different sizes and different user specified clock periods. We only present one case study here; we can generate a broad number of Pareto optimal designs for the aforementioned sorting algorithms to meet different user constraints.

End-to-end sorting system: To the best of our knowledge, there is no published end-to-end system implementation of large sorting problems using architectures created from HLS. We implemented and tested a number of different sorting algorithms on a hybrid CPU/FPGA system using RIFFA 2.2.1 [12]. The HLS sorting architectures use AXI stream; the corresponding AXI signals are connected to the signals of RIFFA. We present the area and performance of several prototypes (sizes) in Table 4. In the first row of Table 4, we present the area results for RIFFA with only a loop-back HLS module (i.e., an empty HLS module); this shows the overhead of RIFFA. The remaining results include RIFFA and the sorting algorithm. Results for 16384 and 65536 are obtained using the xc7vx690tffg1761-2 FPGA running at 125 MHz and a PC with an Intel Core i7 CPU at 3.6 GHz and 16 GB RAM. The CPU is used only to transmit and receive data. The sorting implemented on the FPGA can sort data at a rate of 0.44 - 0.5 GB/s. Our end-to-end system does not overlap communication and sorting times; thus, it has an average throughput of 0.23 GB/s. The last line of Table 4 shows the hybrid sorter results for size 131072, formed from two 65536-size sorters; the CPU merges the outputs of the sub-sorters. These results can be improved linearly by using more channels on RIFFA or by increasing the clock frequency.
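As a hedged sketch of the AXI stream connection described above (SortCore, SIZE, and the packing details are assumptions; only the axis interface pragma follows standard Vivado HLS usage):

#include <hls_stream.h>
#include <ap_axi_sdata.h>

#define SIZE 16384                      // assumed array size
typedef ap_axiu<32, 0, 0, 0> axi_word;  // one 32-bit AXI stream beat

void SortCore(hls::stream<int> &in, hls::stream<int> &out);  // a generated sorter (assumed)

// Sketch of a top level that RIFFA can drive: unpack AXI beats into plain
// integers, sort, then repack with TLAST set on the final word.
void SortTop(hls::stream<axi_word> &rx, hls::stream<axi_word> &tx) {
#pragma HLS INTERFACE axis port=rx
#pragma HLS INTERFACE axis port=tx
#pragma HLS DATAFLOW
  hls::stream<int> in, out;
  for (int i = 0; i < SIZE; i++)
    in.write(rx.read().data);           // strip the AXI side-channel fields
  SortCore(in, out);
  for (int i = 0; i < SIZE; i++) {
    axi_word w;
    w.data = out.read();
    w.keep = -1;                        // all bytes valid
    w.last = (i == SIZE - 1);           // end of packet
    tx.write(w);
  }
}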
[Figure 9 plots (case study: merge sort design space exploration): throughput, slices, and achieved clock period versus input size (2^8 to 2^15) for the designs P_3 to P_9 and UP_3 to UP_8.]

Figure 9: Design space exploration of generated architectures: P_X (X is the user specified clock period, X = 3 to 10): pipelined and UP_X (X = 3 to 10): unpipelined versions of merge sort.

Comparison to previous work: We compare the results from our framework with the sorting networks from the Spiral project [23], the interleaved linear insertion sorter (ILS) [21], and merge sort [14]. We selected these because insertion sort is usually best suited for small size arrays, sorting networks are used for both small and medium size arrays, and merge sort is best for larger size arrays. Finally, we also compare against the sorting architectures implemented in various different high-level languages [2].
Table 4: Area and performance of the end-to-end system. (*: HLS result of the 131072-size hybrid sorter; +: indicates CPU merging time.)

Design             Size     FF/LUT        BRAM  II
RIFFA              N/A      19472/16395   71    N/A
RIFFA+Sorting IP   16384    25118/20368   141   18434
RIFFA+Sorting IP   65536    26353/21707   333   73730
RIFFA+Sorting IP   131072   38436/31816   609   *73730+
First we compare our results (streaming merge sort) to the sorting architectures from the Spiral project [23]. We used the same parameters in both cases: a 32-bit fixed point type for all architectures, the Xilinx xc7vx690tffg1761-2, a streaming width of one (one streaming input and one streaming output), and a 125 MHz frequency. Spiral generates five different sorting architectures (SN1, SN2, SN3, SN4, and SN5). SN1 and SN3 are high performance fully streaming architectures with large area. SN2 and SN4 balance area and throughput. And SN5 is an architecture optimized for area [23]. We compare against SN1, SN2, and SN5 because they provide a good balance between performance and area. For SN2, we generate fully streaming (SN2 S) and iterative (SN2 I) versions. We only compared against the SN5 fully streaming version because the iterative version of SN5 has very low performance (e.g., the throughput of the SN5 iterative version for size 1024 is 102621 cycles). We implemented these designs (SN1, SN2 I, SN2 S, and SN5) using Vivado 2015.2. All of the results are presented after place-and-route.
Table 5 compares the four architectures from Spiral to our work. The throughput (II) is the number of clock cycles needed to sort an array of n elements. We obtained the Spiral throughput results from the report generated by the online tool (http://www.spiral.net/hardware/sort/sort.html). The throughput of our work is obtained from Vivado HLS co-simulation. In each case, this is the II for sorting one array of size n. The best design (fastest, small area) from the Spiral project is SN2 S for size 1024. SN2 S uses 17.9× more BRAMs, 4.6× more FFs, and 2.1× more LUTs than our merge sort implementation for the 1024 element array. The smallest design from Spiral is SN2 I. For example, to sort a 16384 element array, SN2 I uses 13.7× more BRAMs, and its throughput is 14× worse than our merge sort implementation. SN1 and SN5 could not fit on the target device for size 16384 (e.g., SN5 requires 8196 BRAMs while the target device has only 1470).
Table 5: Comparison to Spiral [23]. II is the number of clock cycles to produce one sorted array.

              --------- 64 ----------   --------- 1024 ---------   -------- 16384 ---------
Design        FF/LUT      BRAM  II      FF/LUT        BRAM  II     FF/LUT        BRAM  II
Spiral SN1    5866/1775   10    64      34191/28759   162   1024   -             -     -
Spiral SN2 I  2209/880    5     397     4053/2002     45    10261  6790/2547     964   229405
Spiral SN2 S  5912/1803   10    64      16165/5991    125   1024   62875/27448   1395  16384
Spiral SN5    9386/3023   18    64      27130/11104   225   1024   -             -     -
Resolve       1560/1401   2     68      3486/2848     7     1028   6515/4901     70    16388
We also compared our results to the work by Chen et al. [7], which designs an energy efficient bitonic sort on the same target device. Their designs use 19927 LUTs and 2 BRAMs for sorting 1024 elements, and 36656 LUTs and 88 BRAMs for sorting 16384 elements. The LUTs and BRAMs are calculated using the utilization percentages from [7].

Table 6 presents the throughput and utilization results of the interleaved linear insertion sorter (ILS) and our streaming insertion sort for different sizes (64, 128, 256). We calculated the slices of ILS as slices per node × number of elements (size). The slices per node for w = 1 is obtained from [21]. The throughput is the number of MSPS for a given size (64, 128, 256). Our insertion sorter has on average 1.1X better throughput while using roughly 0.7X the slices.

Table 6: Streaming insertion sort generated in this paper (Resolve) vs. the interleaved linear insertion sorter (ILS) [21].

                             64     128    256
ILS Throughput (MSPS) [21]   4.6    2.33   1.16
Resolve Throughput (MSPS)    5.3    2.54   1.29
Ratio                        1.13X  1.08X  1.1X
ILS Slices [21]              1113   2227   4445
Resolve Slices               792    1569   3080
Ratio                        0.7X   0.7X   0.69X

Arcas-Abella et al. [2] develop a spatial insertion sort and a bitonic sort using Bluespec (BSV), LegUp, Chisel, and Verilog. Table 7 shows a comparison of our spatial insertion and bitonic sort designs to the implementations from that work. We achieve higher throughput and use less area, and our bitonic sort achieves the same throughput with comparable area results.

Table 7: Comparison of our work to [2]. * calculated with II=1.

           Spatial Insertion       Bitonic
           FF/LUT      MB/s        LUT/FF       MB/s
Verilog    2081/641    1301        10250/2640   38016
BSV        2012/1701   1310        10250/2640   38326
Chisel     2012/720    1317        10272/2649   38447
LegUp      1115/823    3.13        4210/5180    1034
Resolve    605/661     1415        6404/9827    38016*

Koch et al. [14] use partial reconfiguration to sort large arrays. They achieve a sorting throughput of 667 MB/s to 2 GB/s. We can improve our throughput by increasing the frequency (our HLS cores run at 125 MHz) and by using additional RIFFA channels. Our system consumes more BRAMs because they implement a FIFO-based merge sort using shared memory blocks for both input streams; writing to a FIFO from two different processes during functional pipelining is not supported by the HLS tools that we used.
7. CONCLUSION
The Resolve framework generates optimized sorting architectures from pre-optimized HLS blocks. Resolve comes with a number of highly optimized sorting primitives and sorting architectures. Both the primitives and the basic sorting algorithms can be combined in countless manners using our domain specific language, which allows for efficient design space exploration and enables a user to meet all of the necessary system design constraints. The user can customize these hardware implementations in terms of sorting element size and data type, throughput, and FPGA device utilization constraints. Resolve integrates these sorting architectures with RIFFA, which enables designers to call these hardware accelerated sorting functions directly from a CPU with a PCIe enabled FPGA card.

References
[1] S. G. Akl. Parallel Sorting Algorithms. Academic Press, Inc., 1985.
[2] O. Arcas-Abella et al. An empirical evaluation of high-level synthesis languages and tools for database acceleration. In International Conference on Field Programmable Logic and Applications. IEEE, 2014.
[3] M. Bednara et al. Tradeoff analysis and architecture design of a hybrid hardware/software sorter. In International Conference on Application-Specific Systems, Architectures, and Processors, pages 299–308. IEEE, 2000.
[4] V. Brajovic et al. A sorting image sensor: An example of massively parallel intensity-to-time processing for low-latency computational sensors. In International Conference on Robotics and Automation, volume 2, pages 1638–1643. IEEE, 1996.
[5] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, 1994.
[6] J. Casper et al. Hardware acceleration of database operations. In International Symposium on Field-Programmable Gate Arrays, pages 151–160. ACM, 2014.
[7] R. Chen et al. Energy and memory efficient mapping of bitonic sorting on FPGA. In International Symposium on Field-Programmable Gate Arrays, pages 240–249. ACM, 2015.
[8] J. Chhugani et al. Efficient implementation of sorting on multi-core SIMD CPU architecture. Proceedings of the VLDB Endowment, 1(2):1313–1324, 2008.
[9] J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[10] N. George et al. Hardware system synthesis from domain-specific languages. In Field Programmable Logic and Applications, pages 1–8. IEEE, 2014.
[11] G. Graefe. Implementing sorting in database systems. ACM Computing Surveys (CSUR), 38(3):10, 2006.
[12] M. Jacobsen et al. RIFFA 2.1: A reusable integration framework for FPGA accelerators. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2015.
[13] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley Professional, 1998.
[14] D. Koch et al. FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In International Symposium on Field Programmable Gate Arrays, pages 45–54. ACM, 2011.
[15] C. Lauterbach et al. Fast BVH construction on GPUs. In Computer Graphics Forum, volume 28, pages 375–384. Wiley Online Library, 2009.
[16] R. Marcelino et al. Sorting units for FPGA-based embedded systems. In Distributed Embedded Systems: Design, Middleware and Resources, pages 11–22. Springer, 2008.
[17] J. Matai et al. Enabling FPGAs for the masses. In First International Workshop on FPGAs for Software Programmers, 2014.
[18] R. Mueller et al. Data processing on FPGAs. Proceedings of the VLDB Endowment, 2(1):910–921, 2009.
[19] R. Mueller et al. Sorting networks on FPGAs. The VLDB Journal: The International Journal on Very Large Data Bases, 21(1):1–23, 2012.
[20] R. Mueller, J. Teubner, and G. Alonso. Data processing on FPGAs. Proceedings of the VLDB Endowment, 2(1):910–921, 2009.
[21] J. Ortiz et al. A streaming high-throughput linear sorter system with contention buffering. International Journal of Reconfigurable Computing, 2011.
[22] N. Satish et al. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10. IEEE, 2009.
[23] M. Zuluaga et al. Computer generation of streaming sorting networks. In Design Automation Conference, pages 1245–1253. ACM, 2012.