
Data Processing on FPGAs

Rene Mueller, Jens Teubner, Gustavo Alonso


[email protected] [email protected] [email protected]
Systems Group, Department of Computer Science, ETH Zurich, Switzerland

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '09, August 24-28, 2009, Lyon, France.
Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

ABSTRACT

Computer architectures are quickly changing toward heterogeneous many-core systems. Such a trend opens up interesting opportunities but also raises immense challenges, since the efficient use of heterogeneous many-core systems is not a trivial problem. In this paper, we explore how to program data processing operators on top of field-programmable gate arrays (FPGAs). FPGAs are very versatile in terms of how they can be used and can also be added as additional processing units in standard CPU sockets.

In the paper, we study how data processing can be accelerated using an FPGA. Our results indicate that efficient usage of FPGAs involves non-trivial aspects such as having the right computation model (an asynchronous sorting network in this case); a careful implementation that balances all the design constraints in an FPGA; and the proper integration strategy to link the FPGA to the rest of the system. Once these issues are properly addressed, our experiments show that FPGAs exhibit performance figures competitive with those of modern general-purpose CPUs, while offering significant advantages in terms of power consumption and parallel stream evaluation.

1. INTRODUCTION

Taking advantage of specialized hardware has a long tradition in data processing. Some of the earliest efforts involved building entire machines tailored to database engines [8]. More recently, graphics processing units (GPUs) have been used to efficiently implement certain types of operators [11, 12].

Parallel to these developments, computer architectures are quickly evolving toward heterogeneous many-core systems. These systems will soon have a (large) number of processors, and the processors will not be identical. Some will have full instruction sets, others will have reduced or specialized instruction sets; they may use different clock frequencies or exhibit different power consumption; floating-point arithmetic-logic units will not be present in all processors; and there will be highly specialized cores such as field-programmable gate arrays (FPGAs) [13, 22]. An example of such a heterogeneous system is the Cell Broadband Engine, which contains, in addition to a general-purpose core, multiple special execution cores (synergistic processing elements, or SPEs).

Given that existing applications and operating systems already have significant problems when dealing with multi-core systems [5], such diversity adds yet another dimension to the complex task of adapting data processing software to new hardware platforms. Unlike in the past, it is no longer just a question of taking advantage of specialized hardware, but a question of adapting to new, inescapable architectures.

In this paper, we focus our attention on FPGAs as one of the more unconventional elements that can be found in many-core systems. FPGAs are (re-)programmable hardware that can be tailored to almost any application. However, it is as yet unclear how the potential of FPGAs can be efficiently exploited. Our contribution with this work is to study the design trade-offs encountered when using FPGAs for data processing, as well as to provide a set of guidelines for how to make design choices such as:

(1) FPGAs have relatively low clock frequencies. Naïve designs will exhibit a large latency and low throughput. We show how this can be avoided by using asynchronous circuits. We also show that asynchronous circuits (such as sorting networks) are well suited for common data processing operations like comparisons and sorting.

(2) Asynchronous circuits are notoriously more difficult to design than synchronous ones. This has led to a preference for synchronous circuits in studies of FPGA usage [13]. Using the example of sorting networks, we illustrate systematic design guidelines to create asynchronous circuits that solve database problems.

(3) FPGAs provide inherent parallelism whose only limitation is the amount of chip space to accommodate parallel functionality. We show how this can be managed and demonstrate an efficient circuit for parallel stream processing.

(4) FPGAs can be very useful as database co-processors attached to an engine running on conventional CPUs. This integration is not trivial and opens up several questions on how an FPGA can fit into the complete architecture. In our work, we demonstrate an embedded heterogeneous multi-core setup and identify trade-offs in FPGA integration design.

(5) FPGAs are attractive co-processors because of the potential for tailored design and parallelism. We show that FPGAs are also very interesting in regard to power consumption, as they consume significantly less power yet provide performance comparable to that of conventional CPUs. This makes FPGAs good candidates for multi-core systems as cores to which certain data processing tasks can be offloaded.

To illustrate the trade-offs and as a running example, we describe the implementation of a median operator that depends on sorting as well as on arithmetics. We use it in a streaming fashion to illustrate sliding window functionality. The implementation we discuss in the paper is designed to illustrate the design space of FPGA-based co-processing. Our experiments show that FPGAs can clearly be a useful component of a modern data processing system, especially in the context of multi-core architectures.

Outline. We start our work by setting the context with related work (Section 2). After introducing the necessary technical background in Section 3, we illustrate the implementation of a median operator using FPGA hardware (Section 4). Its integration into a complete multi-core system is our topic for Section 5, before we evaluate our work in Section 6. We wrap up in Section 7.

2. RELATED WORK

A number of research efforts have explored how databases can use the potential of modern hardware architectures. Examples include optimizations for cache efficiency (e.g., [21]) or the use of vector primitives ("SIMD instructions") in database algorithms [29]. The QPipe [14] engine exploits multi-core functionality by building an operator pipeline over multiple CPU cores. Likewise, stream processors such as Aurora [2] or Borealis [1] are implemented as networks of stream operators. An FPGA with database functionality could directly be plugged into such systems to act as a node of the operator network.

The shift toward an increasing heterogeneity is already visible in terms of tailor-made graphics or network CPUs, which have found their way into commodity systems. Govindaraju et al. demonstrated how the parallelism built into graphics processing units can be used to accelerate common database tasks, such as the evaluation of predicates and aggregates [12]. The GPUTeraSort algorithm [11] parallelizes a sorting problem over multiple hardware shading units on the GPU. Within each unit, it achieves parallelization by using SIMD operations on the GPU processors. The AA-Sort [17], CellSort [9], and MergeSort [6] algorithms are very similar in nature, but target the SIMD instruction sets of the PowerPC 970MP, Cell, and Intel Core 2 Quad processors, respectively.

The use of network processors for database processing was studied by Gold et al. [10]. The particular benefit of such processors for database processing is their enhanced support for multi-threading.

We share our view on the role of FPGAs in upcoming system architectures with projects such as Kiwi [13] or Liquid Metal [15]. Both projects aim at off-loading traditional CPU tasks to programmable hardware. Mitra et al. [22] recently outlined how FPGAs can be used as co-processors in an SGI Altix supercomputer to accelerate XML filtering.

The advantage of using customized hardware as a database co-processor has been known for many years. For instance, DeWitt's DIRECT system comprises a number of query processors whose instruction sets embrace common database tasks such as join or aggregate operators [8]. Similar ideas have been commercialized recently in terms of database appliances sold by, e.g., Netezza [7], Kickfire [19], or XtremeData [16]. All of them appear to be based on specialized, hard-wired acceleration chips, which primarily provide a high degree of data parallelism. Our approach can be used to exploit the reconfigurability of FPGAs at runtime. By reprogramming the chip for individual workloads or queries, we can achieve higher resource utilization and implement data and task parallelism. By studying the foundations of FPGA-assisted database processing in detail, this work is an important step toward our goal of building such a system.

FPGAs are being successfully applied in signal processing, and we draw on some of that work in Sections 4 and 5. The particular operator that we use as a running example to demonstrate FPGA-based co-processing is a median over a sliding window. The implementation of a median with FPGAs has already been studied [27], but only on smaller values than the 32-bit integers considered in this paper. Our median implementation is similar to the sorting network proposed by Oflazer [24]. As we show in Section 6.1, we gain significant performance advantages by designing the network to run in an asynchronous mode.

3. OVERVIEW OF FPGAS

Field-programmable gate arrays are reprogrammable hardware chips for digital logic. FPGAs are an array of logic gates that can be configured to construct arbitrary digital circuits. These circuits are specified using either circuit schematics or hardware description languages such as Verilog or VHDL. A logic design on an FPGA is also referred to as a soft IP-core (intellectual property core). Existing commercial libraries provide a wide range of pre-designed cores, including those of complete CPUs. More than one soft IP-core can be placed onto an FPGA chip.
3.1 FPGA Architecture

Figure 1: Simplified FPGA architecture: 2D array of CLBs, each consisting of 4 slices and a switch box. Available in silicon: 2 PowerPC cores, BRAM blocks and multipliers.

Figure 1 sketches the architecture of the Xilinx Virtex-II Pro XC2VP30 FPGA used in this paper [28]. The FPGA is a 2D array of configurable logic blocks (CLBs). Each logic block consists of 4 slices that contain logic gates (in terms of lookup tables, see below) and a switch box that connects slices to an FPGA interconnect fabric.

In addition to the CLBs, FPGA manufacturers provide frequently-used functionality as discrete silicon components (hard IP-cores). Such hard IP-cores include block RAM (BRAM) elements (each containing 18 kbit fast storage) as well as 18×18-bit multiplier units. A number of Input/Output Blocks (IOBs) link to external RAM or networking devices. Two on-chip PowerPC 405 cores are directly wired to the FPGA fabric and to the BRAM components. Table 1 shows a summary of the characteristics of the FPGA used in this paper.

    PowerPC cores            2
    Slices                   13,696
    18 kbit BRAM blocks      136 (= 2,448 kbit, usable as 272 kB)
    18×18-bit multipliers    136
    I/O pads                 644

    Table 1: Characteristics of the Xilinx XC2VP30 FPGA.

A simplified circuit diagram of a programmable slice is shown in Figure 2. Each slice contains two lookup tables (LUTs) with four inputs and one output each. A LUT can implement any binary-valued function with four binary inputs. The output of the LUTs can be fed to a buffer block which can be configured as a register (flip-flop). The output is also fed to a multiplexer (MUXCY in Figure 2), which allows the implementation of fast carry logic.

Figure 2: Simplified Virtex-II Pro slice consisting of 2 LUTs and 2 register/latch components. The gray components are configured during programming.

3.2 Hardware Setup

FPGAs are typically available pre-mounted on a circuit board that includes additional peripherals. Such circuit boards provide an ideal basis for the assessment we perform here. Quantitative statements in this report are based on a Xilinx XUPV2P development board with a Virtex-II Pro XC2VP30 FPGA chip. Relevant for the discussion in this paper is the DDR DIMM socket, which we populated with a 512 MB RAM module. For terminal I/O of the software running on the PowerPC, an RS-232 UART interface is available. The board also includes a 100 Mbit Ethernet port.

The board is clocked at 100 MHz. This clock drives both the FPGA-internal buses and the external I/O connectors, such as the DDR RAM. The PowerPC cores are clocked at 300 MHz.

4. A STREAMING MEDIAN OPERATOR

As a running example suitable to illustrate the design of data processing operations in FPGAs, we have implemented an operator that covers many of the typical aspects of data-intensive operations, such as comparisons of data elements, sorting, and I/O issues. In this way the lessons learned from implementing this operator can be generalized to other operators using similar building blocks. The design illustrates many of the design constraints in FPGAs, which are very different from the design constraints encountered in conventional database engines. For instance, parallelism in a normal database is limited by the CPU and memory available. In an FPGA, it is limited by the chip space available. In a CPU, parallel threads may interfere with each other. In an FPGA, parallel circuits do not interfere at all, thereby achieving 100 % parallelism. Similarly, algorithms in a CPU look very different from the same algorithms implemented as circuits and, in fact, they have very different behavior and complexity patterns.

We illustrate many of these design aspects using a median operator over a count-based sliding window implemented on the aforementioned Xilinx board. This is an operator commonly used to, for instance, eliminate noise in sensor readings [25] and in data analysis tasks [26]. For illustration purposes and to simplify the figures and the discussion, we assume a window size of 8 tuples. For an input stream S, the operator can then be described in CQL [3] as

    Select median(v)
    From S [ Rows 8 ] .                                    (Q1)

The semantics of this query are illustrated in Figure 3. Attribute values v_i in input stream S are used to construct a new output tuple T'_i for every arriving input tuple T_i. A conventional (CPU-based) implementation would probably use a ring buffer to keep the last eight input values (we assume unsigned integer numbers), then, for each input tuple T_i,

(1) sort the window elements v_{i-7}, ..., v_i to obtain an ordered list of values v'_1 ≤ ... ≤ v'_8 and

(2) compute the mean value of v'_4 and v'_5, (v'_4 + v'_5)/2, to construct the output tuple T'_i (for an odd-sized window, the median would instead be the middle element of the sorted sequence).

We will shortly see how the data flow in Figure 3 directly leads to an implementation in FPGA hardware. Before that, we discuss the algorithmic part of the problem for Step (1).
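For reference, the conventional CPU-side evaluation just described (ring buffer, sort, mean of the two middle values) can be sketched in C as follows. This is a minimal sketch; the type and function names are our own and not taken from the paper's implementation.

    #include <stdint.h>
    #include <string.h>

    #define WIN 8                        /* count-based window size */

    typedef struct {
        uint32_t buf[WIN];               /* ring buffer with the last WIN values */
        int      pos;                    /* position of the next write */
    } median_state;

    /* Insert one tuple and return the median of the current window
     * (mean of the two middle elements, as in query Q1). */
    static uint32_t median_step(median_state *s, uint32_t v)
    {
        uint32_t w[WIN];
        int i;

        s->buf[s->pos] = v;
        s->pos = (s->pos + 1) % WIN;
        memcpy(w, s->buf, sizeof w);

        /* Step (1): sort the window; insertion sort suffices for 8 values. */
        for (i = 1; i < WIN; i++) {
            uint32_t x = w[i];
            int j = i - 1;
            while (j >= 0 && w[j] > x) { w[j + 1] = w[j]; j--; }
            w[j + 1] = x;
        }

        /* Step (2): mean of v'_4 and v'_5 (zero-based w[3] and w[4]). */
        return (uint32_t)(((uint64_t)w[WIN / 2 - 1] + w[WIN / 2]) >> 1);
    }

The sort dominates the per-tuple cost on a CPU; the FPGA implementation described next replaces it with a fixed sorting network whose comparators evaluate in parallel.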
Figure 3: Median aggregate over a count-based sliding window (window size 8).

4.1 Sorting

Sorting is the critical piece in the median operator and known to be particularly expensive on conventional CPUs. It is also a common data processing operation that can be very efficiently implemented in FPGAs using asynchronous circuits. Highly tuned and vectorized software implementations require on the order of fifty cycles to sort eight numbers on modern CPUs [6].

Sorting Networks. Some of the most efficient conventional approaches to sorting are also the best options in the context of FPGAs. Sorting networks are attractive in both scenarios, because they (i) do not require control flow instructions or branches and (ii) are straightforward to parallelize (because of their simple data flow pattern). On modern CPUs, sorting networks suggest the use of vector primitives, which has been demonstrated in [9, 11, 17].

Figure 4 illustrates two different networks that sort eight input values. Input data enters a network at the left end. As the data travels to the right, comparators each exchange two values, if necessary, to ensure that the larger value always leaves a comparator at the bottom. The bitonic merge network (Figure 4(a)) is based on a special property of bitonic sequences (i.e., those that can be obtained by concatenating two monotonic sequences). A component-wise merging of two such sequences always yields another bitonic sequence, which is efficiently brought into monotonic (i.e., sorted) order afterward. In an even-odd merge sorting network (Figure 4(b)), an input of 2^p values is split into two sub-sequences of length 2^(p-1). After the two 2^(p-1)-sized sequences have been sorted (recursively using even-odd merge sorting), an even-odd merger combines them into a sorted result sequence. Other sorting algorithms can be represented as sorting networks, too. For details we refer to the work of Batcher [4] or a textbook [20].

Figure 4: Sorting networks for 8 elements (inputs x_0, ..., x_7; outputs y_0, ..., y_7): (a) bitonic merge sorting network, (b) even-odd merge sorting network. Dashed comparators are not used for the median.

Sorting Network Properties. As can be seen in the two example networks in Figure 4, the number of comparisons required for a full network implementation depends on the particular choice of the network. The bitonic merge sorter for N = 8 inputs in Figure 4(a) uses 24 comparators in total, whereas the even-odd merge network (Figure 4(b)) needs only 19. For other choices of N, we list the required number of comparators in Table 2.

The graphical representation in Figure 4 indicates another important metric of sorting networks. Comparators with independent data paths can be grouped into processing stages and evaluated in parallel. The number of necessary stages is referred to as the depth S(N) of the sorting network. For eight input values, bitonic merge networks and even-odd merge networks both have a depth of six.

Compared to even-odd merge networks, bitonic merge networks exhibit two additional interesting characteristics: (i) all signal paths have the same length (by contrast, the data path from x_0 to y_0 in Figure 4(b) passes through three comparators, whereas the one from x_5 to y_5 involves six) and (ii) the number of comparators in each stage is constant (4 comparators per stage for the bitonic merge network, compared with 2–5 for the even-odd merge network).

CPU-Based Implementations. These two properties are the main reason why many software implementations of sorting have opted for a bitonic merge network, despite its higher comparator count (e.g., [9, 11]). Differences in path lengths may require explicit buffering for those values that do not actively participate in comparisons at specific processing stages. At the same time, additional comparators might cause no additional cost in architectures that can evaluate a number of comparisons in parallel using, for instance, the SIMD instruction sets of modern CPUs.

4.2 An FPGA Median Operator

Once the element for sorting is implemented using a sorting network, the complete operator can be implemented in an FPGA using the sketch in Figure 3. Each of the solid arrows corresponds to 32 wires in the FPGA interconnect fabric, carrying the binary representation of a 32-bit integer number. Sorting and mean computation can both be packaged into logic components, whose internals we now present.

Comparator Implementation on an FPGA. The data flow in the horizontal direction of Figure 4 also translates into wires on the FPGA chip. The entire network is obtained by wiring a set of comparators, each implemented in FPGA logic. The semantics of a comparator is easily expressible in the hardware description language VHDL (where <= indicates an assignment):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_unsigned.all;   -- unsigned interpretation of "<"

    entity comparator is
      port (a   : in  std_logic_vector(31 downto 0);
            b   : in  std_logic_vector(31 downto 0);
            min : out std_logic_vector(31 downto 0);
            max : out std_logic_vector(31 downto 0));
    end comparator;

    architecture behavioral of comparator is
    begin
      min <= a when a < b else b;
      max <= b when a < b else a;
    end behavioral;
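Wiring 19 such comparators according to Figure 4(b) yields the full even-odd merge network. As a software cross-check of that structure (our own formulation, not code from the paper), the network can be written as a fixed compare-exchange schedule; comparators within a stage touch disjoint lines and can therefore evaluate in parallel:

    #include <stdint.h>

    /* One comparator: the smaller value leaves on line i, the larger on line j. */
    static void cmp_exchange(uint32_t *x, int i, int j)
    {
        if (x[i] > x[j]) { uint32_t t = x[i]; x[i] = x[j]; x[j] = t; }
    }

    /* Even-odd merge sorting network for 8 inputs:
     * 19 comparators in 6 stages, i.e., C(8) = 19 and S(8) = 6. */
    static void even_odd_merge_sort8(uint32_t x[8])
    {
        /* stage 1 */ cmp_exchange(x,0,1); cmp_exchange(x,2,3); cmp_exchange(x,4,5); cmp_exchange(x,6,7);
        /* stage 2 */ cmp_exchange(x,0,2); cmp_exchange(x,1,3); cmp_exchange(x,4,6); cmp_exchange(x,5,7);
        /* stage 3 */ cmp_exchange(x,1,2); cmp_exchange(x,5,6);
        /* stage 4 */ cmp_exchange(x,0,4); cmp_exchange(x,1,5); cmp_exchange(x,2,6); cmp_exchange(x,3,7);
        /* stage 5 */ cmp_exchange(x,2,4); cmp_exchange(x,3,5);
        /* stage 6 */ cmp_exchange(x,1,2); cmp_exchange(x,3,4); cmp_exchange(x,5,6);
    }

On a CPU, a schedule like this is what SIMD sorting implementations vectorize (although, as noted above, software typically prefers the bitonic network); on the FPGA, every cmp_exchange corresponds to one instance of the comparator circuit, and each stage is one column of Figure 4(b).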

                  bubble/insertion       even-odd merge                        bitonic merge

    exact         C(N) = N(N−1)/2        C(2^p) = (p² − p + 4)·2^(p−2) − 1     C(2^p) = (p² + p)·2^(p−2)
                  S(N) = 2N − 3          S(2^p) = p(p+1)/2                     S(2^p) = p(p+1)/2

    asymptotic    C(N) = O(N²)           C(N) = O(N log²(N))                   C(N) = O(N log²(N))
                  S(N) = O(N)            S(N) = O(log²(N))                     S(N) = O(log²(N))

    N = 8         C(8) = 28              C(8) = 19                             C(8) = 24
                  S(8) = 13              S(8) = 6                              S(8) = 6

    Table 2: Comparator count C(N) and depth S(N) of different sorting networks.
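For N = 8 (i.e., p = 3), the exact formulas in Table 2 evaluate to the numbers quoted in the text:

    C_even-odd(2³) = (3² − 3 + 4)·2¹ − 1 = 10·2 − 1 = 19
    C_bitonic(2³)  = (3² + 3)·2¹         = 12·2     = 24
    S(2³)          = 3·(3 + 1)/2         = 6
    C_bubble(8)    = 8·7/2               = 28,   S_bubble(8) = 2·8 − 3 = 13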

Figure 5: FPGA implementation of a 32-bit comparator. Total space consumption is 48 slices (16 to compare and 32 to select minimum/maximum values).

The resulting logic circuit is shown in Figure 5. The 32 bits of the two inputs a and b are compared first (upper half of the circuit), yielding a Boolean output signal c for the outcome of the predicate a ≥ b. Signal c drives 2 × 32 multiplexers that connect the proper input lines to the output lines for min(a, b) and max(a, b) (lower half of the circuit). Equality comparisons (=) and multiplexers each occupy one lookup table on the FPGA, resulting in a total space consumption of 48 FPGA slices for each comparator.

The FPGA implementation in Figure 5 is particularly time efficient. All lookup tables are wired in a way such that all table lookups happen in parallel. Outputs are combined using the fast carry logic implemented in silicon for this purpose.

The Right Sorting Network for FPGAs. To implement a full bitonic merge sorting network, 24 comparators need to be plugged together as shown in Figure 4(a), resulting in a total space requirement of 1152 slices (or 8.4 % of the space of our Virtex-II Pro chip). The even-odd merge network (Figure 4(b)), by contrast, can do the same work with only 19 comparators, which amount to only 912 slices (≈ 6.7 % of the chip). Available slices are the scarcest resource in FPGA programming. The 20 % savings in space, therefore, makes even-odd merge networks preferable over bitonic merge sorters on FPGAs. The runtime performance of an FPGA-based sorting network depends exclusively on the depth of the network (which is the same for both networks).

Optimizing for the Median Operation. Since we are only interested in the computation of a median, a fully sorted data sequence is more than is required. Even with the dashed comparators in Figure 4 omitted, the average over y_3 and y_4 will still yield a correct median result. This optimization saves 2 comparators for the bitonic and 3 for the even-odd sorting network. Moreover, the even-odd-based network is now shortened by a full stage, reducing its execution time. The optimized network in Figure 4(b) now consumes only 16 comparators, i.e., 768 slices or 5.6 % of the chip.

Averaging Two Values in Logic. To obtain the final median value, we are left with the task of averaging the two middle elements in the sorted sequence. The addition of two integer values is a classic example of a digital circuit and, for 32-bit integers, consists of 32 full adders. To obtain the mean value, the 33-bit output must be divided by two or—expressed in terms of logic operations—bit-shifted by one. The bit shift, in fact, need not be performed explicitly in hardware. Rather, we can connect the upper 32 bits of the 33-bit sum directly to the operator output. Overall, the space consumption of the mean operator is 16 slices (two adders per slice).

Sliding Windows. The sliding window of the median operator is implemented as a 32-bit wide linear shift register with depth 8 (see Figure 6). The necessary 8 × 32 flip-flops occupy 128 slices (each slice contains two flip-flops).
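In software terms, the "keep the upper 32 of the 33 bits" trick is simply an addition performed at a wider width followed by a shift; a short C sketch (our own naming) makes the equivalence explicit:

    #include <stdint.h>

    /* Mean of two 32-bit values without overflow: the 33-bit sum is formed in a
     * 64-bit temporary; dropping its least significant bit is the same as wiring
     * the upper 32 of the 33 adder output bits to the operator output. */
    static inline uint32_t mean32(uint32_t a, uint32_t b)
    {
        uint64_t sum33 = (uint64_t)a + (uint64_t)b;
        return (uint32_t)(sum33 >> 1);
    }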
Figure 6: Sliding window implementation as an 8 × 32 linear shift register (32-bit data input, shift clock/data write, sorting network and mean computation in the user logic of the aggregation core).

4.3 Generalization to Other Operators

The ideas presented here in the context of the median operator are immediately applicable to a wide range of other common operators. Operators such as selection, projection, and simple arithmetic operations (max, min, sum, etc.) can be implemented as a combination of logical gates and simple circuits similar to the ones presented here. We described one strategy to obtain such circuits in [23].

As the designs described above show, the overhead of operators implemented in an FPGA is very low. In addition, as shown in the examples, it is possible to execute many such operators in parallel, which yields higher throughput and lower latency than the typical sequential execution in CPUs.

Sorting is a common and expensive operation in many queries. Data processed by the FPGA and forwarded to the CPU or the disk can be sorted as explained above with little impact on performance. Similarly, using asynchronous circuits, subexpressions of predicates of selection operators can be executed in parallel.
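To make this generalization concrete, the following C model (illustrative only; the names and the folding structure are our own and not the circuits of [23]) shows how such operators reduce to the same branch-free min/max/compare primitives used by the comparator above; each primitive maps to a small combinational circuit, and independent primitives can be laid out side by side on the chip:

    #include <stdint.h>
    #include <stddef.h>

    /* The two outputs a hardware comparator produces. */
    static inline uint32_t min32(uint32_t a, uint32_t b) { return a < b ? a : b; }
    static inline uint32_t max32(uint32_t a, uint32_t b) { return a < b ? b : a; }

    /* A windowed aggregate is a fold over one primitive; on the FPGA this
     * becomes a chain (or a tree, for lower depth) of identical circuits. */
    static uint32_t window_max(const uint32_t *w, size_t n)
    {
        uint32_t m = w[0];
        for (size_t i = 1; i < n; i++)
            m = max32(m, w[i]);
        return m;
    }

    /* A selection predicate is a pure function of the tuple: as combinational
     * logic, both subexpressions are evaluated in parallel before the AND. */
    static inline int select_range(uint32_t v, uint32_t lo, uint32_t hi)
    {
        return (v >= lo) & (v <= hi);
    }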
5. SYSTEM DESIGN

So far we have looked at our FPGA-based database operator as an isolated component. However, FPGAs are likely to be used to complement regular CPUs in a variety of configurations, for instance to offload certain processing stages of a query plan or to filter an incoming stream before feeding it into the CPU for further processing.

In conventional databases, the linking of operators among themselves and to other parts of the system is a well understood problem. In FPGAs, these connections can have a critical impact on the effectiveness of FPGA co-processing. In addition, there are many more options to be considered in terms of the resources available at the FPGA, such as using the built-in PowerPC CPUs and soft IP-cores implementing communication buses or controller components for various purposes. In this section we illustrate the trade-offs in this part of the design and show how hardware connectivity of the elements differs from connectivity in software.

5.1 System Overview

Using the Virtex-II Pro-based development board described in Section 3.2, we have implemented the embedded system shown in Figure 7. We only use one of the two available PowerPC cores (our experiments indicate that the use of a second CPU core would not lead to improved throughput).

Figure 7: Architecture of the on-chip system: PowerPC core, 3 aggregation cores, BRAM for program, interface to external DDR RAM and UART for terminal I/O.

The system further consists of two buses of different width and purpose. The 64-bit wide processor local bus (PLB) is used to connect memory and fast peripheral components (such as network cards) to the PowerPC core. The 32-bit wide on-chip peripheral bus (OPB) is intended for slow peripherals, to keep them from slowing down fast bus transactions. The two buses are connected by a bridge. The driver code executed by the PowerPC core (including code for our measurements) is stored in 128 kB block RAM connected to the PLB.

Two soft IP-cores provide controller functionality to access external DDR RAM and a serial UART connection link (RS-232). They are connected to the input/output blocks (IOBs) of the FPGA chip. We equipped our system with 512 MB external DDR RAM and used a serial terminal connection to control our experiments.

Our streaming median operator participates in the system inside a dedicated processing core, dubbed "aggregation core" in Figure 7. More than one instance of this component can be created at a time, all of which are connected to the PLB. An aggregation core consists of user logic, as described in detail in the previous section. A parameterizable IP interface (IPIF, provided by Xilinx as a soft IP-core) provides the glue logic to connect the user component to the bus. In particular, it implements the bus protocol and handles bus arbitration and DMA transfers. A similar IPIF component with the same interface on the user-logic side is also available for the OPB. However, since we aim for high data throughput, we chose to attach the aggregation cores to the faster PLB.

5.2 Putting it All Together

Many operators involve frequent iteration over the data; data transfers to and from memory; and data acquisition from the network or disks. As in conventional databases, these interactions can completely determine the overall performance. It is thus of critical importance to design the memory/CPU/circuits interfaces so as to optimize performance.

To illustrate the design options and the trade-offs involved, we consider three configurations (attachments of the aggregation core to the CPU) of the FPGA. These configurations are based on registers connected to the input signals of the IP-core and mapped into the memory space of the CPU.
Information can then be sent between the aggregation core and the CPU using load/store instructions.

Figure 8: Attachment of the aggregation core through memory-mapped registers (DATA_IN and AGG_OUT).

Configuration 1: Slave Registers. The first approach uses two 32-bit registers DATA_IN and AGG_OUT as shown in Figure 8. The IP interface is set to trigger a clock signal upon a CPU write into the DATA_IN register. This signal causes a shift in the shift register (thereby pulling the new tuple from DATA_IN) and a new data set to start propagating through the sorting network. A later CPU read instruction for AGG_OUT then reads out the newly computed aggregate value.

This configuration is simple and uses few resources. However, it has two problems: lack of synchronization and poor bandwidth usage.

In this configuration the CPU and the aggregation core are accessing the same registers concurrently with no synchronization. The only way to avoid race conditions is to add artificial time delays between the access operations.

In addition, each tuple in this configuration requires two 32-bit memory accesses (one write followed by one read). Given that the CPU and the aggregation core are connected to a 64-bit bus (and hence could transmit up to 2 × 32 bits per cycle), this is an obvious waste of bandwidth.
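From the CPU side, configuration 1 is nothing more than two load/store accesses per tuple. The sketch below illustrates this (and the missing handshake); the register addresses and the delay loop are placeholders of our own choosing, since the real addresses depend on how the IPIF maps the slave registers into the PowerPC address space:

    #include <stdint.h>

    /* Hypothetical physical addresses of the two slave registers. */
    #define DATA_IN  ((volatile uint32_t *)0x80000000u)
    #define AGG_OUT  ((volatile uint32_t *)0x80000004u)

    /* Push one tuple and read back the median of the current window. */
    static uint32_t median_via_slave_registers(uint32_t tuple)
    {
        *DATA_IN = tuple;          /* the store triggers shift + sorting network */

        /* No handshake exists in this configuration: the CPU must simply wait
         * long enough for the result to stabilize before reading it back. */
        for (volatile int i = 0; i < 16; i++)
            ;                      /* artificial delay (placeholder) */

        return *AGG_OUT;           /* the load returns the computed aggregate */
    }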
Configuration 2: FIFO Queues. The second configuration we explore solves the lack of synchronization by introducing FIFO queues between the CPU and the aggregation core (Figure 9). Interestingly, this is the same solution as the one adopted in data stream management systems to decouple operators.

Figure 9: Attachment of the aggregation core through Write-FIFO and Read-FIFO queues.

The CPU writes tuples into the Write-FIFO (WFIFO) and reads median values from the Read-FIFO queue (RFIFO). The two queues are implemented in the IPIF using additional block RAM components (BRAM). The aggregation core independently dequeues items from the Write-FIFO queue and enqueues the median results into the Read-FIFO queue. Status registers in both queues allow the CPU to determine the number of free slots (write queue) and the number of available result items (read queue).

This configuration avoids the need for explicit synchronization. There is still the drawback that the interface uses only 32 bits of the 64 available on the bus. The mismatch between a 64-bit access on the CPU side and a 32-bit width on the aggregation core turns out to be an inherent problem of using a general-purpose FIFO implementation (such as the one provided with the Xilinx IPIF interface). Re-implementing the FIFO functionality in user logic can remedy this deficiency, as we describe next.

Configuration 3: Master Attachment. In the previous configuration, access is through a register that cannot be manipulated in 64-bit width. Instead of using a register through a bus, we can use memory mapping between the aggregation core and the CPU to achieve a full 64-bit transfer width. The memory mapping is now done on the basis of contiguous regions rather than a single address. Two regions are needed, one for input and one for output. These memory regions correspond to local memory in the aggregation core and are implemented using BRAMs.

We can improve on this approach even further by taking advantage of the fact that the transfers to/from these regions can be offloaded to a DMA controller. We have considered two options: one with the DMA controller run by the CPU and one with the DMA controller run in (the IPIF of) the aggregation core. Of these two options, the latter is preferable since it frees the DMA controller of the CPU to perform other tasks. In the following, we call this configuration master attachment. In Figure 10, we show all the memory-mapped registers the CPU uses to set up the transfers, although we do not discuss them here in detail for lack of space. The figure also shows the interrupt line used to notify the CPU that new results are available.

The master attachment configuration has the advantage that the aggregation core can independently initiate the write-back of results once they are ready, without having to synchronize with an external DMA controller. This reduces latency, uses the full available bandwidth, and gives the aggregation core control over the flow of data, leaving the CPU free to perform other work and thereby increasing the chances for parallelism.

6. EVALUATION

We evaluated the different design options described above. All experiments were done on the Xilinx XUPV2P development board. Our focus is on the details of the soft IP-core, and we abstract from effects caused, for example, by I/O (network and disks) by performing all the processing into and out of off-chip memory (512 MB DDR RAM).
Figure 10: Master attachment of the aggregation core, supporting DMA transfers to external memory (memory-mapped CONTROL, STATUS, SRC_ADDR, DST_ADDR, and LENGTH registers; sequencer/state machine and BRAM buffers in the user logic; interrupt line to the CPU).

Figure 11: Synchronous implementation of the aggregation core requires 6 clock cycles, i.e., 60 ns. In an asynchronous implementation the output is ready after 13.3 ns (the output signals can be read after 2 cycles).

6.1 Asynchronous vs. Synchronous Designs

We first consider and evaluate possible implementations of the sorting network discussed in Section 4.1. As indicated, the even-odd merge network is more space efficient, so this is the one we consider here. The implementation options are important in terms of the overall latency of the operator which, in turn, will determine how fast data streams can be processed.

Asynchronous design. We start by considering an asynchronous design. The eight 32-bit signals are applied at the input of the sorting network and then ripple down the stages of the sorting network. Until the correct result has stabilized at the output, signals have to traverse up to five comparator stages. The exact latency of the sorting network, the signal propagation delay, depends on the implementation of the comparator element and on the on-chip routing between the comparators.

The total propagation delay is determined by the longest signal path. For a single comparator, this path starts in the equality comparison LUT, passes through 32 carry logic multiplexers, and ends at one min/max multiplexer. According to the FPGA data sheet [28], the propagation delay for a single 4-input LUT is 0.28 ns. The carry logic multiplexers and the switching network cause an additional delay. The overall latency for the median output to appear after the input is set can be computed with a simulator provided by Xilinx that uses the post-routing and element timing data of the FPGA. (One might be tempted to physically measure the latency of the sorting network by connecting the median operator directly to the I/O pins of the FPGA. However, signal buffers at the inputs and outputs (IOBs) of the FPGA and the switching network in between add significant latency (up to 10 ns). Any such measurement is bound to be inaccurate.)

For our implementation we obtain a latency of 13.3 ns. An interesting point of reference is the performance of a tuned SIMD implementation on current CPU hardware. It has been suggested that 50 CPU cycles is the minimum required to sort 8 elements on a modern general-purpose CPU [6]. For a fast 3.22 GHz processor, this corresponds to ≈ 15 ns, 13 % more than the FPGA used in our experiments. The short latency is a consequence of a deliberate design choice: our circuit operates in a strictly asynchronous fashion, not bound to any external clock.

Synchronous design. In a traditional synchronous implementation all circuit elements use a common clock. Registers are then necessary between each of the five stages of the sorting network. A synchronous implementation of the sorting network in Section 4 inherently uses six clock cycles (i.e., 60 ns in a 100 MHz system) to sort eight elements.

Both design choices are illustrated in Figure 11. In this figure, the gray-shaded time intervals indicate switching phases during which actual processing happens (i.e., when signals are changing). During intervals shown in white, signals are stable. The registers are used as buffers until the next clock cycle. As the figure shows, the switching phase is shorter than the clock length.

Comparison. The latency of the asynchronous design is 13.3 ns. Taking into consideration that the sorting network needs to be connected to other elements that are synchronous (clock-driven), the effective latency is 2 clock cycles or 20 ns. The latency of the synchronous design is 60 ns or 6 cycles, clearly slower than the asynchronous circuit. On the other hand, the synchronous circuit has a throughput of one tuple per cycle, while the asynchronous circuit has a throughput of 1 tuple every 2 cycles. The synchronous implementation requires more space due to the additional hardware (flip-flops) necessary to implement the registers between the comparator stages. The space needed is given by:

    (5 stages × 8 elements + 1 sum) × 32 bits = 1312 flip-flops/core ≡ 5 % of the FPGA/core.
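The numbers used in this comparison follow directly from the clock rates and the stage count:

    50 cycles / 3.22 GHz      ≈ 15.5 ns            (CPU SIMD sort of 8 elements)
    6 cycles / 100 MHz        = 60 ns              (synchronous design)
    ⌈13.3 ns / 10 ns⌉ · 10 ns = 20 ns              (asynchronous design, rounded up to 2 clock cycles)
    (5 · 8 + 1) · 32          = 1312 flip-flops    (registers of the synchronous design)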
Figure 12: Total execution time to process data streams of different size on the FPGA-based aggregation core (curves: slave register and DMA master attachment; data sizes 16 B to 64 kB; execution time in µs).

Figure 13: Execution time for processing a single 256 MB data set on different CPUs using different sorting algorithms and on the FPGA (algorithms: bubble, merge, quick, heap, even-odd; platforms: x86-64, Cell PPE, PPC G5, PPC G4, PPC 405; time in seconds).

The higher complexity of asynchronous circuits has led many FPGA designs to rely solely on synchronous circuits [13]. Our results indicate, however, that for data processing there are simple asynchronous designs that can significantly reduce latency (at the cost of throughput). In terms of transforming algorithms into asynchronous circuits, not all problems can be expressed in an asynchronous way. From a theoretical point of view, every problem whose output signals depend only on the input signals can be converted into an asynchronous (combinational) circuit. The necessary circuit can be of significant size, however (while synchronous circuits may be able to re-use the same logic elements in more than one stage). A more practical criterion can be obtained by looking at the algorithm that the circuit mimics in hardware. As a rule of thumb, algorithms that require a small amount of control logic (branches or loops) and have a simple data flow pattern are the most promising candidates for asynchronous implementations.

6.2 Median Operator

We now compare two of the configurations discussed in Section 5.2 and then evaluate the performance of the complete aggregation core using the best configuration.

We compare configuration 1 (slave register) with configuration 3 (master attachment). We use maximum-sized DMA transfers (4 kB) between external memory and the FPGA block RAM to minimize the overhead spent on interrupt handling. We do not consider configuration 2 (FIFO queues) because it does not offer a performance improvement over configuration 1.

Figure 12 shows the execution time for streams of varying size up to 64 kB. While we see a linearly increasing execution time for configuration 1, configuration 3 requires a constant execution time of 96 µs for all data sizes up to 4 kB, then scales linearly with increasing data sizes (this trend continues beyond 64 kB). This is due to the latency incurred by every DMA transfer (up to 4 kB can be sent within a single transfer). The 96 µs are the total round-trip time, measured from the time the CPU writes to the control register in order to initiate the Read-DMA transfer until it receives the interrupt.

These results indicate that configuration 1 (slave registers) is best for processing small amounts of data or streams with low arrival rates. Configuration 3 (master attachment) is best for large amounts of data (greater than 4 kB) or data streams with very high arrival rates, so that the tuples can be batched.

Using configuration 3, we have also measured the time it takes for the complete median operator to process 256 MB of data consisting of 4-byte tuples. It takes 6.173 seconds to process all the data, at a rate of more than 10 million tuples per second. This result is shown as the horizontal line in Figure 13.
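The quoted rate follows directly from the data volume and the measured wall-clock time:

    256 MB / 4 B per tuple        ≈ 67.1 million tuples
    67.1 million tuples / 6.173 s ≈ 10.9 million tuples per second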
6.3 FPGA Performance in Perspective

FPGAs can be used as co-processors of data processing engines running on conventional CPUs. This, of course, presumes that using the FPGA to run queries or parts of queries does not result in a net performance loss. In other words, the FPGA must not be significantly slower than the CPU. Achieving this is not trivial because of the much slower clock rates on the FPGA.

Here we study the performance of the FPGA compared to that of CPUs when running on a single data stream. Later on we are going to consider parallelism.

To ensure that the choice of a software sorting algorithm is not a factor in the comparison, we have implemented eight different sorting algorithms in software and optimized them for performance. Seven are traditional textbook algorithms: quick sort, merge sort, heap sort, gnome sort, insertion sort, selection sort, and bubble sort. The eighth is an implementation of the even-odd merge sorting network of Section 4.1 using CPU registers.

We ran the different algorithms on several hardware platforms. We used an off-the-shelf desktop Intel x86-64 CPU (2.66 GHz Intel Core2 quad-core Q6700) and the following PowerPC CPUs: a 1 GHz G4 (MCP7457), a 2.5 GHz G5 Quad (970MP), the PowerPC element (PPE, not the SPEs) of the Cell, and the embedded 405 core of our FPGA. All implementations are single-threaded. For illustration purposes, we limit our discussion to the most relevant subset of algorithms.

Figure 13 shows the wall-clock time observed when processing 256 MB (as 32-bit tuples) through the median sliding window operator. The horizontal line indicates the execution time of the FPGA implementation. Timings for the merge, quick, and heap sort algorithms on the embedded PowerPC core exceed the range shown in the figure. All software implementations were clearly CPU-bound.
    Intel Core 2 Q6700:
      Thermal Design Power (CPU only)          95 W
      Extended HALT Power (CPU only)           24 W
      Measured total power (230 V)            102 W
    Xilinx XUPV2P development board:
      Calculated power estimate (FPGA only)    1.3 W
      Measured total power (230 V)             8.3 W

    Table 3: Power consumption of an Intel Q6700-based desktop system and the Xilinx XUPV2P FPGA board used in this paper. Measured values are under load when running median computation.

    cores    flip-flops    LUTs     slices      %
      0         1761        1670     1905      13.9 %
      1         3727        6431     4997      36.5 %
      2         5684       10926     7965      58.2 %
      3         7576       15597    11004      80.3 %
      4         9512       20121    13694     100.0 %

    Table 4: FPGA resource usage. The entry for 0 cores represents the space required to accommodate all the necessary circuitry external to the aggregation cores (UART, DDR controller, etc.).

It is also worth noting that, given the small window, the constant factors and implementation overheads of each algorithm predominate and, thus, the results do not match the known asymptotic complexity of each algorithm.

The performance observed indicates that the implementation of the operator on the FPGA is comparable to that of conventional CPUs. In the cases where it is worse, it is not significantly slower. Therefore, the FPGA is a viable option for offloading data processing out of the CPU, which can then be devoted to other purposes. When power consumption and parallel processing are factored in, FPGAs look even more interesting as co-processors for data management.

6.4 Power Consumption

While the slow clock rate of our FPGA (100 MHz) reduces performance, there is another side to this coin. The power consumption of a logic circuit depends linearly on the frequency at which it operates (U and f denote voltage and frequency, respectively):

    P ∝ U² × f .

Therefore, we can expect our 100 MHz circuit to consume significantly less energy than the 3.2 GHz x86-64.

It is difficult to reliably measure the power consumption of an isolated chip. Instead, we chose to list some approximate figures in Table 3. Intel specifies the power consumption of our Intel Q6700 to be between 24 and 95 W (the former figure corresponds to the "Extended HALT Powerdown State") [18]. For the FPGA, a power analyzer provided by Xilinx reports an estimated consumption of 1.3 W.

More meaningful from a practical point of view is the overall power requirement of a complete system under load. Therefore, we took both our systems, unplugged all peripherals not required to run the median operator, and measured the power consumption of both systems at the 230 V wall socket. As shown in Table 3, the FPGA has a 12-fold advantage (8.3 W over 102 W) compared to the CPU-based solution here.

As energy costs and environmental concerns continue to grow, the consumption of electrical power (the "carbon footprint" of a system) is becoming an increasingly decisive factor in system design. Though the accuracy of each individual number in Table 3 is not high, our numbers clearly show that adding a few FPGAs can be more power-efficient than simply adding CPUs in the context of many-core architectures.

Modern CPUs have sophisticated power management, such as dynamic frequency and voltage scaling, that allows them to reduce idle power. FPGAs offer power management even beyond that, and many techniques from traditional chip design can directly be used in an FPGA context. For example, using clock gating, parts of the circuit can be completely disabled, including clock lines. This significantly reduces the idle power consumption of the FPGA chip.
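As a rough, back-of-the-envelope reading of this proportionality and of the measurements in Table 3 (the two ratios are not directly comparable, since core voltages differ and the wall-socket figures include board, memory, and power-supply overheads on both sides):

    f_CPU / f_FPGA                  = 2.66 GHz / 100 MHz ≈ 27
    P_CPU system / P_FPGA board     = 102 W / 8.3 W      ≈ 12.3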
6.5 Parallelism: Space Management

Another advantage of FPGAs is their inherent support for parallelism. By instantiating multiple aggregation cores in FPGA hardware, multiple data streams can be processed truly in parallel. The number of instances that can be created is determined both by the size of the FPGA, i.e., its number of slices, and by the capacity of the FPGA interconnect fabric.

We placed four instances of the median aggregation core on the Virtex-II Pro. Table 4 shows the resource usage depending on the number of aggregation cores. We also give the usage in percent of the total number of available slices (13,696). Note that there is a significant difference in size between the space required by the median operator (700 to 900 slices) and the space required by the complete aggregation core (about 3,000 slices). This overhead comes from the additional circuitry necessary to put the median operator into configuration 3 discussed above.

The use of parallelism brings forth another design trade-off characteristic of FPGAs. To accommodate four aggregation cores, the VHDL compiler starts trading latency for space by placing unrelated logic together into the same slice, resulting in longer signal paths and thus longer delays. This effect can also be seen in Figure 14, where we illustrate the space occupied by the four aggregation cores. Occupied space regions are not contiguous, which increases signal path lengths.

The longer path lengths have a significant implication for asynchronous circuits. Without any modification, the median operator produces incorrect results. The longer signal paths result in longer switching phases in the sorting network, leading to an overall latency of more than two cycles (20 ns). Incorrect data reading can be avoided by introducing a wait cycle and reading the aggregation result three cycles after setting the input signals. This implies that asynchronous circuits need to be treated more carefully if used in high-density scenarios where most of the FPGA floor space is used.

Other FPGA models such as the Virtex-5 have significantly larger arrays (7.6 times larger than our Virtex-II Pro) and higher clocks (5.5 times). On such a chip, assuming that a single core requires 3,000 slices, we estimate that ≈ 30 aggregation cores can be instantiated, provided that the memory bandwidth does not further limit this number.
Figure 14: Resource usage on the FPGA chip (floorplan) by the 4 aggregation cores and the remaining system components (CPU, BRAM, UART, interface to external RAM, etc.).

Figure 15: Total execution time to process multiple data streams using concurrent aggregation cores (1 to 4 cores; data sizes from 1 kB to 256 MB).

    streams    FPGA      PowerPC 405           speedup
                         seq.       alt.       seq.    alt.
    1          1.54 s    10.1 s     –          7×      –
    2          1.56 s    20.2 s     36.7 s     13×     24×
    3          1.58 s    30.4 s     55.1 s     19×     35×
    4          1.80 s    40.5 s     73.5 s     22×     41×

    Table 5: Execution times for different numbers of concurrent streams (64 MB data set per stream).

6.6 Parallelism: Performance

We used the four aggregation cores mentioned above to run up to four independent data streams in parallel. We ran streams of increasing size over configurations with an increasing number of cores. Figure 15 shows the wall-clock execution times for processing multiple data streams in parallel, each on a separate aggregation core. Table 5 summarizes the execution times for a stream of 64 MB (column 'FPGA').

The first important conclusion is that running additional aggregation cores has close to no impact on the other cores. The slight increase with the addition of the fourth core comes from the need to add the wait cycle mentioned above. This shows that by adding multiple cores, throughput is increased as multiple streams can be processed concurrently (Table 5). The second observation is that the execution times scale linearly with the size of the data set, as is to be expected. The flat part of the curves is the same effect observed before for stream sizes smaller than 4 kB. The graph also indicates that, since each core is working on a different stream, we are getting linear scale-out in throughput with the number of aggregation cores. It is also interesting to note that with four cores we did not reach the limit in memory bandwidth, neither on the DDR RAM nor on the PLB.

One last question that remains open is whether a similar parallelism could be achieved with a single CPU. Table 5 contains the execution times obtained with a CPU-only implementation for multiple streams, assuming either sequential processing (one stream after the other) or tuple-wise alternation between streams. Cache conflicts lead to a significant performance degradation in the latter case.

Clearly, a single CPU cannot provide the same level of parallelism as an FPGA. Obviously, this could be achieved with more CPUs, but at a considerable expense. From this and the previous results, we conclude that FPGAs offer a very attractive platform as data co-processors and that they can be effectively used to run data processing operators.

7. SUMMARY

In this paper we have assessed the potential of FPGAs as co-processors for data-intensive operations in the context of multi-core systems. We have illustrated the type of data processing operations where FPGAs have performance advantages (through parallelism and low latency) and discussed several ways to embed the FPGA into a larger system so that the performance advantages are maximized. Our experiments show that FPGAs bring additional advantages in terms of power consumption. These properties make FPGAs very interesting candidates for acting as additional cores in the heterogeneous many-core architectures that are likely to become pervasive. The work reported in this paper is a first but important step to incorporate the capabilities of FPGAs into data processing engines in an efficient manner. The higher design costs of FPGA-based implementations may still amortize, for example, if a higher throughput (using multiple parallel processing cores as shown in the previous section) can be obtained in an FPGA-based stream processing system for a large fraction of queries.

As part of future work we intend to explore a tighter integration of the FPGA with the rest of the computing infrastructure, an issue also at the top of the list for many FPGA manufacturers. Modern FPGAs can directly interface to high-speed bus systems, such as the HyperTransport bus, or even intercept the execution pipeline of general-purpose CPUs, opening up many interesting possibilities for using the FPGA in different configurations.

Acknowledgements

We would like to thank Laura and Peter Haas for their valuable insights and help in improving the presentation of the paper.
This project is funded in part by the Enterprise Comput- (FCCM), 2008.
ing Center of ETH Zurich (http://www.ecc.ethz.ch/). [14] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki.
QPipe: A Simultaneously Pipelined Relational Query
8. REFERENCES Engine. In Proc. of the 2005 ACM SIGMOD Int’l
[1] D. J. Abadi, Y. Ahmad, M. Balazinska, Conference on Management of Data, Baltimore, MD,
U. Cetintemel, M. Cherniack, J.-H. Hwang, USA, June 2005.
W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, [15] S. S. Huang, A. Hormati, D. Bacon, and R. Rabbah.
N. Tatbul, Y. Xing, and S. Zdonik. The Design of the Liquid Metal: Object-Oriented Programming Across
Borealis Stream Processing Engine. In Conference on the Hardware/Software Boundary. In European
Innovative Data Systems Research (CIDR), Asilomar, Conference on Object-Oriented Programming, Paphos,
CA, USA, January 2005. Cyprus, July 2008.
[2] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, [16] Xtreme Data Inc. http://www.xtremedatainc.com/.
Ch. Convey, S. Lee, M. Stonebraker, N. Tatbul, and [17] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani.
S. Zdonik. Aurora: A New Model and Architecture for AA-Sort: A New Parallel Sorting Algorithm for
Data Stream Management. The VLDB Journal, 12(2), Multi-Core SIMD Processors. In Int’l Conference on
July 2003. Parallel Architecture and Compilation Techniques
[3] A. Arasu, S. Babu, and J. Widom. The CQL (PACT), Brasov, Romania, September 2007.
continuous query language: semantic foundations and [18] Intel Corp. Intel Core 2 Extreme Quad-Core Processor
query execution. The VLDB Journal, 15(2), June XQ6000 Sequence and Intel Core 2 Quad Processor
2006. Q600 Sequence Datasheet, August 2007.
[4] K. E. Batcher. Sorting Networks and Their [19] Kickfire. http://www.kickfire.com/.
Applications. In AFIPS Spring Joint Computer [20] D. E. Knuth. The Art of Computer Programming,
Conference, 1968. Volume 3: Sorting and Searching. Addison-Wesley,
[5] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, Frans 2nd edition, 1998.
Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, [21] S. Manegold, P. A. Boncz, and M. L. Kersten.
Y. Dai, Y. Zhang, and Z. Zhang. Corey: An Operating Optimizing Database Architecture for the New
System for Many Cores. In USENIX Symposium on Bottleneck: Memory Access. The VLDB Journal,
Operating Systems Design and Implementation 9(3), December 2000.
(OSDI), San Diego, CA, USA, December 2008. [22] A. Mitra, M. R. Vieira, P. Bakalov, V. J. Tsotras, and
[6] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, W. A. Najjar. Boosting XML Filtering Through a
Mostafa Hagog, Y.-K. Chen, A. Baransi, S. Kumar, Scalable FPGA-based Architecture. In Conference on
and P. Dubey. Efficient Implementation of Sorting on Innovative Data Systems Research (CIDR), Asilomar,
Multi-Core SIMD CPU Architecture. Proc. VLDB CA, USA, 2009.
Endowment, 1(2), 2008. [23] R. Mueller, J. Teubner, and G. Alonso. Streams on
[7] Netezza Corp. http://www.netezza.com/. Wires – A Query Compiler for FPGAs. Proc. VLDB
[8] D. DeWitt. DIRECT—A Multiprocessor Organization Endowment, 2(1), 2009.
for Supporting Relational Database Management [24] K. Oflazer. Design and Implementation of a
Systems. IEEE Trans. on Computers, c-28(6), June Single-Chip 1-D Median Filter. IEEE Trans. on
1979. Acoustics, Speech and Signal Processing, 31, October
[9] B. Gedik, R. R. Bordawekar, and P. S. Yu. CellSort: 1983.
High Performance Sorting on the Cell Processor. In [25] L. Rabiner, M. Sambur, and C. Schmidt. Applications
Proc. of the 33rd Int’l Conference on Very Large Data of a Nonlinear Smoothing Algorithm to Speech
Bases (VLDB), Vienna, Austria, September 2007. Processing. IEEE Trans. on Acoustics, Speech and
[10] B. T. Gold, A. Ailamaki, L. Huston, and Babak Signal Processing, 23(6), December 1975.
Falsafi. Accelerating Database Operators Using a [26] J. W. Tukey. Exploratory Data Analysis.
Network Processor. In Int’l Workshop on Data Addison-Wesley, 1977.
Management on New Hardware (DaMoN), Baltimore, [27] P. D. Wendt, E. J. Coyle, and N. J. Gallagher, Jr.
MD, USA, June 2005. Stack Filters. IEEE Trans. on Acoustics, Speech and
[11] N. K. Govindaraju, J. Gray, R. Kumar, and Signal Processing, 34(4), August 1986.
D. Manocha. GPUTeraSort: High Performance [28] Xilinx Inc. Virtex-II Pro and Virtex-II Pro X Platform
Graphics Co-processor Sorting for Large Database FPGAs: Complete Data Sheet, v4.2 edition, 2007.
Management. In Proc. of the 2006 ACM SIGMOD [29] J. Zhou and K. A. Ross. Implementing Database
Int’l Conference on Management of Data, Chicago, IL, Operations using SIMD Instructions. In Proc. of the
USA, June 2006. 2002 ACM SIGMOD Int’l Conference on Management
[12] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and of Data, Madison, WI, USA, June 2002.
D. Manocha. Fast Computation of Database
Operations Using Graphics Processors. In Proc. of the
2004 ACM SIGMOD Int’l Conference on Management
of data, Paris, France, 2004.
[13] D. Greaves and S. Singh. Kiwi: Synthesis of FPGA
Circuits from Parallel Programs. In IEEE Symposium
on Field-Programmable Custom Computing Machines
