Data Processing on FPGAs
ABSTRACT
Computer architectures are quickly changing toward heterogeneous many-core systems. Such a trend opens up interesting opportunities but also raises immense challenges since the efficient use of heterogeneous many-core systems is not a trivial problem. In this paper, we explore how to program data processing operators on top of field-programmable gate arrays (FPGAs). FPGAs are very versatile in terms of how they can be used and can also be added as additional processing units in standard CPU sockets.
In the paper, we study how data processing can be accelerated using an FPGA. Our results indicate that efficient usage of FPGAs involves non-trivial aspects such as having the right computation model (an asynchronous sorting network in this case); a careful implementation that balances all the design constraints in an FPGA; and the proper integration strategy to link the FPGA to the rest of the system. Once these issues are properly addressed, our experiments show that FPGAs exhibit performance figures competitive with those of modern general-purpose CPUs while offering significant advantages in terms of power consumption and parallel stream evaluation.

sors; and there will be highly specialized cores such as field-programmable gate arrays (FPGAs) [13, 22]. An example of such a heterogeneous system is the Cell Broadband Engine, which contains, in addition to a general-purpose core, multiple special execution cores (synergistic processing elements, or SPEs).
Given that existing applications and operating systems already have significant problems when dealing with multi-core systems [5], such diversity adds yet another dimension to the complex task of adapting data processing software to new hardware platforms. Unlike in the past, it is no longer just a question of taking advantage of specialized hardware, but a question of adapting to new, inescapable architectures.
In this paper, we focus our attention on FPGAs as one of the more different elements that can be found in many-core systems. FPGAs are (re-)programmable hardware that can be tailored to almost any application. However, it is as yet unclear how the potential of FPGAs can be efficiently exploited. Our contribution with this work is to study the design trade-offs encountered when using FPGAs for data processing, as well as to provide a set of guidelines for how to make design choices such as:
Figure 4: Sorting networks for 8 elements. Dashed comparators are not used for the median.
Table 2: Comparator count C(N) and depth S(N) of different sorting networks.
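To make the structure of such a network concrete, the following C sketch spells out an even-odd merge sorting network for eight 32-bit values as a fixed sequence of compare-exchange operations (19 comparators in 6 stages, the usual figures for N = 8). The stage grouping follows the textbook even-odd merge construction [4, 20]; it is an illustration of the technique, not a transcript of the exact circuit drawn in Figure 4, and it sorts fully rather than omitting the comparators that are unnecessary for the median.

    /* Even-odd merge sorting network for N = 8, written as straight-line
       C code: 19 compare-exchange operations in 6 stages.  The sequence of
       comparisons is fixed and never depends on the data.
       CMPX(i, j) orders v[i] <= v[j]. */
    #include <stdint.h>

    #define CMPX(i, j)                                        \
        do {                                                  \
            if (v[i] > v[j]) {                                \
                uint32_t t = v[i]; v[i] = v[j]; v[j] = t;     \
            }                                                 \
        } while (0)

    static void sort8(uint32_t v[8])
    {
        CMPX(0,1); CMPX(2,3); CMPX(4,5); CMPX(6,7);   /* stage 1            */
        CMPX(0,2); CMPX(1,3); CMPX(4,6); CMPX(5,7);   /* stage 2            */
        CMPX(1,2); CMPX(5,6);                         /* stage 3: halves sorted */
        CMPX(0,4); CMPX(1,5); CMPX(2,6); CMPX(3,7);   /* stage 4: merge     */
        CMPX(2,4); CMPX(3,5);                         /* stage 5            */
        CMPX(1,2); CMPX(3,4); CMPX(5,6);              /* stage 6            */
    }

    #undef CMPX

Because the comparator sequence never depends on the data, the same straight-line structure can be kept entirely in CPU registers, which is what the even-odd merge implementation compared in Section 6.3 exploits.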
(System overview figure: PPC405 core, 128 kB on-chip BRAM, DDR DIMM RAM module and controller, interrupt controller, serial port UART, PLB/OPB buses, and the aggregation cores on the FPGA.)

Figure 6: Sliding window implementation as 8 × 32 linear shift register.
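In software terms, the sliding window of Figure 6 can be mimicked by the small C sketch below: the last eight 32-bit tuples sit in a shift register, and for every incoming tuple a copy of the window is pushed through the sort8() routine sketched earlier. Taking the median of the even-sized window as the mean of the two middle values is our assumption for illustration purposes.

    /* Sliding-window median over the last 8 tuples, mirroring the
       8 x 32-bit linear shift register of Figure 6. */
    #include <stdint.h>

    void sort8(uint32_t v[8]);     /* even-odd merge network, see sketch above */

    static uint32_t window[8];     /* the 8 x 32 shift register */

    static uint32_t median_insert(uint32_t tuple)
    {
        uint32_t sorted[8];

        for (int i = 7; i > 0; i--)        /* shift by one position ...      */
            window[i] = window[i - 1];
        window[0] = tuple;                 /* ... and insert the new tuple   */

        for (int i = 0; i < 8; i++)        /* sort a copy of the window      */
            sorted[i] = window[i];
        sort8(sorted);

        /* assumption: median of the 8-element window = mean of the two
           middle values                                                     */
        return (uint32_t)(((uint64_t)sorted[3] + sorted[4]) / 2);
    }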
Figure 9: Attachment of aggregation core through Write-FIFO and Read-FIFO queues.
Figure 8: Attachment of aggregation core through memory-mapped registers.

Information can then be sent between the aggregation core and the CPU using load/store instructions.

Configuration 1: Slave Registers. The first approach uses two 32-bit registers DATA_IN and AGG_OUT as shown in Figure 8. The IP interface is set to trigger a clock signal upon a CPU write into the DATA_IN register. This signal causes a shift in the shift register (thereby pulling the new tuple from DATA_IN) and a new data set to start propagating through the sorting network. A later CPU read instruction for AGG_OUT will then read out the newly computed aggregate value.
This configuration is simple and uses few resources. However, it has two problems: lack of synchronization and poor bandwidth usage.
In this configuration the CPU and the aggregation core are accessing the same registers concurrently with no synchronization. The only way to avoid race conditions is to add artificial time delays between the access operations.
In addition, each tuple in this configuration requires two 32-bit memory accesses (one write followed by one read). Given that the CPU and the aggregation core are connected to a 64-bit bus (and hence could transmit up to 2 × 32 bits per cycle), this is an obvious waste of bandwidth.
Configuration 2: FIFO Queues. The second configuration we explore solves the lack of synchronization by introducing FIFO queues between the CPU and the aggregation core (Figure 9). Interestingly, this is the same solution as the one adopted in data stream management systems to decouple operators.
The CPU writes tuples into the Write-FIFO (WFIFO) and reads median values from the Read-FIFO queue (RFIFO). The two queues are implemented in the IPIF using additional block RAM components (BRAM). The aggregation core independently dequeues items from the Write-FIFO queue and enqueues the median results into the Read-FIFO queue. Status registers in both queues allow the CPU to determine the number of free slots (write queue) and the number of available result items (read queue).
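A minimal CPU-side sketch of this interface follows, using the register names from Figure 9. The offsets and the exact encoding of the status registers are hypothetical placeholders; the text only states that they report the number of free slots in the write queue and the number of available results in the read queue.

    /* Configuration 2 from the CPU side: blocking enqueue/dequeue through
       the Write-FIFO and Read-FIFO.  Offsets and status encodings are
       hypothetical placeholders. */
    #include <stdint.h>

    #define CORE_BASE     0x80000000u
    #define WFIFO         (*(volatile uint32_t *)(CORE_BASE + 0x00))
    #define WFIFO_STATUS  (*(volatile uint32_t *)(CORE_BASE + 0x04))  /* free slots   */
    #define RFIFO         (*(volatile uint32_t *)(CORE_BASE + 0x08))
    #define RFIFO_STATUS  (*(volatile uint32_t *)(CORE_BASE + 0x0C))  /* result count */

    static void enqueue_tuple(uint32_t tuple)
    {
        while (WFIFO_STATUS == 0)     /* wait until the write queue has a free slot */
            ;
        WFIFO = tuple;                /* the aggregation core dequeues on its own   */
    }

    static uint32_t dequeue_median(void)
    {
        while (RFIFO_STATUS == 0)     /* wait until a median value is available     */
            ;
        return RFIFO;
    }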
This configuration avoids the need for explicit synchronization. There is still the drawback that the interface uses only 32 bits of the 64 available on the bus. The mismatch between a 64-bit access on the CPU side and a 32-bit width on the aggregation core turns out to be an inherent problem of using a general-purpose FIFO implementation (such as the one provided with the Xilinx IPIF interface). Re-implementing the FIFO functionality in user logic can remedy this deficiency, as we describe next.
Configuration 3: Master Attachment. In the previous configuration, access is through a register that cannot be manipulated in 64-bit width. Instead of using a register through a bus, we can use memory mapping between the aggregation core and the CPU to achieve a full 64-bit transfer width. The memory mapping is now done on the basis of contiguous regions rather than a single address. Two regions are needed, one for input and one for output. These memory regions correspond to local memory in the aggregation core and are implemented using BRAMs.
We can improve on this approach even further by taking advantage of the fact that the transfers to/from these regions can be offloaded to a DMA controller. We have considered two options: one with the DMA controller run by the CPU and one with the DMA controller run in (the IPIF of) the aggregation core. Of these two options, the latter one is preferable since it frees the DMA controller of the CPU to perform other tasks. In the following, we call this configuration master attachment. In Figure 10, we show all the memory-mapped registers the CPU uses to set up the transfers, although we do not discuss them here in detail for lack of space. The figure also shows the interrupt line used to notify the CPU that new results are available.
The master attachment configuration has the advantage that the aggregation core can independently initiate the write-back of results once they are ready, without having to synchronize with an external DMA controller. This reduces latency, uses the full available bandwidth, and gives the aggregation core control over the flow of data, leaving the CPU free to perform other work and thereby increasing the chances for parallelism.
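On the CPU side, the master attachment then reduces to programming a transfer and waiting for the completion interrupt, roughly as sketched below. All register names and offsets here are hypothetical, since the Figure 10 register set is not detailed; only the overall flow (program the transfer, let the DMA engine of the core move the data, get notified through the interrupt) follows the description above.

    /* Configuration 3 from the CPU side (sketch only): register names and
       offsets are hypothetical placeholders. */
    #include <stdint.h>

    #define CORE_BASE 0x80000000u
    #define SRC_ADDR  (*(volatile uint32_t *)(CORE_BASE + 0x00))  /* tuples in external memory */
    #define DST_ADDR  (*(volatile uint32_t *)(CORE_BASE + 0x04))  /* where results are written */
    #define LENGTH    (*(volatile uint32_t *)(CORE_BASE + 0x08))  /* bytes, up to 4 kB per DMA */
    #define START     (*(volatile uint32_t *)(CORE_BASE + 0x0C))

    static volatile int results_ready;        /* set by the interrupt handler */

    static void process_block(const uint32_t *in, uint32_t *out, uint32_t bytes)
    {
        SRC_ADDR = (uint32_t)(uintptr_t)in;
        DST_ADDR = (uint32_t)(uintptr_t)out;
        LENGTH   = bytes;

        results_ready = 0;
        START = 1;                            /* the core's DMA engine takes over   */

        while (!results_ready)                /* the CPU could do useful work here  */
            ;                                 /* instead of busy-waiting            */
    }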
6. EVALUATION
We evaluated the different design options described above. All experiments were done on the Xilinx XUPV2P development board. Our focus is on the details of the soft IP-core, and we abstract from effects caused, for example, by I/O (network and disks) by performing all the processing into and out of off-chip memory (512 MB DDR RAM).
(Figure: aggregation core internals, with the sorting network feeding an adder and divide-by-two stage, a sequencer/state machine, and the timing of the synchronous variant with an execution time of 6 cycles.)
(Figure 12: execution time versus data size, 16 B to 64 kB, for the slave register and DMA master attachment configurations.)
(Figure 13: wall-clock time, log scale in seconds, for the software sort implementations on the x86-64, Cell PPE, PPC G5, PPC G4, and PPC 405 platforms, with the FPGA as baseline.)
many FPGA designs to rely solely on synchronous circuits [13]. Our results indicate, however, that for data processing there are simple asynchronous designs that can significantly reduce latency (at the cost of throughput). In terms of transforming algorithms into asynchronous circuits, not all problems can be expressed in an asynchronous way. From a theoretical point of view, every problem where the output signals depend only on the input signals can be converted into an asynchronous circuit (a combinatorial circuit). The necessary circuit can be of significant size, however (while synchronous circuits may be able to re-use the same logic elements in more than one stage). A more practical criterion can be obtained by looking at the algorithm that the circuit mimics in hardware. As a rule of thumb, algorithms that require a small amount of control logic (branches or loops) and have a simple data flow pattern are the most promising candidates for asynchronous implementations.
6.2 Median Operator
We now compare two of the configurations discussed in Section 5.2 and then evaluate the performance of the complete aggregation core using the best configuration.
We compare configuration 1 (slave register) with configuration 3 (master attachment). We use maximum-sized DMA transfers (4 kB) between external memory and the FPGA block RAM to minimize the overhead spent on interrupt handling. We do not consider configuration 2 (FIFO queues) because it does not offer a performance improvement over configuration 1.
Figure 12 shows the execution time for streams of varying size up to 64 kB. While we see a linearly increasing execution time for configuration 1, configuration 3 requires a constant execution time of 96 µs for all data sizes up to 4 kB and then scales linearly with increasing data sizes (this trend continues beyond 64 kB). This is due to the latency incurred by every DMA transfer (up to 4 kB can be sent within a single transfer). These 96 µs are the total round-trip time, measured from the time the CPU writes to the control register in order to initiate the Read-DMA transfer until it receives the interrupt.
These results indicate that configuration 1 (slave registers) is best for processing small amounts of data or streams with low arrival rates. Configuration 3 (master attachment) is best for large amounts of data (greater than 4 kB) or data streams with very high arrival rates so that the tuples can be batched.
Using configuration 3, we have also measured the time it takes for the complete median operator to process 256 MB of data consisting of 4-byte tuples. It takes 6.173 seconds to process all the data at a rate of more than 10 million tuples per second. This result is shown as the horizontal line in Figure 13.

6.3 FPGA Performance in Perspective
FPGAs can be used as co-processors of data processing engines running on conventional CPUs. This, of course, presumes that using the FPGA to run queries or parts of queries does not result in a net performance loss. In other words, the FPGA must not be significantly slower than the CPU. Achieving this is not trivial because of the much slower clock rates on the FPGA.
Here we study the performance of the FPGA compared to that of CPUs when running on a single data stream. Later on we are going to consider parallelism.
To ensure that the choice of a software sorting algorithm is not a factor in the comparison, we have implemented eight different sorting algorithms in software and optimized them for performance. Seven are traditional textbook algorithms: quick sort, merge sort, heap sort, gnome sort, insertion sort, selection sort, and bubble sort. The eighth is an implementation of the even-odd merge sorting network of Section 4.1 using CPU registers.
We ran the different algorithms on several hardware platforms. We used an off-the-shelf desktop Intel x86-64 CPU (2.66 GHz Intel Core2 quad-core Q6700) and the following PowerPC CPUs: a 1 GHz G4 (MPC7457) and a 2.5 GHz G5 Quad (970MP), the PowerPC element (PPE, not the SPEs) of the Cell, and the embedded 405 core of our FPGA. All implementations are single-threaded. For illustration purposes, we limit our discussion to the most relevant subset of algorithms.
Figure 13 shows the wall-clock time observed when processing 256 MB (as 32-bit tuples) through the median sliding window operator. The horizontal line indicates the execution time of the FPGA implementation. Timings for the merge, quick, and heap sort algorithms on the embedded PowerPC core did not fit into scale (303 s, 116 s, and 174 s, respectively). All our software implementations were clearly
Intel Core 2 Q6700:
  Thermal Design Power (CPU only)          95 W
  Extended HALT Power (CPU only)           24 W
  Measured total power (230 V)            102 W
Xilinx XUPV2P development board:
  Calculated power estimate (FPGA only)    1.3 W
  Measured total power (230 V)             8.3 W

Table 3: Power consumption of an Intel Q6700-based desktop system and the Xilinx XUPV2P FPGA board used in this paper. Measured values are under load when running median computation.

cores   flip-flops    LUTs   slices        %
  0          1761     1670     1905    13.9 %
  1          3727     6431     4997    36.5 %
  2          5684    10926     7965    58.2 %
  3          7576    15597    11004    80.3 %
  4          9512    20121    13694   100.0 %

Table 4: FPGA resource usage. The entry for 0 cores represents the space required to accommodate all the necessary circuitry external to the aggregation cores (UART, DDR controller, etc.).
Figure 15: Total execution time to process multiple data streams using concurrent aggregation cores.

memory bandwidth does not further limit this number.

6.6 Parallelism: Performance
We used the four aggregation cores mentioned above to run up to four independent data streams in parallel. We ran streams of increased size over configurations with an increasing amount of cores. Figure 15 shows the wall-clock execution times for processing multiple data streams in parallel, each on a separate aggregation core. Table 5 summarizes the execution times for a stream of 64 MB (column 'FPGA').
The first important conclusion is that running additional aggregation cores has close to no impact on the other cores. The slight increase with the addition of the fourth core comes from the need to add the wait cycle mentioned above. This shows that by adding multiple cores, throughput is increased as multiple streams can be processed concurrently (Table 5). The second observation is that the execution times scale linearly with the size of the data set, as is to be expected. The flat part of the curves is the same effect observed before for stream sizes smaller than 4 kB. The graph also indicates that since each core is working on a different stream, we are getting linear scale-out in throughput.

7. SUMMARY
In this paper we have assessed the potential of FPGAs as co-processors for data-intensive operations in the context of multi-core systems. We have illustrated the type of data processing operations where FPGAs have performance advantages (through parallelism and low latency) and discussed several ways to embed the FPGA into a larger system so that the performance advantages are maximized. Our experiments show that FPGAs bring additional advantages in terms of power consumption. These properties make FPGAs very interesting candidates for acting as additional cores in the heterogeneous many-core architectures that are likely to become pervasive. The work reported in this paper is a first but important step to incorporate the capabilities of FPGAs into data processing engines in an efficient manner. The higher design costs of FPGA-based implementations may still amortize, for example, if a higher throughput (using multiple parallel processing cores as shown in the previous section) can be obtained in an FPGA-based stream processing system for a large fraction of queries.
As part of future work we intend to explore a tighter integration of the FPGA with the rest of the computing infrastructure, an issue also at the top of the list for many FPGA manufacturers. Modern FPGAs can directly interface to high-speed bus systems, such as the HyperTransport bus, or even intercept the execution pipeline of general-purpose CPUs, opening up many interesting possibilities for using the FPGA in different configurations.

Acknowledgements
We would like to thank Laura and Peter Haas for their valuable insights and help in improving the presentation of the paper. This project is funded in part by the Enterprise Computing Center of ETH Zurich (http://www.ecc.ethz.ch/).
8. REFERENCES
[1] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The Design of the Borealis Stream Processing Engine. In Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, January 2005.
[2] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, Ch. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A New Model and Architecture for Data Stream Management. The VLDB Journal, 12(2), July 2003.
[3] A. Arasu, S. Babu, and J. Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution. The VLDB Journal, 15(2), June 2006.
[4] K. E. Batcher. Sorting Networks and Their Applications. In AFIPS Spring Joint Computer Conference, 1968.
[5] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: An Operating System for Many Cores. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), San Diego, CA, USA, December 2008.
[6] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture. Proc. VLDB Endowment, 1(2), 2008.
[7] Netezza Corp. http://www.netezza.com/.
[8] D. DeWitt. DIRECT—A Multiprocessor Organization for Supporting Relational Database Management Systems. IEEE Trans. on Computers, C-28(6), June 1979.
[9] B. Gedik, R. R. Bordawekar, and P. S. Yu. CellSort: High Performance Sorting on the Cell Processor. In Proc. of the 33rd Int'l Conference on Very Large Data Bases (VLDB), Vienna, Austria, September 2007.
[10] B. T. Gold, A. Ailamaki, L. Huston, and B. Falsafi. Accelerating Database Operators Using a Network Processor. In Int'l Workshop on Data Management on New Hardware (DaMoN), Baltimore, MD, USA, June 2005.
[11] N. K. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In Proc. of the 2006 ACM SIGMOD Int'l Conference on Management of Data, Chicago, IL, USA, June 2006.
[12] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast Computation of Database Operations Using Graphics Processors. In Proc. of the 2004 ACM SIGMOD Int'l Conference on Management of Data, Paris, France, 2004.
[13] D. Greaves and S. Singh. Kiwi: Synthesis of FPGA Circuits from Parallel Programs. In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2008.
[14] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A Simultaneously Pipelined Relational Query Engine. In Proc. of the 2005 ACM SIGMOD Int'l Conference on Management of Data, Baltimore, MD, USA, June 2005.
[15] S. S. Huang, A. Hormati, D. Bacon, and R. Rabbah. Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary. In European Conference on Object-Oriented Programming, Paphos, Cyprus, July 2008.
[16] Xtreme Data Inc. http://www.xtremedatainc.com/.
[17] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In Int'l Conference on Parallel Architecture and Compilation Techniques (PACT), Brasov, Romania, September 2007.
[18] Intel Corp. Intel Core 2 Extreme Quad-Core Processor XQ6000 Sequence and Intel Core 2 Quad Processor Q600 Sequence Datasheet, August 2007.
[19] Kickfire. http://www.kickfire.com/.
[20] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 2nd edition, 1998.
[21] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing Database Architecture for the New Bottleneck: Memory Access. The VLDB Journal, 9(3), December 2000.
[22] A. Mitra, M. R. Vieira, P. Bakalov, V. J. Tsotras, and W. A. Najjar. Boosting XML Filtering Through a Scalable FPGA-based Architecture. In Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, 2009.
[23] R. Mueller, J. Teubner, and G. Alonso. Streams on Wires – A Query Compiler for FPGAs. Proc. VLDB Endowment, 2(1), 2009.
[24] K. Oflazer. Design and Implementation of a Single-Chip 1-D Median Filter. IEEE Trans. on Acoustics, Speech and Signal Processing, 31, October 1983.
[25] L. Rabiner, M. Sambur, and C. Schmidt. Applications of a Nonlinear Smoothing Algorithm to Speech Processing. IEEE Trans. on Acoustics, Speech and Signal Processing, 23(6), December 1975.
[26] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
[27] P. D. Wendt, E. J. Coyle, and N. J. Gallagher, Jr. Stack Filters. IEEE Trans. on Acoustics, Speech and Signal Processing, 34(4), August 1986.
[28] Xilinx Inc. Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, v4.2 edition, 2007.
[29] J. Zhou and K. A. Ross. Implementing Database Operations Using SIMD Instructions. In Proc. of the 2002 ACM SIGMOD Int'l Conference on Management of Data, Madison, WI, USA, June 2002.