Merging Programming Models and On-Chip Networks To
Andrew G. Schmidt, William V. Kritikos, Ron Sass
Reconfigurable Computing Systems Lab
Univ. of North Carolina at Charlotte
Email: [email protected], {andrewgschmidt,will.kritikos}@gmail.com

Erik K. Anderson and Matthew French
Information Science Institute
University of Southern California
Email: {erik,mfrench}@isi.edu
Figure 3: Stream Bandwidth Across Hardware and Software (x-axis: Message Size (Bytes))

includes a Virtex 5 FX 130T. The PowerPC 440 processor block was used and the SWKI interacts directly with the LocalLink DMA controller TX/RX ports.

A. Performance

The SSN's performance is measured by the bandwidth between two kernels. While the HWKI abstracts these issues from the hardware developer, these metrics depend on 1) the width of the stream, 2) the width of the SSN, and 3) the clock frequency of the SSN and kernels. Between two hardware kernels running at 200 MHz with a 32-bit data width, the SSN's bandwidth is 800 MB/s (4 bytes per cycle at 200 MHz). When the receiving or transmitting kernel is software, the bandwidth is limited by the throughput of the DMA controller and the message size. Figure 3 shows the measured bandwidth for different stream lengths between hardware kernels and software kernels. The SSN performs best with large message sizes.

The BSN's performance is measured by the latency and bandwidth of a read operation. Similar to the SSN, the Redsharc API abstracts synthesis-time configurations that may affect the bandwidth of a specific system. The settings that affect the BSN's bandwidth are 1) location and width of the data, 2) operating frequency of the hardware kernel and BSN (clock domain crossing adds overhead), and 3) possible contention at the remote block. Table III provides an overview of the BSN performance for the three types of data locality, given both the BSN and hardware kernels on the same 100 MHz clock. The BSN performs very favorably compared to the PLB which, in previous work, has a measured peak performance of 25 MB/s [13] for a 32-bit read transaction and 200 MB/s for a 1024-bit transaction.

B. Scalability

In order to demonstrate Redsharc's scalability, a BLAST bio-informatics kernel was implemented. BLAST is a bio-informatics application that performs DNA comparisons. BLAST compares one short DNA sequence, called the query, against a long database of DNA sequences, and produces a stream of indexes where the two sequences match. Researchers have implemented BLAST on FPGAs to demonstrate impressive speedups [14]. The BLAST hardware kernel is a reimplementation of this algorithm. It uses one input stream for the input database, one output stream for the results, and three blocks (one local and two off-chip) to store query information. Each BLAST kernel added to the system can run one query in parallel.

For comparison, a constant work size of eight queries and one database is used to test the system. The system with a single BLAST kernel must evaluate the queries sequentially, while the system with eight kernels can evaluate them in parallel. This test shows whether the Redsharc system can scale with an increasing number of hardware kernels. Table IV shows the execution time results. There is no meaningful speedup in the time to load the queries into the blocks or to read the result data back from the kernels (at most 1.36× for reading results), because these operations are sequential. A nearly linear speedup is observed in the time spent comparing the database to the query (7.95× with eight kernels). Note that with eight queries running in parallel the BSN must handle the increasing load for the query information stored in off-chip memory blocks.

C. Resource Utilization

This subsection presents the sizes of the BSN, SSN, and HWKI, the three critical hardware components that comprise a Redsharc system. Figure 4 illustrates the BSN and SSN's usage of Lookup Tables and Flip-Flops. Note that the SSN is purely combinatorial and as a result has no flip-flops. The BSN numbers include the routing module logic and switch controller, which increases the resource count. Overall, the resources used consume a small portion of the available resources on medium to large scale FPGA devices. While a bus presents a smaller resource footprint, as a trade-off the dual switches provide significant bandwidth that is necessary to satisfy the type of high performance applications targeted by this research.

The HWKI supports access to a variable number of streams and blocks with variable data element sizes. As such, we present the resources required for each additional stream or block and assume 32-bit data widths for all ports. For the SSN, only a LUT is required for each input and output port to drive the Xilinx LocalLink signals and the input
Table IV: Performance of one to eight BLAST cores running in a Redsharc system

BLAST Cores   Load Queries   Speedup   BLAST Runtime   Speedup   Read Results   Speedup   Total Time    Total Speedup
1             1643.94 µs     1×        9013.1 µs       1×        49.49 µs       1×        10706.53 µs   1×
2             1641.16 µs     1×        4512.16 µs      2×        41.74 µs       1.19×     6195.06 µs    1.73×
4             1639.78 µs     1×        2265.38 µs      3.98×     38.73 µs       1.28×     3943.90 µs    2.71×
8             1638.65 µs     1×        1134.11 µs      7.95×     36.4 µs        1.36×     2809.16 µs    3.81×
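
The speedup columns of Table IV can be re-derived from the measured times. The following Python sketch is illustrative only and is not part of the original study: the names `TABLE_IV`, `total_speedup`, `runtime_speedup`, and `amdahl_estimate` are hypothetical, and the Amdahl's-law estimate assumes that query loading and result reading stay fixed at their single-kernel cost while only the BLAST runtime is divided among kernels.

```python
# Times in microseconds copied from Table IV:
# cores -> (load queries, BLAST runtime, read results).
TABLE_IV = {
    1: (1643.94, 9013.10, 49.49),
    2: (1641.16, 4512.16, 41.74),
    4: (1639.78, 2265.38, 38.73),
    8: (1638.65, 1134.11, 36.40),
}


def total_time(cores):
    """Sum of the three measured phases for a given kernel count."""
    return sum(TABLE_IV[cores])


def total_speedup(cores):
    """End-to-end speedup relative to the single-kernel system."""
    return total_time(1) / total_time(cores)


def runtime_speedup(cores):
    """Speedup of the BLAST comparison phase only."""
    return TABLE_IV[1][1] / TABLE_IV[cores][1]


def sequential_fraction():
    """Share of single-kernel time spent in the sequential load/read phases."""
    load, _, read = TABLE_IV[1]
    return (load + read) / total_time(1)


def amdahl_estimate(cores):
    """Amdahl's-law estimate (an assumption of this sketch, not a source result)."""
    s = sequential_fraction()
    return 1.0 / (s + (1.0 - s) / cores)


if __name__ == "__main__":
    for cores in sorted(TABLE_IV):
        print(
            f"{cores} core(s): runtime speedup {runtime_speedup(cores):.2f}x, "
            f"total speedup {total_speedup(cores):.2f}x, "
            f"Amdahl estimate {amdahl_estimate(cores):.2f}x"
        )
```

Under these assumptions, the roughly 16% of single-kernel time spent loading queries and reading results is what limits the total speedup to well below the near-linear speedup of the comparison phase.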
Figure 4: Block Switch and Stream Switch Network Resource Utilization in terms of Lookup Tables (LUTs) and Flip-Flops (FFs). Plot title: BSN and SSN Resource Usage; x-axis: Radix; series: BSN used LUTs, SSN used LUTs, BSN used FFs, SSN used FFs.

and output stream FIFOs. The FIFO depth is configurable by the hardware developer so the number of BRAMs is variable. For the BSN more logic is needed to support local and remote block requests. Each local block requires 176 Flip-Flops and 300 LUTs whereas each remote block only requires 161 Flip-Flops and 163 LUTs. These represent a minimal amount of resources needed to support the high bandwidth memory transactions while maintaining a common memory interface to the hardware kernel.

VI. CONCLUSION

Programming MCSoPC that span hardware and software is not a trivial task. While abstract programming models have been shown to ease the programmer burden of crossing the hardware/software boundary, their abstraction layer incurs a heavy burden on performance. Redsharc solves this problem by merging an abstract programming model with on-chip networks that directly implement the programming model.

The Redsharc API is based on a streaming programming model but also incorporates random access blocks of memory. Two on-chip networks were implemented to facilitate the stream and block API calls. Our results showed that the SSN and BSN have comparable bandwidth to state of the art technology and scale nearly linearly with parallel hardware kernels. Ergo, programmers and system architects may develop heterogeneous systems that span the hardware/software domain, using a seamless abstract API, without giving up performance of custom interfaces.

REFERENCES

[1] ... codesign for embedded systems,” Computer, vol. 38, no. 2, pp. 63–69, 2005.

[2] M. Jones et al., “Implementing an api for distributed adaptive computing systems,” in FCCM 1999. IEEE Computer Society, 1999, p. 222.

[3] R. Laufer et al., “PCI-piperench and the swordapi: a system for stream-based reconfigurable computing,” in FCCM 1999, 1999, pp. 200–208.

[4] E. Lubbers and M. Platzner, “Reconos: An rtos supporting hard-and software threads,” International Conference on Field Programmable Logic and Applications (FPL), pp. 441–446, 27-29 Aug. 2007.

[5] D. Andrews et al., “Achieving programming model abstractions for reconfigurable computing,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 1, pp. 34–44, Jan. 2008.

[6] M. B. Gokhale et al., “Stream-oriented fpga computing in the streams-c high level language,” in FCCM 2000. IEEE Computer Society, 2000, p. 49.

[7] D. Unnikrishnan et al., “Application specific customization and scalability of soft multiprocessors,” in FCCM 2009, April 2009, pp. 123–130.

[8] J. Liang et al., “An architecture and compiler for scalable on-chip communication,” IEEE Trans. Very Large Scale Integr. Syst., vol. 12, no. 7, pp. 711–726, 2004.

[9] L. Shannon and P. Chow, “Simplifying the integration of processing elements in computing systems using a programmable controller,” in FCCM 2005. IEEE Computer Society, 2005, pp. 63–72.

[10] P. Mattison and W. Thies, “Streaming virtual machine specification, version 1.2, technical report,” January 2007.

[11] 128-Bit Processor Local Bus Architecture Specifications, Version 4.7 ed., IBM.

[12] Xilinx, “Locallink interface specification,” www.xilinx.com/products/design resources/conn central/locallink member/sp006.pdf.

[13] A. Schmidt and R. Sass, “Characterizing effective memory bandwidth of designs with concurrent high-performance computing cores,” in FPL 2007, Aug. 2007, pp. 601–604.

[14] S. Datta and R. Sass, “Scalability studies of the blastn scan and ungapped extension functions,” in ReConFig 2009, 2009.