


Merging Programming Models and On-Chip Networks to Meet the Programmable
and Performance Needs of Multi-Core Systems on a Programmable Chip

Andrew G. Schmidt, William V. Kritikos, Ron Sass
Reconfigurable Computing Systems Lab
Univ. of North Carolina at Charlotte
Email: [email protected], {andrewgschmidt,will.kritikos}@gmail.com

Erik K. Anderson and Matthew French
Information Sciences Institute
University of Southern California
Email: {erik,mfrench}@isi.edu

Abstract—The REconfigurable Data-Stream Hardware Software ARChitecture (Redsharc) is a programming model and network-on-a-chip solution designed to scale to meet the performance needs of multi-core systems on a programmable chip. Redsharc uses an abstract API that allows programmers to develop systems of simultaneously executing kernels, in software or hardware, that communicate over a seamless interface. To support high-performance systems with numerous hardware kernels, Redsharc incorporates two on-chip networks that directly mimic the API. Our results show that Redsharc, running at 200 MHz with 32-bit data widths, can achieve 800 MBps per stream from hardware to hardware, 480 MBps for a software-to-hardware stream, and between 20 MBps and 400 MBps for random access to data in blocks, regardless of the number of kernels in the system.

(The views, opinions, and/or findings contained in this article/presentation are those of the author/presenter and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.)

I. INTRODUCTION

Available FPGA resources continue to track Moore's Law. High-end chips now provide hundreds of millions of equivalent transistors in the form of reconfigurable logic, memory, multipliers, processors, and a litany of increasingly sophisticated hard IP cores. As a result, engineers are turning to Multi-Core Systems on a Programmable Chip (MCSoPC) to leverage these FPGA resources. MCSoPC allow system designers to mix hard processors, soft processors, third-party IP, and custom hardware cores all within a single FPGA.

A major challenge of MCSoPC is how to achieve inter-kernel communication without sacrificing performance. This problem is compounded by the realization that kernels may use different computational and communication models; threads running on a processor communicate much differently than cores running within the FPGA fabric. Furthermore, standard on-chip interconnects for FPGAs do not scale well and cannot be optimized for specific programming models; contention on a bus can quickly limit performance.

Redsharc's contribution has two parts. First is the introduction of an abstract programming model and API that specifically targets MCSoPC. An abstract API, as described by Jerraya in [1], allows cores to exchange data without knowing how the opposite core is implemented. In a Redsharc system, computational units, known as kernels, are implemented as either software threads running on a processor or hardware cores running in the FPGA fabric. Regardless of location, kernels communicate and synchronize using Redsharc's abstract API. Redsharc's API is based both on a streaming model that passes data using unidirectional queues and a block model that allows kernels to exchange indexed, bidirectional data. Section III explains the Redsharc API in detail.

Redsharc's second contribution is the development of fast and scalable on-chip networks to implement the Redsharc API, with special consideration for the hardware/software nature of MCSoPC. The Stream Switch Network (SSN), discussed in Section IV-A, is a runtime-reconfigurable crossbar on-chip network designed to carry streams of data between heterogeneous cores. The Block Switch Network (BSN), discussed in Section IV-B, is a routable crossbar on-chip network designed to exchange indexed data elements between cores and blocks of memory.

A series of benchmarks was run on a Xilinx ML510 development board to demonstrate Redsharc's performance. A BLAST bioinformatics kernel was also ported to a Redsharc core to demonstrate the full use of the Redsharc API and to show scalability. Results are listed in Section V.

II. RELATED WORK

The idea of creating an API to abstract communication between heterogeneous computational units has its roots in both the Adaptive Computing Systems (ACS) [2] and PipeRench [3] projects. These projects focused on processor to off-chip FPGA accelerator communication and control; they existed prior to the realization of MCSoPC.

More recently, ReconOS [4] and hthreads [5] implemented a programming model based on Pthreads to abstract the hardware/software boundary. They successfully demonstrate the feasibility of abstract programming models in heterogeneous systems. However, their middleware layer requires
too many FPGA resources to practically implement. Furthermore, their communication bandwidth is limited due to a reliance on proprietary buses. Redsharc solves these problems by developing an abstract API better suited for MCSoPC and by developing custom on-chip networks that directly support that API.

The stream model is common within FPGA programming. Projects such as [6], [7] use high-level stream APIs or stream languages with compilation tools to automatically generate MCSoPC. Unlike Redsharc, these efforts focus on a pure streaming programming model and do not include support for random access memory. Furthermore, these models do not permit the mixing of heterogeneous hard processors, soft processors, or custom FPGA kernels as may be needed to meet stringent application demands.

Researchers have also customized streaming networks to communicate between heterogeneous on-chip cores using abstract interfaces. For example, aSOC [8] is notable for communication between heterogeneous computational units, but targets ASICs instead of FPGAs. Finally, SIMPPL [9] developed an FPGA-based streaming network but does not consider heterogeneous computational units. Moreover, these existing on-chip networks only support pure streaming applications and do not include support for random access memory.
III. REDSHARC API

Developing an abstract API for MCSoPC is not a trivial task. The API must support parallel operations, be realizable in all computational domains, be resource friendly, and be flexible enough to incorporate a large number of application domains.

To find the middle ground among these conflicting requirements, Redsharc is based on the Stream Virtual Machine API (SVM) [10]. SVM was originally designed as an intermediate language between high-level stream languages and the low-level instruction sets of various architectures being developed by the DARPA PCS program. SVM has no preference as to the computational model and only specifies how kernels communicate with each other. SVM is primarily based on the relatively simple stream model, but additionally includes support for communicating with blocks, or random-access chunks of data. These features make it an ideal candidate for porting to MCSoPC.

The Redsharc API recognizes two types of kernels: worker and control. There is only one control kernel in the system, and it is responsible for creating and managing streams, blocks, and worker kernels. There may be multiple worker kernels that perform functions on data elements presented in streams or blocks. A stream is a unidirectional flow of data elements between two kernels. A block is an indexed set of data elements shared between two or more kernels. In Redsharc, a data element can be any variable of size 2^n.
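For illustration only, these four variable types can be pictured as opaque C handles. The paper does not define their layout, so the sketch below is an assumption, not the actual SWKI header:

/* Hypothetical sketch of the opaque Redsharc types used in
 * Tables I and II; the real definitions live inside the API
 * implementations and are not given in the paper. */
typedef struct kernel  kernel;   /* a worker or control computation  */
typedef struct stream  stream;   /* unidirectional queue of elements */
typedef struct block   block;    /* indexed, shared set of elements  */
typedef struct element element;  /* one data word of size 2^n        */

/* Table I passes arrays of streams and blocks; we assume the
 * plural parameter types are simple aliases for illustration. */
typedef stream streams;
typedef block  blocks;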
The control kernel API, in C syntax, is presented in Table I. The symbols kernel, stream, block, and element are variable types in Redsharc. The control kernel creates variables of these types and initializes them for use during runtime. In Redsharc, the streams and blocks a kernel communicates with are set at runtime with the kernelInit command. A dependency is used in conjunction with a block to allow one kernel to write elements to a block and to prohibit any reading kernel from starting until the write is complete. Only the control kernel is aware of each worker kernel's location, either hardware or software. Due to this added complexity, control kernels may only be implemented in software.

Table I: Redsharc control kernel's API calls

kernelInitNull(kernel *k): Initializes k to a default state.
kernelInit(kernel *k, streams *s, blocks *b): Configures the streams in array s and blocks in array b for communication with k.
kernelAddDependence(kernel *k1, kernel *k2): Makes k2 dependent on the completion of k1.
kernelRun(kernel *k): Adds k to the list of schedulable kernels.
kernelPause(kernel *k): Removes k from the list of schedulable kernels.
streamInitFifo(stream *s): Initializes s.
blockInit(block *b): Initializes b.
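As a concrete illustration, a control kernel that wires a producer to a consumer through one stream and one shared block might look like the following. This is a minimal sketch using only the Table I calls; the header name, the assumption that the header completes the opaque types, and the one-stream/one-block topology are ours, not from the paper:

/* Hypothetical control-kernel setup using the Table I API. */
#include "redsharc.h"   /* hypothetical SWKI header */

kernel producer, consumer;
stream s[1];
block  b[1];

void control_main(void)
{
    streamInitFifo(&s[0]);            /* create the stream            */
    blockInit(&b[0]);                 /* create the shared block      */

    kernelInitNull(&producer);
    kernelInitNull(&consumer);
    kernelInit(&producer, s, b);      /* bind stream and block to the */
    kernelInit(&consumer, s, b);      /* producer and the consumer    */

    /* consumer must not start reading b until producer completes */
    kernelAddDependence(&producer, &consumer);

    kernelRun(&producer);             /* make both schedulable        */
    kernelRun(&consumer);
}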

Table II: Redsharc worker kernel’s API calls


Worker’s API Call Description
void kernelEnd() Indicates the kernel has completed its work.
void streamPush(element *e, stream *s) Pushes e onto a stream s.
void streamPushWithEOS(element *e, stream *s) Pushes e onto a stream s with an end of stream indicator.
void streamPushMulticast(element *e, stream *s, ...) Pushes e to multiple streams.
void streamPushMulticastWithEOS(element *e, stream *s, ...) Pushes e to multiple streams with an end of stream indicator
void streamPop(element *e, stream *s) Pops the top element from s and stores the value in e.
void streamPeek(element *e, stream *s) Reads the top element from s storing the value in e without popping.
int streamPeekEOS(stream *s) Returns 1 if the top element in s has an end of stream indicator.
void blockWrite(element *e, int index, block *b) Writes e to b at index
void blockRead(element *e, int index, block *b) Reads the value of b at index and stores results in e.
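A worker kernel then reduces to a loop over the Table II calls. The following software worker is a sketch: how a worker obtains its bound stream and block handles is not shown in the paper, so passing them as arguments is an assumption, as is the existence of a transform step on the opaque element payload:

/* Hypothetical worker kernel: read a coefficient from a block,
 * transform each input element, forward it, and propagate the
 * end-of-stream marker before signaling completion. */
void worker_main(stream *in, stream *out, block *coeffs)
{
    element e, c;

    blockRead(&c, 0, coeffs);          /* coefficient at index 0     */
    while (!streamPeekEOS(in)) {       /* until EOS reaches the top  */
        streamPop(&e, in);             /* take the next element      */
        /* ... transform e using c (element layout is opaque) ... */
        streamPush(&e, out);           /* forward the result         */
    }
    streamPop(&e, in);                 /* consume the EOS element    */
    streamPushWithEOS(&e, out);        /* pass EOS downstream        */
    kernelEnd();                       /* report to the control kernel */
}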

API implementations must be developed for both software and hardware kernels. The software API layer, known as the software kernel interface (SWKI), is implemented as a traditional software library; user code links against the library when generating executable files. Different processors (hard or soft) may implement the SWKI in different ways.

The hardware API layer, known as the hardware kernel interface (HWKI), is implemented as a VHDL entity that is included during synthesis. There is one HWKI for each hardware kernel. The HWKI is a thin wrapper that connects hardware kernels to the SSN and BSN. As described in more detail in Section IV, the HWKI implements the Redsharc stream API as a series of interfaces similar to FIFO ports, and the Redsharc block API as a series of interfaces similar to BRAM ports.

IV. REDSHARC'S NETWORKS ON A CHIP

The Stream Switch Network (SSN) and Block Switch Network (BSN) were necessitated by performance and scalability. Programming models built on top of existing general-purpose on-chip networks such as IBM's CoreConnect [11] must translate higher-level API calls into lower-level network procedures, often with large overhead. For example, even a simple streamPeek() operation may require two bus read transactions: the first to check that the stream is not empty, the second to retrieve the value of the top element. In previous work, Liang [8] showed that custom on-chip networks can outperform general-purpose networks. Therefore, the SSN and BSN were designed and implemented specifically to support the Redsharc API, thereby improving scalability and performance.

A. Stream Switch Network

The Redsharc Stream Switch Network is a dedicated on-chip network designed to transport streams of data between communicating kernels. The SSN connects to all hardware kernels via the HWKI and to the processor via DMA controllers. Software kernels communicate with the SSN through the SWKI, which uses DMA descriptors to send and receive data. The SSN is composed of a full crossbar switch, configuration registers, FIFO buffers, and flow control logic.

The SSN's crossbar switch is implemented using a parameterized VHDL model which sets the number of inputs and outputs to the switch at synthesis time. Each output port of the crossbar is implemented as a large multiplexer in the FPGA fabric. The inputs and outputs of the switch are buffered with FIFOs, allowing each hardware kernel to run within its own clock domain. Data is transferred from a kernel's output FIFO to another kernel's input FIFO whenever data is available in the output FIFO and room exists in the input FIFO. The flow control signals used in the SSN are based on the LocalLink specification [12]. The SSN's internal data width is set at synthesis time and defaults to 32 bits. To fulfill the Redsharc API requirement that each stream may have a variable width, the FIFOs on the edges of the SSN translate the stream's data between the 32-bit internal width and the kernel's data width requirement. The SSN runs by default at 200 MHz.

[Figure 1: Stream Switch Network, showing the processor block with RX/TX DMA interfaces, the bus-accessible switch configuration registers, the SSN crossbar, and hardware kernels 0 through N attached through input and output FIFOs]

Figure 1 shows the streams and kernels in an SSN system. The input streams are shown on the left side of the figure flowing into hardware kernels, with the output streams on the right. Hardware kernels may have any number of input and output streams. Also shown are the streams connecting the processor's DMA controller ports, which connect directly to the switch, along with the switch configuration registers, which are accessible from the system bus.

Hardware kernels are presented with standard FIFO interfaces as their Redsharc stream API implementation. While abstracted from hardware developers, these FIFOs are directly part of the SSN.
Software kernels communicate through the SWKI library. The SWKI in turn communicates with hardware kernels by interacting with the DMA controllers, which can read or write data from off-chip memory into a LocalLink port. The SWKI's stream API sends the DMA engine a pointer and length describing the location and amount of data to send or receive, and an interrupt occurs when the DMA transaction has finished. Using the DMA controllers, the processor is most efficient when there is a large amount of data to push from software to hardware or vice versa, which is often the case with streaming applications. The SWKI is also responsible for configuring the SSN's switch to connect the processor with the receiving or transmitting hardware kernel.
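Put together, a software-side stream send reduces to routing the crossbar, submitting one descriptor, and blocking on the completion interrupt. The sketch below assumes hypothetical helper names (ssn_connect, dma_submit_tx, wait_for_dma_interrupt) and a hypothetical descriptor layout; the paper does not expose the SWKI internals:

/* Hypothetical sketch of the SWKI send path described above. */
#include <stddef.h>

struct dma_descriptor { void *buf; size_t len; };

void swki_stream_send(stream *s, void *buf, size_t len)
{
    struct dma_descriptor d = { buf, len };

    ssn_connect(DMA_TX_PORT, stream_output_port(s)); /* route crossbar */
    dma_submit_tx(&d);                               /* start transfer */
    wait_for_dma_interrupt();                        /* completion IRQ */
}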
An advantage of the crossbar switch is that multicast stream pushes are possible by simply having several output ports read data from the same input port. The control kernel can change the switch configuration at runtime to modify application functionality or for load-balancing optimizations.
B. Block Switch Network

The purpose of the Block Switch Network is to implement the low-level communication needed by the Redsharc block API. Redsharc blocks are created at synthesis time but allocated at runtime by the control kernel. Blocks may be specified to be on-chip, as part of a HWKI, or a section of off-chip volatile memory. The BSN is implemented as a routable crossbar switch and permits access from any kernel to any block.

[Figure 2: Block Switch Network's router, consisting of a full crossbar switch, a routing module (RM) per input port, and a single switch controller used to configure the inputs to the requested outputs]

Figure 2 shows the BSN's router. The router consists of a full crossbar switch, a switch controller, and routing modules. When a kernel requests data, the routing module decodes the address request and notifies the switch controller. The switch controller checks the availability of the output port and, if it is not in use, configures the switch. The input ports, shown on the left side of the figure, are connected to the outputs of the hardware kernels; the output ports on the right connect to the inputs of the hardware kernels. Hardware kernels can have any number of local, remote, and off-chip memory ports, giving the kernel a higher degree of parallelism in memory access.

"Local" blocks, commonly implemented as BRAM, are accessible to a hardware kernel with low latency (approximately 2 clock cycles). They are instantiated as part of a hardware kernel's HWKI. A "remote" block is on-chip memory that is located in a different hardware kernel's HWKI but is accessible through the BSN. This is made possible through dual-ported BRAMs: one port is dedicated to the local hardware kernel, and the second port is dedicated to the BSN. "Off-chip" blocks are allocated in volatile off-chip RAM (e.g., DDR2 SDRAM). The BSN communicates directly with the memory controller. While hardware kernels still share this connection for block communication, requests are serialized at the last possible point, helping to improve overall performance.

The BSN abstracts away the different block types to provide hardware developers with a common interface via the HWKI. The HWKI block interface is an extension of the common BRAM interface with the addition of "ready for data," "valid," "request size," and "done" signals. The added signals offer the ability to burst larger amounts of data into and out of memory. The ready-for-data and valid signals are necessary because requests to remote and off-chip memory may take an undetermined amount of time.

The BSN also abstracts data (element) sizes away from the kernel developer. In conventional bus-based systems the developer may need to access 32-bit data from one location and 128-bit data from a second location; as a result, the developer must be aware of the amount of data needed for each request type. In the Redsharc system, the BSN gives the developer the exact data size needed and handles the internal transfers accordingly. With the BSN a developer can still transfer 128-bit data, but instead of actively transmitting four 32-bit words, only a single transaction is required. The BSN still transfers all 128 bits; however, it does so internally as a burst of four 32-bit words.
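In code, this width abstraction means a wide element is still a single call. The sketch below assumes a block configured with 128-bit elements; the declarations and index are illustrative:

/* One logical access to a 128-bit element: the caller issues a
 * single blockRead/blockWrite and the BSN internally bursts four
 * 32-bit words. Declarations are illustrative. */
extern block *remote_b;   /* a block configured with 128-bit elements */
element wide;             /* one 128-bit element                      */

void update_entry(void)
{
    blockRead(&wide, 7, remote_b);    /* single request for index 7 */
    /* ... modify all 128 bits of 'wide' ... */
    blockWrite(&wide, 7, remote_b);   /* single write-back          */
}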
Another advantage is that a block's location can be moved (e.g., from a local block to an off-chip block) at synthesis time based on available resources, without requiring the kernel developer to redesign their code. To the kernel developer the interface remains fixed; only the connection within the HWKI changes.

Software kernels access blocks through the SWKI. If the block is located off-chip, the SWKI routes the request through the PPC's memory port (avoiding bus arbitration). If the block is located within the BSN, the request is translated into memory-mapped commands sent to the BSN over the PLB.
V. RESULTS

The following results were taken from Redsharc systems running on a Xilinx ML510 development board, which includes a Virtex-5 FX130T. The PowerPC 440 processor block was used, and the SWKI interacts directly with the LocalLink DMA controller TX/RX ports.

A. Performance

The SSN's performance is measured by the bandwidth between two kernels. While the HWKI abstracts these issues from the hardware developer, the achievable bandwidth depends on 1) the width of the stream, 2) the width of the SSN, and 3) the clock frequencies of the SSN and kernels. Between two hardware kernels running at 200 MHz with a 32-bit data width, the SSN's bandwidth is 800 MB/s (one 4-byte word per cycle at 200 MHz). When the receiving or transmitting kernel is software, the bandwidth is limited by the throughput of the DMA controller and by the message size. Figure 3 shows the measured bandwidth for different stream lengths between hardware kernels and software kernels. The SSN performs best with large message sizes.

[Figure 3: Stream bandwidth across hardware and software; SSN DMA transfer rates in MBytes/sec versus message size in bytes, for software-to-hardware and hardware-to-software streams]

The BSN's performance is measured by the latency and bandwidth of a read operation. Similar to the SSN, the Redsharc API abstracts synthesis-time configurations that may affect the bandwidth of a specific system. The settings that affect the BSN's bandwidth are 1) the location and width of the data, 2) the operating frequencies of the hardware kernel and BSN (clock domain crossing adds overhead), and 3) possible contention at the remote block. Table III provides an overview of the BSN's performance for the three types of data locality, with both the BSN and hardware kernels on the same 100 MHz clock. The BSN performs very favorably compared to the PLB which, in previous work, has a measured peak performance of 25 MB/s for a 32-bit read transaction and 200 MB/s for a 1024-bit transaction [13].

Table III: BSN latency and bandwidth for each block type

Block Type    32b Data Width     1024b Data Width
local         2cc (200 MB/s)     2cc (6400 MB/s)
remote        16cc (25 MB/s)     47cc (272 MB/s)
off-chip      25cc (16 MB/s)     56cc (228 MB/s)

B. Scalability

In order to demonstrate Redsharc's scalability, a BLAST bioinformatics kernel was implemented. BLAST is a bioinformatics application that performs DNA comparisons: it compares one short DNA sequence, called the query, against a long database of DNA sequences and produces a stream of indexes where the two sequences match. Researchers have implemented BLAST on FPGAs to demonstrate impressive speedups [14]. The BLAST hardware kernel is a reimplementation of this algorithm. It uses one input stream for the input database, one output stream for the results, and three blocks (one local and two off-chip) to store query information. Each BLAST kernel added to the system can run one additional query in parallel.

For comparison, a constant work size of eight queries and one database is used to test the system. The system with a single BLAST kernel must evaluate the queries sequentially, while the system with eight kernels can evaluate them in parallel. This test shows whether the Redsharc system can scale with an increasing number of hardware kernels. Table IV shows the execution time results. There is no speedup in the time to load the queries into the blocks or read the result data back from the kernels; these operations are entirely sequential and offer no possible speedup. A linear speedup is observed in the time spent comparing the database to the queries. Note that with eight queries running in parallel, the BSN must handle the increased load for the query information stored in off-chip memory blocks.
Table IV: Performance of one to eight BLAST cores running in a Redsharc system

BLAST Cores   Load Queries   Speedup   BLAST Runtime   Speedup   Read Results   Speedup   Total Time    Total Speedup
1             1643.94 µs     1×        9013.1 µs       1×        49.49 µs       1×        10706.53 µs   1×
2             1641.16 µs     1×        4512.16 µs      2×        41.74 µs       1.19×     6195.06 µs    1.73×
4             1639.78 µs     1×        2265.38 µs      3.98×     38.73 µs       1.28×     3943.90 µs    2.71×
8             1638.65 µs     1×        1134.11 µs      7.95×     36.4 µs        1.36×     2809.16 µs    3.81×
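As a consistency check (our arithmetic, not a result reported in the paper), the Table IV totals follow an Amdahl-style model in which the load and read phases are serial and only the BLAST compare phase parallelizes across N cores:

T(N) \approx T_{\mathrm{load}} + \frac{T_{\mathrm{BLAST}}(1)}{N} + T_{\mathrm{read}},
\qquad
T(8) \approx 1643.94\,\mu\mathrm{s} + \frac{9013.1\,\mu\mathrm{s}}{8} + 49.49\,\mu\mathrm{s} \approx 2820\,\mu\mathrm{s}

predicting a total speedup of roughly 10706.53/2820, about 3.8×, in line with the measured 2809.16 µs and 3.81× total speedup.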

C. Resource Utilization

This subsection presents the sizes of the BSN, SSN, and HWKI, the three critical hardware components that comprise a Redsharc system. Figure 4 illustrates the BSN's and SSN's usage of lookup tables (LUTs) and flip-flops (FFs). Note that the SSN is purely combinatorial and as a result has no flip-flops. The BSN numbers include the routing module logic and switch controller, which increase the resource count. Overall, the resources used consume a small portion of the available resources on medium- to large-scale FPGA devices. While a bus presents a smaller resource footprint, as a trade-off the dual switches provide the significant bandwidth necessary to satisfy the type of high-performance applications targeted by this research.

[Figure 4: Block Switch and Stream Switch Network resource utilization in terms of Lookup Tables (LUTs) and Flip-Flops (FFs), plotted against switch radix]

The HWKI supports access to a variable number of streams and blocks with variable data element sizes. As such, we present the resources required for each additional stream or block and assume 32-bit data widths for all ports. For the SSN, only a LUT is required for each input and output port to drive the Xilinx LocalLink signals and the input and output stream FIFOs. The FIFO depth is configurable by the hardware developer, so the number of BRAMs is variable. For the BSN, more logic is needed to support local and remote block requests: each local block requires 176 flip-flops and 300 LUTs, whereas each remote block requires only 161 flip-flops and 163 LUTs. These represent a minimal amount of resources needed to support high-bandwidth memory transactions while maintaining a common memory interface to the hardware kernel.

VI. CONCLUSION

Programming MCSoPC that span hardware and software is not a trivial task. While abstract programming models have been shown to ease the programmer's burden of crossing the hardware/software boundary, their abstraction layers incur a heavy performance penalty. Redsharc solves this problem by merging an abstract programming model with on-chip networks that directly implement the programming model.

The Redsharc API is based on a streaming programming model but also incorporates random-access blocks of memory. Two on-chip networks were implemented to facilitate the stream and block API calls. Our results showed that the SSN and BSN have bandwidth comparable to state-of-the-art technology and scale nearly linearly with parallel hardware kernels. Ergo, programmers and system architects may develop heterogeneous systems that span the hardware/software domain, using a seamless abstract API, without giving up the performance of custom interfaces.

REFERENCES

[1] A. A. Jerraya and W. Wolf, "Hardware/software interface codesign for embedded systems," Computer, vol. 38, no. 2, pp. 63-69, 2005.
[2] M. Jones et al., "Implementing an API for distributed adaptive computing systems," in FCCM 1999. IEEE Computer Society, 1999, p. 222.
[3] R. Laufer et al., "PCI-PipeRench and the SWORDAPI: a system for stream-based reconfigurable computing," in FCCM 1999, 1999, pp. 200-208.
[4] E. Lubbers and M. Platzner, "ReconOS: an RTOS supporting hard- and software threads," in International Conference on Field Programmable Logic and Applications (FPL), pp. 441-446, Aug. 2007.
[5] D. Andrews et al., "Achieving programming model abstractions for reconfigurable computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 1, pp. 34-44, Jan. 2008.
[6] M. B. Gokhale et al., "Stream-oriented FPGA computing in the Streams-C high level language," in FCCM 2000. IEEE Computer Society, 2000, p. 49.
[7] D. Unnikrishnan et al., "Application specific customization and scalability of soft multiprocessors," in FCCM 2009, April 2009, pp. 123-130.
[8] J. Liang et al., "An architecture and compiler for scalable on-chip communication," IEEE Trans. Very Large Scale Integr. Syst., vol. 12, no. 7, pp. 711-726, 2004.
[9] L. Shannon and P. Chow, "Simplifying the integration of processing elements in computing systems using a programmable controller," in FCCM 2005. IEEE Computer Society, 2005, pp. 63-72.
[10] P. Mattison and W. Thies, "Streaming virtual machine specification, version 1.2," technical report, January 2007.
[11] 128-Bit Processor Local Bus Architecture Specifications, Version 4.7, IBM.
[12] Xilinx, "LocalLink interface specification," www.xilinx.com/products/design_resources/conn_central/locallink_member/sp006.pdf.
[13] A. Schmidt and R. Sass, "Characterizing effective memory bandwidth of designs with concurrent high-performance computing cores," in FPL 2007, Aug. 2007, pp. 601-604.
[14] S. Datta and R. Sass, "Scalability studies of the BLASTn scan and ungapped extension functions," in ReConFig 2009, 2009.
