VENICE: A Compact Vector Processor for FPGA Applications
Figure 1. Block diagram of the VENICE architecture: Nios II/f CPU with D$ and custom instruction port, 2kB - 2MB scratchpad, and DDR2 main memory accessed over the Altera Avalon fabric.

VENICE is designed for maximum throughput with a small number (1 to 4) of ALUs. By increasing clock speed and ALU utilization, it achieves better performance per logic block (speedup per ALM) than the previous best SVP, VEGAS [1], and 5.2× better than Altera's fastest scalar processor, the Nios II/f. As a result, less area is needed to achieve a fixed level of performance, reducing device cost or allowing more room to be left for other application logic.
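For concreteness, speedup per ALM is simply the measured speedup divided by the relative area, both normalized to the Nios II/f; the peak figures reported later in this section are consistent with the 5.2× factor quoted above:

    speedup per ALM = (speedup vs. Nios II/f) / (ALMs vs. Nios II/f) ≈ 20.6 / 4.0 ≈ 5.2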
Programming VENICE requires little specialized knowledge, utilizing the C programming language with simple extensions. Changes to algorithms require a simple recompile, taking a few seconds rather than the minutes or hours needed for FPGA synthesis. In particular, the removal of separate vector address registers (addresses are held in scalar registers instead), the streamlining of instructions, and support for full-speed unaligned operations make VENICE easier to program than VEGAS. In addition, a compiler from Microsoft's Accelerator language to VENICE C code has been developed [2].
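As an illustration of these C extensions, the sketch below scales a small array using scratchpad allocation, DMA, and a single vector instruction. The function and macro names (vector_malloc, vector_dma_to_vector, vector_set_vl, a vector() macro with an SVW scalar-vector word operation, and the sync calls) are illustrative assumptions about the VENICE programming interface rather than a verified listing:

    /* Illustrative sketch only: API names and operand ordering are assumed. */
    #include "vector.h"

    int main()
    {
        int A[] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        const int num_bytes = sizeof(A);

        /* Allocate scratchpad space and DMA the input data in. */
        int *va = (int *) vector_malloc( num_bytes );
        vector_dma_to_vector( va, A, num_bytes );
        vector_wait_for_dma();

        /* Set the vector length, then compute va[i] = 2 * va[i]. */
        vector_set_vl( 8 );
        vector( SVW, VMUL, va, 2, va );
        vector_instr_sync();

        /* DMA the results back to main memory and release the scratchpad. */
        vector_dma_to_host( A, va, num_bytes );
        vector_wait_for_dma();
        vector_free();
        return 0;
    }

A change to code like this needs only a recompile with the Nios II toolchain; no FPGA resynthesis is involved.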
A block diagram of the VENICE (Vector Extensions to NIOS Implemented Compactly and Elegantly) architecture is shown in Figure 1. The VENICE vector engine implements a wide, double-clocked scratchpad memory which holds all vector data. Operating concurrently with the vector ALUs and the Nios core (to hide latency), a DMA engine transfers data between the scratchpad and main memory. There are a configurable number of vector lanes (32-bit vector ALUs) that provide scalable parallelism. Each 32-bit ALU supports subword SIMD operations on halfwords or bytes, thus doubling or quadrupling the parallelism available with these smaller data types, respectively.
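To show how the concurrent DMA engine and subword ALUs might be exercised from C, the sketch below double-buffers byte data through the scratchpad so that the transfer of the next chunk overlaps with computation on the current one. It reuses the illustrative (assumed) API from the earlier example, with SVB denoting a scalar-vector byte operation:

    /* Illustrative sketch only: API names are assumed.
       Scales an array of bytes by 2 in scratchpad-sized chunks;
       for simplicity, n is assumed to be a multiple of chunk. */
    #include "vector.h"

    void scale_bytes( unsigned char *src, unsigned char *dst, int n, int chunk )
    {
        unsigned char *buf[2];
        buf[0] = (unsigned char *) vector_malloc( chunk );
        buf[1] = (unsigned char *) vector_malloc( chunk );

        vector_dma_to_vector( buf[0], src, chunk );   /* prefetch first chunk */
        vector_wait_for_dma();

        for( int i = 0, cur = 0; i < n; i += chunk, cur ^= 1 ) {
            /* Start fetching the next chunk; this DMA runs while we compute. */
            if( i + chunk < n )
                vector_dma_to_vector( buf[cur ^ 1], src + i + chunk, chunk );

            /* Byte elements: each 32-bit ALU operates on four of them at once. */
            vector_set_vl( chunk );
            vector( SVB, VMUL, buf[cur], 2, buf[cur] );
            vector_instr_sync();

            /* Write results back, then wait for all outstanding DMA
               (the writeback and the prefetch of the next chunk). */
            vector_dma_to_host( dst + i, buf[cur], chunk );
            vector_wait_for_dma();
        }
        vector_free();
    }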
The VENICE processor is designed to offer higher performance while using less area than VEGAS. Figure 2 demonstrates this with a speedup versus area plot, with both axes normalized to the Nios II/f performance and ALM count. VENICE dominates VEGAS in both area and speed, achieving a maximum speedup of 20.6× over the Nios II/f at an area overhead of 4.0×.

Figure 2. Speedup (geomean of 9 benchmarks) vs. area scaling.

REFERENCES

[1] C. H. Chou, A. Severance, A. D. Brant, Z. Liu, S. Sant, and G. Lemieux. VEGAS: Soft vector processor with scratchpad memory. In FPGA, pages 15–24, 2011.

[2] Z. Liu, A. Severance, S. Singh, and G. G. Lemieux. Accelerator compiler for the VENICE vector processor. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '12, pages 229–232, New York, NY, USA, 2012. ACM.