{yiannac,steffan,jayar}@eecg.utoronto.ca
ABSTRACT
Embedded systems are often implemented on FPGA devices and 25% of the time [2] include a soft processor: a processor built using the FPGA reprogrammable fabric. Because of their prevalence and flexibility, soft processors are compelling targets for customization, although current soft processors provide few architectural variations. Recent work has proposed augmenting soft processors with customizable vector processing support, enabling designers to easily scale performance by exploiting the data parallelism available in an application. However, this approach provides only coarse-grain scaling, by successively doubling the number of vector datapaths for less than double the performance.
In this work we further augment soft vector processors with more fine-grain architectural modifications: we add support for (i) vector chaining and (ii) heterogeneous vector lanes, allowing the soft vector processor to be customized not only to the data-level parallelism available in an application, but also to its functional unit demand. We evaluate area and wall clock performance with full hardware implementations on state-of-the-art FPGAs and find that chaining can provide a 15-45% average performance improvement for less area than doubling the lanes, and that heterogeneous lanes can save 6-13% area with little or no performance loss in some cases. Finally, we implement 1200 soft vector processor variants and find that, when choosing the best variant per application, peak performance per area can be increased over our base vector processor by an average of 13% and by up to 34%.
1. INTRODUCTION
FPGAs are commonly used to implement embedded systems because of their low cost and fast time-to-market. Approximately 25% of FPGA designs contain a processor implemented in the FPGA reprogrammable fabric [2], such as the Altera Nios II or Xilinx MicroBlaze. These soft processors provide a software design environment for quickly implementing system components which do not require highly-optimized hardware implementations and can instead be implemented in a soft processor that is customized to achieve the desired performance/area/power. Current commercial soft processors are based on simple single-issue pipelines with few architectural variations, motivating research on configurable soft processor architectures that enable further customization.
While the customization of traditional hard processors has been thoroughly studied, the trade-offs on an FPGA substrate can be vastly different yet accurately measured, including area, clock speed, and power. As a result, several architectural axes have recently been studied in a soft processor context, including: (i) single-issue in-order pipelines [17], which provide a limited design space; (ii) VLIW pipelines [11], which are constrained by the port limitations of FPGA block memories; (iii) multi-threaded pipelines [6, 7, 13] and multiprocessors [14, 15], which exploit thread-level parallelism but require parallelization of the software; and (iv) vector processors [20, 21], which can scale performance by instantiating multiple vector lanes (the per-element datapaths of a vector processor) to exploit the data-level parallelism in an application. However, the flexibility of recently proposed soft vector processors is primarily limited to scaling the number of vector lanes by powers of two, to avoid division and multiplication operations in the control logic. For example, lane scaling provides only seven different configurations between a one-lane soft vector processor that consumes a fraction of the smallest FPGA device and a 64-lane configuration that fills one of the largest FPGA devices currently available. Hence a system designer is given only very coarse-grain (powers-of-two) control over performance/area trade-offs when choosing an appropriate soft vector processor instantiation.
General Terms
Measurement, Performance, Design
CASES'09, October 11-16, 2009, Grenoble, France.
Copyright 2009 ACM 978-1-60558-626-7/09/10.
In this work we extend soft vector processors with architectural features that allow for more fine-grain customization than current single-issue soft vector processors. Specifically, we target the varying functional unit demand across applications by implementing (i) vector chaining, while parameterizing the number of vector instructions that can be simultaneously executed; and (ii) heterogeneous lanes, parameterizing the functional units that exist within individual lanes.
Vector Chaining A vector processor with support for vector chaining can begin executing the element operations of one vector instruction before completing all the element operations of a previous vector instruction [10]. To support this simultaneous execution of multiple vector instructions, many operands must be read/written from/to the vector register file simultaneously. The conventional solution is to exploit a many-ported register file, but this design is not well-suited to an FPGA substrate since the block memories on FPGAs are normally limited to only two ports. Instead we propose to support chaining through a banked register file, where the number of banks determines how many vector instructions can be in flight. For some benchmarks, this results in significantly better utilization of the existing functional units and even motivates the replication of functional units in high demand.
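To make the banking mechanism concrete, the following toy timing model (our own sketch under simplifying assumptions, not the VESPA RTL) treats each vector instruction as a walk through ceil(VL/L) element groups, with group g living in bank g mod B; one instruction issues per cycle, and an instruction advances only when the bank holding its next group is free:

```python
import math

def chained_cycles(num_instrs, vl, lanes, banks):
    """Cycles to finish back-to-back vector instructions in a toy model of
    chaining through a banked register file: each instruction processes
    ceil(vl/lanes) element groups, group g of any instruction resides in
    bank g % banks, one instruction is fetched per cycle, and older
    in-flight instructions win bank conflicts."""
    groups = math.ceil(vl / lanes)
    progress = [0] * num_instrs       # element groups completed per instruction
    issued, cycle = 0, 0
    while min(progress) < groups:
        if issued < num_instrs:       # single-issue front end
            issued += 1
        claimed = set()
        for i in range(issued):       # oldest first: priority on bank conflicts
            if progress[i] < groups:
                bank = progress[i] % banks
                if bank not in claimed:
                    claimed.add(bank)
                    progress[i] += 1
        cycle += 1
    return cycle

# 8 instructions, VL=64, 8 lanes: one bank serializes them (64 cycles), while
# 2 and 4 banks let 2 or 4 instructions chain, approaching 2x and 4x throughput.
for b in (1, 2, 4):
    print(b, chained_cycles(8, 64, 8, b))
```

When VL equals the number of lanes (one element group per instruction), the model degenerates to one instruction per cycle regardless of the bank count, anticipating the observation in Section 5 that chaining pays off only when vectors are longer than the number of lanes.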
Heterogeneous Lanes Typically the lanes in a vector processor are identical, requiring all functional units to exist in every lane even when data-parallel code uses only some of them. We introduce the ability to support a given functional unit in only a subset of the lanes, resulting in heterogeneous lanes where some lanes are missing certain functional unit types. Element operations destined for those lanes are time-multiplexed onto the lanes which do support the required functional unit. A designer can therefore instantiate the exact number of desired functional units, similar to what would be done in a custom hardware design.
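As a rough cost model of this time-multiplexing (our own sketch; the input/output queues of Figure 4 are abstracted away), an instruction whose functional unit exists in only X of the L lanes needs ceil(L/X) passes per element group:

```python
import math

def instr_cycles(vl, lanes, unit_lanes):
    """Approximate cycles for one vector instruction when only `unit_lanes`
    of the `lanes` contain the required functional unit: each element group
    of width `lanes` is time-multiplexed onto the equipped lanes."""
    groups = math.ceil(vl / lanes)            # element groups to process
    passes = math.ceil(lanes / unit_lanes)    # serialization across lanes
    return groups * passes

# The setup of Figure 4 (4 lanes, X=1 multiplier lane): a VL=64 multiply takes
# 4x longer than with homogeneous lanes, while ALU instructions are unaffected.
print(instr_cycles(64, 4, 1))   # 64 cycles, single multiplier lane
print(instr_cycles(64, 4, 4))   # 16 cycles, multiplier in every lane
```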
We evaluate these modifications using a full in-hardware implementation on a Stratix III FPGA executing data-parallel EEMBC benchmarks. We show that chaining gains significant performance at a more modest area cost than doubling the number of lanes. We show that heterogeneous lanes can provide area savings over homogeneous lanes with little or no performance degradation. Finally, compared to all previously possible configurations, we gain up to 34% performance-per-area after exhaustively searching the design space to maximize performance-per-area on a per-application basis.
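Concretely, the per-application selection maximizes performance per unit area over the measured configurations; a minimal sketch with a hypothetical data layout:

```python
def best_variant(measurements):
    """Given {config: (wall_clock_seconds, area_in_equivalent_alms)}, return
    the configuration maximizing performance per unit area, taking performance
    as the reciprocal of wall clock time."""
    return max(measurements,
               key=lambda c: 1.0 / (measurements[c][0] * measurements[c][1]))

# Made-up numbers: a smaller chained variant can beat a wider one per area.
print(best_variant({"8_lanes_1_bank": (0.020, 15000),
                    "4_lanes_2_banks": (0.024, 9500)}))  # 4_lanes_2_banks
```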
Our goal is to enable fine-grain customization of soft processors, allowing an FPGA-based embedded systems designer to use a few architectural parameters to specify a soft processor optimized for specific application and system design requirements, thereby reducing the amount of laborious hardware design. In the long term we envision that FPGA CAD tools will employ soft processor generators in conjunction with heuristics for automatically mapping applications to architectural configurations under given performance/area constraints.
1.3 Contributions
This paper makes the following contributions: (i) we implement VESPA on a state-of-the-art Stratix III FPGA device while accurately measuring the area, clock frequency, and cycle performance of our modifications using full EEMBC benchmarks executed from off-chip DDR2 memory; (ii) we propose and evaluate an FPGA-specific implementation of vector chaining with the required register file bandwidth provided exclusively via banking; (iii) we propose and investigate heterogeneous vector lanes in a soft vector processor; and (iv) we exhaustively explore a design space of 1200 VESPA configurations and show that these modifications allow for more fine-grain architectural customization as well as better performance per area.
[Figure 1: The VESPA processor system: a scalar MIPS core and a vector coprocessor with an issue unit and L vector lanes, sharing an instruction cache and data cache with prefetching, connected through a memory crossbar and arbiter to off-chip DRAM.]

[Figure: The scalar MIPS pipeline (instruction fetch, decode, register file, ALU, writeback) alongside the VESPA vector pipeline: vector control (VC) and vector scalar (VS) register files, instruction replication and hazard checking, address generation, and a vector register file banked into even-element and odd-element banks (Bank 0 and Bank 1), each with its own ALU, multiply/shift and saturation logic, and memory unit, with muxing between banks.]

Table 1: Configurable parameters of the VESPA soft vector processor (* marks the new parameters introduced in this work).

Parameter                Symbol  Value Range
Vector Lanes             L       1,2,4,8,16,...
Memory Crossbar Lanes    M       1,2,4,8,...,L
Multiplier Lanes*        X       1,2,4,8,...,L
Register File Banks*     B       1,2,4,...
ALU per Bank*            APB     true/false
Maximum Vector Length    MVL     2,4,8,16,...
Vector Lane Bit-Width    W       1,2,3,4,...,32
Each Vector Instruction  -       on/off
ICache Depth (KB)        ID      4,8,...
ICache Line Size (B)     IW      16,32,64,...
DCache Depth (KB)        DD      4,8,...
DCache Line Size (B)     DW      16,32,64,...
DCache Miss Prefetch     DPK     1,2,3,...
Vector Miss Prefetch     DPV     1,2,3,...

2. IMPLEMENTING FINE-GRAIN CUSTOMIZATIONS
[Figure 4: Heterogeneous lanes support for multipliers on a VESPA with 4 lanes and X=1: multiply element operations from all four lanes pass through input and output queues to be time-multiplexed onto the single multiplier-equipped lane.]
3. MEASUREMENT METHODOLOGY
In this section we describe the infrastructure used for executing, verifying, and evaluating the new VESPA features. Specifically, we describe our hardware platform, processor system, verification process, CAD tool measurement methodology, benchmarks, and compiler.
Hardware Platform All processors are fully synthesized and implemented on an FPGA system. We use the Terasic DE3-340 board equipped with a single Stratix III EP3SL340H1152C3, one of the largest state-of-the-art FPGAs currently available. We also use a 1GB DDR2-533 memory device for the storage of instructions and data in a program.
Processor System Each design consists of the VESPA soft vector processor with separate first-level direct-mapped instruction and data caches, and the Altera DDR2 full-rate memory controller that connects to the DDR2 DIMM. The VESPA configurations are capable of 100-110 MHz clock rates on the mid-speed 3S340C3 device. However, we clock all designs at 100 MHz and the memory system at 266 MHz, and then correct the wall clock time using the highest clock frequency achievable by that design on a faster 3S340C2. This allows us to model the performance of high-end FPGAs without owning them. The time dilation effects between the processor and memory from this correction generally do not affect the results significantly.
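As a minimal sketch of this correction (our reading of the methodology; the numbers are illustrative), the reported wall clock time is the measured cycle count divided by the per-design achievable frequency rather than by the fixed 100 MHz board clock:

```python
def corrected_wall_clock(cycle_count, fmax_mhz):
    """Wall clock time reported for a design: cycles measured during the
    100 MHz board run, re-timed at the clock frequency the design achieves
    on the faster speed grade."""
    return cycle_count / (fmax_mhz * 1e6)   # seconds

# A 5M-cycle benchmark run credited with an achievable 110 MHz clock:
print(corrected_wall_clock(5_000_000, 110))   # ~0.0455 s instead of 0.05 s
```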
Testing All instances of VESPA are fully tested in hardware using the built-in checksum values encoded into each EEMBC benchmark. Debugging is guided by comparing traces of all writes to the scalar and vector register files: a trace is extracted from RTL simulation using Modelsim and compared against an analogous trace obtained from instruction-set simulation using the MINT [16] MIPS simulator augmented with the VIRAM extensions. Altera SignalTap II is used for in-hardware debugging.
Table 2: EEMBC benchmark applications, their input/output sizes, and the largest vector element width used.

Benchmark     Input size (B)  Output size (B)  Largest Vector Element
autcor        1024            64               32 bits
conven        517             1024             1 bit
rgbcmyk       1628973         2171964          8 bits
rgbyiq        1156800         1156800          16 bits
fbital        1536            512              16 bits
viterb        688             44               16 bits
ip_checksum   40960           40               32 bits
imgblend      153600          76800            16 bits
filt3x3       76800           76800            16 bits
FPGA CAD Tools A key value of performing FPGA-based processor research directly on an FPGA is that we can attain high-quality measurements of the area consumed and the clock frequency achieved; these are provided by the FPGA CAD tools. We use aggressive timing constraints to maximize the CAD tools' effort, with default optimization settings but with register retiming and register duplication enabled. Through experimentation we found that these settings provided the best area, delay, and runtime trade-off. We also performed 8 such runs for every vector configuration to average out the non-determinism in modern CAD algorithms. The silicon area of each FPGA resource relative to a single Adaptive Logic Module (ALM) was supplied to us by Altera [5] for the Stratix II. We extrapolated this for the Stratix III and used these equivalent areas to calculate the total silicon area consumed on the Stratix III, measured in units of equivalent ALMs (the silicon area of a single ALM including its routing).
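The computation itself is a weighted sum; in the sketch below the relative-area weights are hypothetical placeholders (the real Stratix numbers were supplied privately by Altera [5]):

```python
# Hypothetical relative silicon areas, in units of one ALM including routing.
RELATIVE_AREA = {"ALM": 1.0, "M9K": 30.0, "M144K": 300.0, "DSP": 60.0}

def equivalent_alms(resource_counts):
    """Total silicon area in equivalent ALMs: each FPGA resource count is
    weighted by its silicon area relative to a single ALM."""
    return sum(RELATIVE_AREA[r] * n for r, n in resource_counts.items())

print(equivalent_alms({"ALM": 12000, "M9K": 40, "DSP": 16}))  # 14160.0
```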
[Figure 5: Cycle speedup of 1- to 32-lane VESPA configurations across the benchmarks.]

[Figure 6: Performance/area design space of 1-32 lane VESPA cores with full memory support (cycle time versus area in equivalent ALMs, log-log scale).]

4. COARSE-GRAIN TRADE-OFFS: VECTOR LANES
This performance scaling still exists between 1 and 16 lanes on our state-of-the-art hardware platform. We also measure the effect of 32 lanes for the first time and observe that the performance scaling continues for benchmarks which have the available data parallelism. We see 10x average performance for 16 lanes, and 14x for 32 lanes with a peak of 24x.
Figure 6 shows the area/performance space for these configurations and highlights the coarse-grain nature of using vector lanes to trade area for performance. The area cost of increasing the number of lanes can be substantial; for example, growing from 8 to 16 lanes requires more than 10000 ALMs worth of silicon. While this powerful parameter allows VESPA to take leaps in the area/performance space, our new architectural parameters enable more fine-grain area/performance trade-offs, as shown in the next section.
[Figure 7: Cycle speedup from vector chaining relative to an unchained 1 bank/1 ALU VESPA, for 2 banks/1 ALU (average area 1.27x), 2 banks/2 ALUs (1.34x), 4 banks/1 ALU (1.59x), and 4 banks/4 ALUs (1.92x).]

[Figure 8: Cycle speedup of the same chaining configurations as the number of lanes grows from 1 to 16.]

5. FINE-GRAIN TRADE-OFFS
Although adding banks multiplies the peak register file bandwidth of the 1 bank configuration, our benchmarks and single-issue pipeline with locking cache cannot exploit this peak performance.
Figure 8 shows that the speedup achieved from banking is reduced as the number of lanes increases. Chaining allows multiple vector instructions to be executed if both the appropriate functional unit and register bank are available. But because only one instruction is fetched per cycle, chaining is only effective when the vector instructions are long enough to stall the vector pipeline; in other words, when the length of a vector is greater than the number of lanes. As the number of lanes increases, vector instructions complete more quickly, providing less opportunity for overlapping execution. In the slowest vector processor, speedups from banking average as high as 60% across our benchmarks, while in the fastest, banking achieves only a 23% speedup. The 1-lane vector processor represents the peak speedup achievable under extremely high load with long vector operations on the vector coprocessor.
The vector register file is composed of many FPGA block RAMs. Given block RAMs with maximum width W_BRAM and total depth D_BRAM, and using the parameters from Table 1, the number of block RAMs is equal to the greater of L*W*B / W_BRAM and 32*MVL*W / D_BRAM. For vector processors with many lanes, which make the first expression greater, adding more banks increases the number of block RAMs used. For example, increasing from 1 to 4 banks with no ALU replication on a 16-lane VESPA with MVL=128 adds 38% area just in block RAMs and 56% in total. On a design with many unused block RAMs this increase can be justified; moreover, the added capacity of the block RAMs can be fully utilized by the vector processor with a corresponding increase in MVL.
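The formula transcribes directly into code, rounding each term up; the block RAM geometry below is a placeholder (for example, one mode of a Stratix III M9K is 36 bits wide and holds 9216 bits):

```python
import math

def vector_regfile_brams(L, W, B, MVL, w_bram=36, d_bram=9216):
    """Block RAMs needed for the vector register file: the greater of the
    width-limited count (each of the B banks supplies L elements of W bits
    per cycle) and the depth-limited count (32 vector registers of MVL
    elements of W bits must fit)."""
    width_limited = math.ceil(L * W * B / w_bram)
    depth_limited = math.ceil(32 * MVL * W / d_bram)
    return max(width_limited, depth_limited)

# 16 lanes, W=32, MVL=128: the width-limited term dominates, so going from
# 1 bank to 4 banks roughly quadruples the block RAM count in this model.
print(vector_regfile_brams(16, 32, 1, 128))   # 15
print(vector_regfile_brams(16, 32, 4, 128))   # 57
```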
Figure 9 shows the wall clock time versus area space of the no-chaining configurations (solid dots) from 1 to 16 lanes, identical to Figure 6. We overlay two vector chaining configurations on the same figure and observe that the points with 2 banks appear about one third of the way to the next solid dot, showing that chaining can trade area for performance at finer increments than doubling lanes. Note that the 4-bank configurations are omitted since their area cost is significant and the additional performance is often modest compared to 2 banks. Since we have complete measurement capabilities for area and performance, we are able to identify that vector chaining in this case is indeed a trade-off and not a global improvement (it did not move VESPA toward the origin of the figure).
[Figure 9: Wall clock time versus area of the 1- to 16-lane configurations without chaining (1 bank, 1 ALU), overlaid with the 2 banks/1 ALU and 2 banks/2 ALUs chaining configurations.]

[Figure: Clock frequency (MHz) of the banking configurations across 1 to 16 lanes.]

[Figure: Cycle speedup of heterogeneous multiplier lane configurations from X=1 (relative area 0.87) to X=32 (relative area 1.0), with clock frequencies between 93 and 97 MHz.]
Another option for fine-grain area/performance trade-offs is to use lane configurations that are not powers of two, but this results in cumbersome control logic involving multiplication and division operations. Since the control logic is often critical, and the additional area overhead significant, this approach would likely generate inferior configurations that, in terms of Figure 9, would form a curve further from the origin than the processors with power-of-two lane counts. Chaining, on the other hand, is shown to compete directly with these configurations, and in Section 6 is shown to even improve performance per unit area. Note that instruction scheduling in software could further improve the performance of vector chaining, but in many of our benchmarks only very little rescheduling was either necessary or possible, so we did not manually schedule instructions to exploit chaining.
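To see why power-of-two lane counts keep this control logic cheap (our own sketch), note that the per-element lane and group computations reduce to a bit mask and a shift, while an arbitrary lane count needs true modulo and divide hardware:

```python
def element_to_lane_pow2(i, lanes):
    """Map element i to (lane, element group) when `lanes` is a power of two:
    modulo becomes a mask and division becomes a shift, both cheap in FPGA
    logic."""
    shift = lanes.bit_length() - 1        # log2(lanes)
    return i & (lanes - 1), i >> shift

def element_to_lane_any(i, lanes):
    """The same mapping for an arbitrary lane count puts modulo and divide
    units on the (often critical) control path."""
    return i % lanes, i // lanes

assert element_to_lane_pow2(13, 8) == element_to_lane_any(13, 8) == (5, 1)
```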
[Figure: Normalized wall clock time (log scale, 1x to 32x) across the design space, with parameters grouped into Memory, ISA, and Compute categories.]
Table 3: VESPA parameter values explored in the design space exploration (* marks the new parameters; values for parameters not listed were held fixed).

Parameter                Symbol  Explored
Vector Lanes             L       1,2,4,8,16,32
Memory Crossbar Lanes    M       L, L/2
Multiplier Lanes*        X       L, L/2
Register File Banks*     B       1,2,4
ALU per Bank*            APB     true/false
Maximum Vector Length    MVL     128, 256
Vector Lane Bit-Width    W       8, 32
DCache Line Size (B)     DW      16, 64
Vector Miss Prefetch     DPV     off, 8*VL
[Figure: Normalized wall clock time versus area (4096 to 65536 equivalent ALMs) for the explored design space.]
7. CONCLUSIONS
8. REFERENCES
[1] The Embedded Microprocessor Benchmark Consortium. http://www.eembc.org.
[2] T. Allen. Altera Corporation. Private communication, 2009.
[3] K. Asanovic. Vector Microprocessors. PhD thesis, University of California, Berkeley, 1998.
[4] J. Cho, H. Chang, and W. Sung. An FPGA based SIMD processor with a vector memory unit. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), May 2006.
[5] R. Cliff. Altera Corporation. Private communication, 2005.
[6] R. Dimond, O. Mencer, and W. Luk. CUSTARD - a customisable threaded FPGA soft processor and tools. In International Conference on Field Programmable Logic (FPL), August 2005.
[7] B. Fort, D. Capalija, Z. G. Vranesic, and S. D. Brown. A multithreaded soft processor for SoPC area reduction. In IEEE Symposium on Field-Programmable Custom Computing Machines, pages 131-142, Washington, DC, USA, 2006.
[8] M. Hasan and S. Ziavras. FPGA-based vector processing for solving sparse sets of equations. In
[9]
[10]
[11]
[12]
[13]
[14]
[15]