Study On Column Wise Design Compaction For Reconfigurable Systems
Authorized licensed use limited to: NUST School of Electrical Engineering and Computer Science (SEECS). Downloaded on June 26,2022 at 19:10:19 UTC from IEEE Xplore. Restrictions apply.
A_module is the number of columns reserved for the design, multiplied by the number of slices per column N_slices,col. The minimum internal fragmentation F_i is:
The analysis concerning power consumption is made by the Xilinx XPower tool, which provides the dynamic and static power consumption of an implemented design. For all power analyses of a design, the clocks as well as all input signals were toggled by the highest frequency that was reported by the tools. Additionally, the toggle rates of all internal synchronous signals were set to 50%.
2.1. Small Designs
The smallest design we have investigated is an 8-bit divider. The synthesizable Verilog source code, taken from an Opencores project [3], is parameterizable and fully pipelined. It takes an 8-bit dividend and a 4-bit divisor as inputs and returns a 4-bit quotient, a 4-bit remainder, a division-by-zero flag, and an overflow flag. As the hardware description is easy to parameterize, we used the divider to investigate several design sizes. The synthesized divider including the bus interface (17 slices) consumes 95 slices, which fit in a single XCV2000E CLB column (N_slices,col = 160), resulting in an internal fragmentation of about 41%. We implemented the divider in 1, 2, 4, 8, 20, 45, and 120 (whole chip) columns. Naturally, a pipelined divider has a two-dimensional structure and thus is expected to cause problems when implementing it in a column-wise (one-dimensional) way. Surprisingly, all implementations, even the single column one, could be mapped and routed completely. As expected, the most interesting parts of the diagram are the results for narrow implementations (see figure 1). For the 120 column implementation down to the 2 column implementation the worst path delay is always between 9.2 and 9.7 ns, but for the single column implementation it increases to 17.9 ns (see table 1). This is an increase of about 94% in comparison to the best implementations, presumably due to the increased capacitance of the longer signal wires as the available area is cut in half when stepping down from 2 to 1 column. A similar effect can be noticed for the power consumption. Except for the single column implementation, the power consumption is always around 100 mW. However, the power consumption increases to 108 mW for the narrowest implementation, which is about 8% in comparison to the best implementation. Both effects can be explained by the extreme aspect ratio of the one column implementation, leading to longer wires and higher switch box utilization. However, spending an extra column for the divider leads to a much smaller increase of only 5% for the timing and 1% for the power consumption in comparison to the best implementation.
The second design we have investigated is a LDPC decoder, also taken from an Opencores project [3]. Low-density parity-check (LDPC) codes are forward error correction codes invented by Robert Gallager in the early 60s. The synthesized decoder consumes 148 slices and therefore can also be mapped to a single column. However, in contrast to the 8-bit divider, the decoder would reach an internal fragmentation of only 7.5%. This means, only 12 slices out of 160 would not be occupied. Although we used the same area constraints as for the divider it was not possible to route the single column decoder with any of the timing constraints. This is due to the combination of the extreme aspect ratio and the high slice utilization. With the help of the FPGA Editor it could be clearly seen that there was lack of vertical wiring resources especially in the middle of the design. An overview of all results is given in table 1.
The next design is a 16-bit divider, which is identical to the mentioned 8-bit divider, except for the data width. This divider consumes 271 slices, leading to a minimum of two XCV2000E CLB columns. The internal fragmentation for a two column implementation is only 15.3% which means 271 out of 320 slices are occupied. This time we used 2, 3, 4, 8, 20, 45 and 120 columns to implement the divider. Again, except for the two column implementation, the results are quite similar to each other. However, for the narrowest implementation the timing increases from about 16 ns to almost 25 ns which is approximately 57% (see table 1). The same happens to the power consumption, which increases from about 118 mW to 138 mW which is an increase of about 17% in comparison to the best implementation. Although there is still an increase in the resulting timing, it is not as big as for the small 8-bit divider.
Figure 1: Results for the 8-bit divider (x-axis: number of CLB columns)
Furthermore, we investigated an FIR-filter (Finite Impulse Response) which is common in DSP applications [3]. It consists of a delay bank (filter taps) and a sum-of-products, which is pipelined with a register after every multiplier and adder. The filter has an order of 3 and an input and coefficient precision of 8-bit. Together with the bus interface it consumes 306 slices which fit into two CLB columns. This two column implementation has an internal
fragmentation of only 4.4% which is even less than the internal fragmentation of the two column 16-bit divider implementation (15.3%). Due to the high slice utilization it was questionable if such a design would be still routable, although the structure of an FIR filter is rather one-dimensional, and therefore is expected to suit a column-wise compaction. The same area constraints as for the 16-bit divider were used and the two column filter could be routed completely. The resulting worst path delays for the 3 column up to the 120 column implementations range from 12.3 ns to 13.9 ns, but again the timing for the narrowest implementation increases by about 48% to 18.2 ns (see table 1). The power consumption for the two column implementation increases by about 9% to 601 mW in comparison to the best implementation (45 columns). Although the FIR filter has higher slice utilization in the 2 column implementation than the 16-bit divider, it results in a lower increase in timing and power consumption. This can be explained by the two-dimensional structure of the divider mentioned above, which does not suit a column-wise implementation.
2.2. Medium Designs
The first design of the medium-sized category is a 32-bit divider. Again, this divider is similar to the ones before, except that the data width has been increased to 32-bit. This leads to a design size of 844 slices which can be mapped to 6 CLB columns resulting in a module size of 960 and a minimum internal fragmentation of 12.1%. As the design size increases the negative effect on timing and power consumption decreases. For this 32-bit divider the timing for the narrowest (6 columns) implementation increases only by 22% in comparison to the best implementation (20 columns) and the power consumption increases by 11% (see table 1).
The task of a Digital Linear Controller is to influence the dynamic behaviour of a plant. Therefore, the controller has to perform several matrix multiplications and additions in order to compute the next output out of the inputs and the inner states. The hardware implementation we used performs these calculations extensively in parallel [4]. The investigated controller has three inputs of 16-bit width, two inner states and three outputs of 32-bit width. The resulting implementation consumes 1055 slices, which fit into 7 CLB columns. This narrowest implementation leads to an internal fragmentation of only 5.8%. The best timing for the controller was 17.3 ns (70 columns) and again the worst timing of 19 ns resulted from the narrowest implementation (7 columns). However, this time the increase was only 10.6%. The power consumption increased from 794 mW (120 columns) to 873 mW which is only 9.9% (see table 1).
The next design we investigated is a Rijndael Encryption Block. The Rijndael algorithm was selected by the U.S. National Institute of Standards and Technology as the Advanced Encryption Standard (AES) which is used for securing sensitive material. The Verilog hardware description of the Rijndael encryption algorithm has also been taken from an OpenCores project [3]. This Rijndael implementation can perform a complete encrypt sequence in 12 clock cycles. The synthesized Rijndael encryption block including the bus interface consumes 2120 slices, which fit into 14 XCV2000E CLB columns. The resulting internal fragmentation for such an implementation is only 5.4%. The increase of the resulting timing for the narrowest implementation (20.5 ns) in comparison to the best implementation (18.3 ns) does not exceed 12%. The power consumption increased by 6% from 2508 mW for the 120 column implementation to 2658 mW for the 14 column implementation (see also table 1).
2.3. Large Designs
In the large designs category we have analysed two designs. The first one is an accelerator for octree based 3D-graphics called Octchip [5]. The internal structure of this design is comparable to a CPU; it consists of an instruction and decode unit, a large register block and several execution state machines. One speciality is the so called Bitblock unit, which performs several binary operations on two or three 64-bit registers. The hardware implementation of these operations (copy, move, flip, rotate, and, or etc.) results in a large block of multiplexers and
Table 1: Results of the design study
8-bit Divider, LDPC Decoder, 16-bit Divider, FIR Filter, 32-bit Divider, Linear Controller, AES Rijndael, Octchip, S-Core CPU
wires. The design size of the whole Octchip including the bus interface is 3778 slices. Consequently, the design can be mapped to 24 CLB columns, leading to an internal fragmentation of 1.61%. This means only 62 slices out of 3840 are not occupied. In this case the timing of the narrowest implementation (39.1 ns) was even 2% better than the 120 column implementation, but still 2.3% worse than the best implementation (cf. table 1). The power consumption increased from 1944 mW (55 columns) to 1995 mW, which is only about 3%.
Finally, the biggest design we investigated is our own 32-bit RISC Processor S-Core [6]. The S-Core is a one-address machine with a load/store architecture, two banks of 16 32-bit registers, and a three-stage pipeline. An implementation of the S-Core VHDL description consumes 5730 slices in a Virtex XCV2000E FPGA. Consequently, at least 36 CLB columns are necessary, leading to a minimum internal fragmentation of only 0.5%. For this design we observed the smallest effects on the resulting timing and power consumption when implementing it in the most compact way (36 columns). The timing for the narrowest implementation (55.9 ns) increased by only 1% in comparison to the whole-chip implementation and by only 4.3% in comparison to the best implementation (45 columns). Regarding power consumption, no increase could be observed in comparison to the whole-chip implementation. Throughout all implementations, the power consumption ranges from 994 mW to 1050 mW.
highest timing increase of 94% was determined for a one column implementation of the 8-bit divider. However, this divider consumes almost 60% of the slices available in one column and amazingly it was still completely routable. The increase in power consumption was always below 20% for all narrow implementations of the small designs. The main reason for the increase in timing and power consumption is the extreme aspect ratio of these narrow implementations, which leads to a high switch box and routing resource utilization as well as long signal wires. Spending an extra column for these small designs heavily increases the available resources, which leads to a higher internal fragmentation, but also to a much better timing and power consumption (increase smaller than 10%). However, this problem disappears when these small designs become part of a bigger design, which is more likely.
The negative effects of column-wise compaction were much smaller for medium-sized designs. The worst timing was always less than 25% (mostly below 15%) higher than the respective best timing. The increase of power consumption was below 12% for all designs.
Finally, for the biggest designs (Octchip and S-Core) the negative influence of a compact implementation almost disappeared completely. The increases of both timing and power consumption were always below 5%. Evidently, the bigger the design size, the smaller the effects on the resulting timing and power consumption. However, the structure of the design can have a great effect as well, as could be seen when comparing the results of the 16-bit divider and the FIR filter.
All in all, the results of the study are very promising: almost all designs were routable and the effects on the timing and power consumption are either reasonable or nonexistent.
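The fragmentation and increase figures quoted in the study can be checked with a short Python sketch. It assumes that internal fragmentation is the unused share of the reserved module area, that is (A_module - design slices) / A_module with A_module = columns * N_slices,col. This form is an assumption inferred from the definitions and the reported numbers, not a formula taken from the text, and all function names are hypothetical.

```python
import math

# N_slices,col for the XCV2000E, as stated in the text.
SLICES_PER_COLUMN = 160


def min_columns(design_slices, slices_per_column=SLICES_PER_COLUMN):
    """Smallest number of CLB columns that can hold the design."""
    return math.ceil(design_slices / slices_per_column)


def internal_fragmentation(design_slices, columns,
                           slices_per_column=SLICES_PER_COLUMN):
    """Assumed definition: unused fraction of the reserved module area."""
    module_area = columns * slices_per_column  # A_module
    if design_slices > module_area:
        raise ValueError("design does not fit into the reserved columns")
    return (module_area - design_slices) / module_area


def relative_increase(value, best):
    """Increase of `value` over `best`, as a fraction of `best`."""
    return (value - best) / best


# Slice counts as reported in the text for each design.
DESIGN_SLICES = {
    "8-bit divider": 95,
    "LDPC decoder": 148,
    "16-bit divider": 271,
    "FIR filter": 306,
    "32-bit divider": 844,
    "Linear controller": 1055,
    "AES Rijndael": 2120,
    "Octchip": 3778,
    "S-Core CPU": 5730,
}

for name, slices in DESIGN_SLICES.items():
    cols = min_columns(slices)
    frag = internal_fragmentation(slices, cols)
    print(f"{name:18s} {slices:5d} slices  {cols:3d} columns  "
          f"fragmentation {100 * frag:5.2f}%")
```

For example, 95 slices in one 160-slice column leave 65 slices unused, about 41%, which agrees with the value reported for the 8-bit divider. Likewise, `relative_increase(873, 794)` gives roughly 0.099, in line with the 9.9% power increase reported for the linear controller.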
4. References
[1] H. Kalte, M. Porrmann, and U. Rückert. System-on-programmable-chip approach enabling online fine-grained 1D-placement. In Proc. of RAW 2004. Santa