
Study On Column Wise Design Compaction For Reconfigurable Systems

Column-wise module placement can improve the performance of reconfigurable systems. This study investigates the effects of column-wise module implementation on frequency and power consumption. Nine designs of varying sizes were mapped to different column ranges on a Xilinx FPGA. Both maximum frequency and internal fragmentation, which measures resource utilization, were analyzed for each implementation.


Study on Column Wise Design Compaction for Reconfigurable Systems

H. Kalte, G. Lee
University of Western Australia, School of Computer Science & Software Engineering
35 Stirling Highway, Crawley WA 6009
Email: {heiko, gel}@csse.uwa.edu.au

M. Porrmann, U. Rückert
Heinz Nixdorf Institute, System and Circuit Technology, University of Paderborn
33102 Paderborn, Germany
Email: {porrmann, rueckert}@hni.upb.de

Abstract

Some of the currently available Field Programmable Gate Arrays (FPGAs) can be reconfigured partially, which makes it possible to build dynamic systems that can be adapted to changing demands at runtime. One basic aspect of such a system is the way the dynamic hardware modules are placed on the FPGA. As most FPGAs offer partial reconfiguration in a column-wise manner, a 1D placement of column-wise implemented modules seems promising. In this paper we present a design study that determines the effects of a column-wise module implementation on the resulting frequency and power consumption.

1. Introduction

Dynamically reconfigurable systems suffer neither from the inflexibility of application-specific integrated circuits (ASICs) nor from the mostly sequential processing of CPUs. Arbitrary functions can be implemented in hardware and can be downloaded as an exchangeable module in a dynamically reconfigurable system.

We have developed a reconfigurable system approach that enables online fine-grained 1D placement of dynamically exchangeable modules [1]. In our approach all modules occupy the full height of the reconfigurable array while the width varies with the complexity of the modules. Thus, the placement of the modules is simplified to a one-dimensional problem, easing online placement and defragmentation strategies. We propose a classical tri-state bus system as an inter-module communication infrastructure. This bus system is built out of neighboring wire segments and tri-state buffers and spans the whole chip horizontally. As this communication infrastructure is completely homogeneous, it enables dynamic relocation of pre-synthesized modules along the horizontal bus structure. Besides the development of online allocation, de-allocation and re-location mechanisms, we also proposed solutions to the aspect of defragmentation. One special feature of our approach is the possibility to dynamically adapt the communication infrastructure to the requirements of the modules. For this purpose, the tri-state bus can be segmented by the integration of bridges. By this means multiple independent bus systems can be created in order to increase the overall communication bandwidth. A more detailed description of the approach can be found in [1].

A basic requirement of our approach is an efficient implementation of the modules in a column-wise manner. In particular, implementing rather small designs in big FPGAs results in a tall and narrow shape that is expected to affect the resulting maximum frequency and the power consumption of the designs. The results of a design study that investigates these effects are presented in the following.

2. Design Study

In order to investigate the effects of a column-wise design implementation on the maximum frequency and the power consumption, we implemented 9 different designs on a Xilinx Virtex XCV2000E device with speed grade -6. The designs are taken from areas such as arithmetic, linear control systems, cryptography, digital signal processing (DSP), graphics and CPU. The design sizes range from 0.4% up to 29% of the slice resources available in an XCV2000E device. In order to group designs of similar size, we have categorized them as small (≤ 2 CLB cols.), medium (3-15 CLB cols.) and large (16-120 CLB cols.). Synthesis as well as map and route have been done with the Xilinx ISE 6.1 tools. In order to restrict the placement and routing to a certain area, a tool flow has been used which is similar to the modular design flow described in [2]. To ensure accurate results, we have determined seven different column ranges for each design, and for each area constraint at least three different implementations with varying timing constraints have been realized. Via an additional bus interface, each design has been connected to a horizontal communication bus system during the implementation process (see [1] and figure 2).

Besides the resulting worst path delay, which is the reciprocal of the maximum design frequency, we use the term internal fragmentation to measure the resource utilization of each implementation. This term describes the ratio between design size and module size. The design size is the raw number of slices the design consists of, and the module size

0-7803-8652-3/04/$20.00 © 2004 IEEE 413 ICFPT 2004

Authorized licensed use limited to: NUST School of Electrical Engineering and Computer Science (SEECS). Downloaded on June 26,2022 at 19:10:19 UTC from IEEE Xplore. Restrictions apply.
is the number of columns reserved for the design, multiplied by the number of slices per column. The minimum internal fragmentation $F$ is:

$$F = 1 - \frac{N_{slices}}{N_{cols} \cdot N_{slices,col}}$$

where $N_{slices}$ is the design size, $N_{cols}$ is the smallest number of columns that can hold the design, and $N_{slices,col}$ is the number of slices per column.

The analysis concerning power consumption is made with the Xilinx XPower tool, which provides the dynamic and static power consumption of an implemented design. For all power analyses of a design, the clocks as well as all input signals were toggled at the highest frequency that was reported by the tools. Additionally, the toggle rates of all internal synchronous signals were set to 50%.

2.1. Small Designs

The smallest design we have investigated is an 8-bit divider. The synthesizable Verilog source code, taken from an OpenCores project [3], is parameterizable and fully pipelined. It takes an 8-bit dividend and a 4-bit divisor as inputs and returns a 4-bit quotient, a 4-bit remainder, a division-by-zero flag, and an overflow flag. As the hardware description is easy to parameterize, we used the divider to investigate several design sizes. The synthesized divider including the bus interface (17 slices) consumes 95 slices, which fit in a single XCV2000E CLB column ($N_{slices,col} = 160$), resulting in an internal fragmentation of about 41%. We implemented the divider in 1, 2, 4, 8, 20, 45, and 120 (whole chip) columns. Naturally, a pipelined divider has a two-dimensional structure and thus is expected to cause problems when implementing it in a column-wise (one-dimensional) way. Surprisingly, all implementations, even the single column one, could be mapped and routed completely. As expected, the most interesting parts of the diagram are the results for narrow implementations (see figure 1). For the 120 column implementation down to the 2 column implementation the worst path delay is always between 9.2 and 9.7 ns, but for the single column implementation it increases to 17.9 ns (see table 1). This is an increase of about 94% in comparison to the best implementations, presumably due to the increased capacitance of the longer signal wires as the available area is cut in half when stepping down from 2 to 1 column. A similar effect can be noticed for the power consumption. Except for the single column implementation, the power consumption is always around 100 mW. However, the power consumption increases to 108 mW for the narrowest implementation, which is about 8% in comparison to the best implementation. Both effects can be explained by the extreme aspect ratio of the one column implementation, leading to longer wires and higher switch box utilization. However, spending an extra column for the divider leads to a much smaller increase of only 5% for the timing and 1% for the power consumption in comparison to the best implementation.

The second design we have investigated is an LDPC decoder, also taken from an OpenCores project [3]. Low-density parity-check (LDPC) codes are forward error correction codes invented by Robert Gallager in the early 60s. The synthesized decoder consumes 148 slices and therefore can also be mapped to a single column. However, in contrast to the 8-bit divider, the decoder would reach an internal fragmentation of only 7.5%. This means only 12 slices out of 160 would not be occupied. Although we used the same area constraints as for the divider, it was not possible to route the single column decoder with any of the timing constraints. This is due to the combination of the extreme aspect ratio and the high slice utilization. With the help of the FPGA Editor it could be clearly seen that there was a lack of vertical wiring resources, especially in the middle of the design. An overview of all results is given in table 1.

The next design is a 16-bit divider, which is identical to the mentioned 8-bit divider, except for the data width. This divider consumes 271 slices, leading to a minimum of two XCV2000E CLB columns. The internal fragmentation for a two column implementation is only 15.3%, which means 271 out of 320 slices are occupied. This time we used 2, 3, 4, 8, 20, 45 and 120 columns to implement the divider. Again, except for the two column implementation, the results are quite similar to each other. However, for the narrowest implementation the timing increases from about 16 ns to almost 25 ns, which is approximately 57% (see table 1). The same happens to the power consumption, which increases from about 118 mW to 138 mW, an increase of about 17% in comparison to the best implementation. Although there is still an increase in the resulting timing, it is not as big as for the small 8-bit divider.

Figure 1: Results for the 8-bit divider (x-axis: number of CLB columns)

Furthermore, we investigated an FIR filter (Finite Impulse Response), which is common in DSP applications [3]. It consists of a delay bank (filter taps) and a sum-of-products, which is pipelined with a register after every multiplier and adder. The filter has an order of 3 and an input and coefficient precision of 8 bits. Together with the bus interface it consumes 306 slices, which fit into two CLB columns. This two column implementation has an internal

fragmentation of only 4.4%, which is even less than the internal fragmentation of the two column 16-bit divider implementation (15.3%). Due to the high slice utilization it was questionable if such a design would still be routable, although the structure of an FIR filter is rather one-dimensional and therefore is expected to suit a column-wise compaction. The same area constraints as for the 16-bit divider were used and the two column filter could be routed completely. The resulting worst path delays for the 3 column up to the 120 column implementations range from 12.3 ns to 13.9 ns, but again the timing for the narrowest implementation increases by about 48% to 18.2 ns (see table 1). The power consumption for the two column implementation increases by about 9% to 601 mW in comparison to the best implementation (45 columns). Although the FIR filter has higher slice utilization in the 2 column implementation than the 16-bit divider, it results in a lower increase in timing and power consumption. This can be explained by the two-dimensional structure of the divider mentioned above, which does not suit a column-wise implementation.

2.2. Medium Designs

The first design of the medium-sized category is a 32-bit divider. Again, this divider is similar to the ones before, except that the data width has been increased to 32 bits. This leads to a design size of 844 slices which can be mapped to 6 CLB columns, resulting in a module size of 960 and a minimum internal fragmentation of 12.1%. As the design size increases, the negative effect on timing and power consumption decreases. For this 32-bit divider the timing for the narrowest (6 columns) implementation increases only by 22% in comparison to the best implementation (20 columns) and the power consumption increases by 11% (see table 1).

The task of a Digital Linear Controller is to influence the dynamic behaviour of a plant. Therefore, the controller has to perform several matrix multiplications and additions in order to compute the next output out of the inputs and the inner states. The hardware implementation we used performs these calculations extensively in parallel [4]. The investigated controller has three inputs of 16-bit width, two inner states and three outputs of 32-bit width. The resulting implementation consumes 1055 slices, which fit into 7 CLB columns. This narrowest implementation leads to an internal fragmentation of only 5.8%. The best timing for the controller was 17.3 ns (70 columns) and again the worst timing of 19 ns resulted from the narrowest implementation (7 columns). However, this time the increase was only 10.6%. The power consumption increased from 794 mW (120 columns) to 873 mW, which is only 9.9% (see table 1).

The next design we investigated is a Rijndael Encryption Block. The Rijndael algorithm was selected by the U.S. National Institute of Standards and Technology as the Advanced Encryption Standard (AES), which is used for securing sensitive material. The Verilog hardware description of the Rijndael encryption algorithm has also been taken from an OpenCores project [3]. This Rijndael implementation can perform a complete encrypt sequence in 12 clock cycles. The synthesized Rijndael encryption block including the bus interface consumes 2120 slices, which fit into 14 XCV2000E CLB columns. The resulting internal fragmentation for such an implementation is only 5.4%. The increase of the resulting timing for the narrowest implementation (20.5 ns) in comparison to the best implementation (18.3 ns) does not exceed 12%. The power consumption increased by 6% from 2508 mW for the 120 column implementation to 2658 mW for the 14 column implementation (see also table 1).

2.3. Large Designs

In the large designs category we have analysed two designs. The first one is an accelerator for octree based 3D graphics called Octchip [5]. The internal structure of this design is comparable to a CPU; it consists of an instruction and decode unit, a large register block and several execution state machines. One speciality is the so called Bitblock unit, which performs several binary operations on two or three 64-bit registers. The hardware implementation of these operations (copy, move, flip, rotate, and, or etc.) results in a large block of multiplexers and
Table 1: Results of the design study
Columns: 8-bit Divider | LDPC Decoder (1) | 16-bit Divider | FIR Filter | 32-bit Divider | Linear Controller | AES Rijndael | Octchip | S-Core CPU

(1) could not be routed completely. Abbreviations: Inc. = increase, Nar. = narrowest, sec. = second
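
The internal fragmentation values quoted in the text can be re-derived from the slice counts and the 160 slices per CLB column of the XCV2000E. The following is a minimal sketch, assuming the fragmentation is one minus the ratio of design size to module size (an interpretation that reproduces the quoted percentages). The function and variable names are illustrative and do not come from the paper or from any Xilinx tool.

```python
import math

# Slices per full-height CLB column on the XCV2000E, as quoted in the text.
SLICES_PER_COLUMN = 160


def min_columns(design_slices, slices_per_column=SLICES_PER_COLUMN):
    """Smallest number of full-height CLB columns that can hold the design."""
    return math.ceil(design_slices / slices_per_column)


def internal_fragmentation(design_slices, columns,
                           slices_per_column=SLICES_PER_COLUMN):
    """Fraction of the reserved slices that stays unused.

    Assumption: F = 1 - design size / module size, where the module size
    is columns * slices_per_column.
    """
    module_slices = columns * slices_per_column
    if design_slices > module_slices:
        raise ValueError("design does not fit into the reserved columns")
    return 1.0 - design_slices / module_slices


def relative_increase(best, narrowest):
    """Relative increase of a delay or power value over the best value."""
    return (narrowest - best) / best


# Design sizes in slices (including the bus interface), taken from the text.
DESIGNS = {
    "8-bit divider": 95,
    "LDPC decoder": 148,
    "16-bit divider": 271,
    "FIR filter": 306,
    "32-bit divider": 844,
    "Linear controller": 1055,
    "AES Rijndael": 2120,
    "Octchip": 3778,
    "S-Core CPU": 5730,
}

for name, slices in DESIGNS.items():
    cols = min_columns(slices)
    frag = internal_fragmentation(slices, cols)
    print(f"{name}: {cols} column(s), F = {100 * frag:.1f}%")
```

Under this assumption the loop gives 1, 1, 2, 2, 6, 7, 14, 24 and 36 columns and fragmentation values that agree with the figures in the text (for example 7.5% for the LDPC decoder and 12.1% for the 32-bit divider). `relative_increase(9.2, 17.9)` evaluates to roughly 0.946, in line with the increase of about 94% reported for the single column 8-bit divider.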

wires. The design size of the whole Octchip including the bus interface is 3778 slices. Consequently, the design can be mapped to 24 CLB columns, leading to an internal fragmentation of 1.61%. This means only 62 slices out of 3840 are not occupied. In this case the timing of the narrowest implementation (39.1 ns) was even 2% better than the 120 column implementation, but still 2.3% worse than the best implementation (cf. table 1). The power consumption increased from 1944 mW (55 columns) to 1995 mW, which is only about 3%.

Finally, the biggest design we investigated is our own 32-bit RISC processor S-Core [6]. The S-Core is a one-address machine with a load/store architecture, two banks of 16 32-bit registers, and a three-stage pipeline. An implementation of the S-Core VHDL description consumes 5730 slices in a Virtex XCV2000E FPGA. Consequently, at least 36 CLB columns are necessary, leading to a minimum internal fragmentation of only 0.5%. For this design we recognized the least effects on the resulting timing and power consumption when implementing it in the most compact way (36 columns). The timing for the narrowest implementation (55.9 ns) increased by only 1% in comparison to the whole-chip implementation and it increased only by 4.3% in comparison to the best implementation (45 columns). Regarding power consumption, no increase could be recognized in comparison to the whole-chip implementation. Throughout all implementations, the power consumption ranges from 994 mW to 1050 mW.

Figure 2: FPGA Editor screenshot

Figure 2 shows an FPGA Editor screenshot of the horizontal communication infrastructure, a 36 column S-Core implementation and a 2 column FIR filter implementation.

3. Summary/Conclusion

One basic requirement for all reconfigurable system approaches that are based on a one-dimensional module placement is an efficient implementation of the dynamically exchangeable modules in a column-wise manner. Therefore we investigated the effect of a column-wise module implementation on the resulting frequency and power consumption.

As expected, we recognized an increase of the resulting worst path delay and power consumption for narrow implementations of very small designs. The highest timing increase of 94% was determined for a one column implementation of the 8-bit divider. However, this divider consumes almost 60% of the slices available in one column and amazingly it was still completely routable. The increase in power consumption was always below 20% for all narrow implementations of the small designs. The main reason for the increase in timing and power consumption is the extreme aspect ratio of these narrow implementations, which leads to a high switch box and routing resource utilization as well as long signal wires. Spending an extra column for these small designs heavily increases the available resources, which leads to a higher internal fragmentation, but also to a much better timing and power consumption (increase smaller than 10%). However, this problem disappears when these small designs become part of a bigger design, which is more likely.

The negative effects of column-wise compaction were much smaller for medium-sized designs. The worst timing was always less than 25% (mostly below 15%) higher than the respective best timing. The increase of power consumption was below 12% for all designs.

Finally, for the biggest designs (Octchip and S-Core) the negative influence of a compact implementation almost disappeared completely. Both the increase of timing and of power consumption were always below 5%. Evidently, the bigger the design size, the smaller the effects on the resulting timing and power consumption. However, the structure of the design can have a great effect as well, as could be seen when comparing the results of the 16-bit divider and the FIR filter.

All in all the results of the study are very promising: almost all designs were routable and the effects on the timing and power consumption are either reasonable or non-existent.

4. References

[1] H. Kalte, M. Porrmann, and U. Rückert. System-on-programmable-chip approach enabling online fine-grained 1D-placement. In Proc. of RAW 2004, Santa Fe, New Mexico, 2004.
[2] Xilinx Application Notes 290. Two Flows for Partial Reconfiguration: Module Based or Small Bit Manipulations. 2002. http://www.xilinx.com.
[3] OpenCores: http://www.OpenCores.org
[4] K. Danne, C. Bobda and H. Kalte. Run-time Exchange of Mechatronic Controllers Using Partial Hardware Reconfiguration. In Proc. of FPL2003, Lisbon, Sept. 2003.
[5] H. Kalte, M. Porrmann, U. Rückert. Using a Dynamically Reconfigurable System to Accelerate Octree Based 3D Graphics. In Proc. of the PDPTA, Las Vegas, Nevada, 2000.
[6] D. Langen, J.-C. Niemann, M. Porrmann, H. Kalte, U. Rückert. Implementation of a RISC Processor Core for SoC Designs - FPGA Prototype vs. ASIC Implementation. IEEE Workshop Heterogeneous Reconfigurable Systems on Chip (SoC), Hamburg, Germany, 2002.
