An Approach To LUT Based Multiplier For Short
5, June 2018
Abstract: Short Word Length (SWL) DSP systems offer good performance because they process less data, typically words of up to three bits. SWL systems may be designed using FPGAs, which come with many built-in primitives such as look-up tables, flip-flops, additional carry logic, memories and DSP elements. All these primitives give alternative approaches to FPGA based system design. This paper presents a way to use the look-up tables to design a three-bit (3×3) constant-coefficient unsigned integer multiplier for SWL DSP systems. In addition, the feasibility of using block RAM and DSP elements for the SWL DSP multiplier is examined as an alternative implementation approach. The results suggest that the proposed approach is the better one when compared with the other two implementations.

Keywords: Built-in Core; Combinational Logic Blocks; Multiplier; FPGA.

I. INTRODUCTION
FPGAs consist of basic building blocks, including look-up tables (LUTs), flip-flops (FFs) and additional carry logic, besides more complex on-chip blocks such as memories (e.g., block RAMs), DSP elements (e.g., multipliers) and high-speed transceivers [1].

Along with the power that the FPGA consumes and the path delay observed, the overall resource usage is also an important measure of hardware cost [2].

It is generally better to use built-in primitives, which give optimized performance when compared with implementations on the LUTs of the configurable logic blocks (CLBs) [3, 4]. Nevertheless, CLBs play an important role in many implementations when an application needs more resources than are available on the FPGA board. In system-on-chip (SoC), network-on-chip (NoC), or even large DSP system designs, constraints arise if the required resources exceed the available limit [5].

Besides, it is convenient to use the LUTs for Short Word Length (SWL) DSP systems, where the data is represented in three bits, as a design based on the more complex primitives may result in resource wastage.

In SWL systems, the primary component is the sigma-delta modulator (SDM), which operates at a sampling rate well above the Nyquist rate (a high oversampling ratio, OSR) and was first reported in [6] for FIR filter coefficient optimization. Recently, much work has been reported on reducing logic utilization in general-purpose DSP algorithms, including LMS-like algorithms, through reduction in multiplier complexity; such systems are known as short word length (SWL) DSP systems [7-9].

Besides SWL, the literature reveals various implementations of conventional DSP systems, especially the multiplier, using the built-in DSP elements (DSP48) and block RAM.

For example, in [10] the authors present an FPGA based hardware architecture for a floating point multiplier, with single precision (SP), double precision (DP), double-extended precision (DEP) and quadruple precision (QP) implementations. In that design the authors use the 25×18 DSP48E blocks available on the Xilinx Virtex-5 for the mantissa multiplication.

Similarly, in [11] the authors propose an FPGA based hardware architecture for quadruple precision (QP) division arithmetic. Here the mantissa division is carried out using a series expansion methodology integrated with a wide integer multiplier (further optimized for FPGA implementation), making efficient use of the built-in DSP blocks.

In [12] the authors use DSP blocks in a compact FPGA-based microcoded coprocessor for exponentiation in asymmetric encryption.

The memory based approach, which yields faster output, uses memories (RAMs, ROMs) or look-up tables (LUTs) that store pre-computed values which can be read out for the multiplication operation [13]. The constant coefficient multiplier method and the well-known distributed arithmetic are the prime examples of memory-based multiplier methods [14]. For example, Xilinx in [15] has shown an efficient technique to implement fixed coefficient multipliers using the look-up tables (LUTs) of FPGAs.

This paper presents a way to use the look-up tables (configured as memory blocks) to design a three-bit (3×3)
constant-coefficient unsigned integer multiplier for Short Word Length DSP systems. In addition, the feasibility of using block RAM and DSP elements for the Short Word Length DSP multiplier is examined as an alternative implementation approach.

As the design is carried out for Short Word Length DSP systems, the length of the multiplier and of the multiplicand is set to three bits (3×3), so both are confined to the range 0-7 (2^3 = 8 values). Furthermore, the coefficients are set to be unsigned integers, hence the product values are never negative.

The major difference between the conventional constant-coefficient memory based multiplier and the proposed one is the amount of memory consumed. In the conventional (3×3) design, for example, the respective product values for each constant multiplier (0-7) would be pre-calculated and stored in eight block memories. In the proposed design, by contrast, only one LUT based memory module is consumed. This LUT based memory holds only the product values of the fixed coefficient multiplier 2; for the other coefficients, the same product values are modified at the output as per the proposed algorithm steps.

II. CONVENTIONAL AND PROPOSED ALGORITHMS
The conventional and proposed algorithms are described in the following paragraphs.

A. Conventional Memory Based Multiplier Operation
Let X be a positive binary number with word length L. There can be 2^L possible values of X and, accordingly, 2^L possible values of the product C = A · X. Therefore, for memory-based multiplication, a LUT of 2^L words, consisting of pre-computed product values corresponding to all possible values of X, is conventionally used.

Fig.1. Typical approach to memory based multiplier

The product word A · X_i is stored at the location whose address is the same as X_i, for 0 ≤ X_i ≤ 2^L − 1, such that if the L-bit binary value of X_i is used as the address for the LUT, then the corresponding product value A · X_i is available at its output [16-18].

In the approach described above, the memory holds the product values for a single constant multiplier. If that multiplier is kept changeable or adaptive, the total memory needed depends on the values it may take. For example, in our design it can run from 0 to 7, hence needing eight different memories.

Furthermore, in binary representation, new data may be obtained easily by just shifting the bits either left or right. Another advantage of base-2 systems is that any value is doubled simply by appending a zero as the new LSB, as shown in Table 1: moving from 2 to 4 in decimal is achieved in binary by appending a zero to the end of the base-2 representation of 2; the same holds for moving from 3 to 6, from 6 to 12, from 5 to 10, and so on.

Decimal  BR4BD    Decimal  BR4BD
0        0000     8        1000
1        0001     9        1001
2        0010     10       1010
3        0011     11       1011
4        0100     12       1100
5        0101     13       1101
6        0110     14       1110
7        0111     15       1111
BR4BD: Binary representation of 4 bit data

This property gives us the opportunity to design memory based, area optimized systems, especially multipliers, because storing the pre-calculated product values of a constant multiplier such as 2 or 3 lets us obtain the product values for higher multiples of that constant. This scheme works very well for some data, but the difficulty lies with the numbers for which no such common factor is available, for example 5, 7, 9 and 11.

B. Proposed Algorithm
In the proposed algorithm, decimal 2 is taken as the base multiplier for all the data.

TABLE II. PRODUCT VALUE FOR MULTIPLIER 2 AND MULTIPLICANDS 0-7
Multiplier  Multiplicand  Product Value
2           0             0
2           1             2
2           2             4
2           3             6
2           4             8
2           5             10
2           6             12
2           7             14

Hence, the pre-calculated product values for the constant multiplier 2 are stored in memory, and the products for all other multipliers (0, 1, 3, 4, 5, 6 and 7) are obtained by modifying the memory output using two combinational functions: NOT and concatenation.

Fig.2. Block diagram of LUT based memory to store the product value
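The conventional look-up scheme, and the two bit-level properties that the proposed approach builds on, can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not the paper's algorithm: the function and variable names are hypothetical, and the way the two operations are combined for each individual multiplier is not modelled here.

```python
# Illustrative sketch only; all names are hypothetical.
L = 3  # word length of multiplier and multiplicand (3x3 design)


def build_constant_lut(a, word_length=L):
    """Conventional scheme: a table of 2^L words holding a*x at address x."""
    return [a * x for x in range(2 ** word_length)]


# Conventional 3x3 design: one table per constant multiplier 0..7.
conventional = {a: build_constant_lut(a) for a in range(2 ** L)}

# Storage kept in the proposed design: only the multiplier-2 table (Table II).
lut2 = build_constant_lut(2)


def append_zero(word):
    """Concatenate a 0 as the new LSB, which doubles the value."""
    return word << 1


def flip_lsb(word):
    """Invert the last bit only (NOT applied to the LSB)."""
    return word ^ 1


print(conventional[5][6])     # conventional look-up of 5 * 6
print(lut2)                   # stored products of 2
print(append_zero(lut2[3]))   # 2*3 with a zero appended equals 4*3
print(flip_lsb(lut2[3]))      # stored words are even, so flipping the LSB adds one
```

The conventional dictionary holds eight tables of eight words each, whereas the single multiplier-2 table holds eight 4-bit words, which is the memory saving the text argues for.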
Algorithm
the bit flipped value (when the nearby value of the product is present). The second multiplier is used if the algorithm reaches its third step, i.e., when bit flipping does not work and bit concatenation is required in order to double the data.

C. The Not Gate
One of the two combinational functions is the NOT gate. Its function in this design is simply to flip the last bit of the data taken from the LUT based memory. Because every stored product of 2 is even, its last bit is 0, so flipping the last bit has the effect of adding one to the memory output data.

D. The Concatenation Operator
The second combinational operation is concatenation. As discussed above, for a 3×3 multiplier the final output is 6 bits wide, while the word length of the data stored in the memory is 4 bits. To make up the final count of 6 bits, we can therefore append at most 2 bits to the end of the data taken from the memory, making the data double, and even triple, when required.

Block RAM based design
In the block RAM based design, a total of six block RAMs were instantiated to store the product values for each of the constant multipliers 2, 3, 4, 5, 6 and 7. For any multiplicand, a multiplier of zero gives an output of zero, and a multiplier of one gives a product equal to the multiplicand. Hence, these two cases were implemented using logic.

DSP48 based design
In this design strategy, the DSP48 block was instantiated to perform the required product. Here, contrary to the above two implementations, the DSP48 was not restricted to carrying out three-bit multiplication.

IV. SIMULATION RESULTS
In this work, three different three-bit (3×3) constant-coefficient unsigned integer multipliers were designed: two using FPGA built-in primitives, namely a) block RAM and b) a DSP48 based multiplier, and the third using the proposed LUT based implementation.

The designs were carried out on a Xilinx Spartan 6 FPGA using Xilinx ISE 13.2. The implementation results of the three strategies are given in Table 3.

In the block RAM multiplier, a total of 8 memory modules are logically needed for a 3×3 multiplier (considering 0, 1, 2, 3, 4, 5, 6 and 7 as constant multipliers), but the design contains only 6 memory modules because, for the constant multipliers 0 and 1, the product is simply zero and the multiplicand, respectively.

TABLE III. RESULTS OF FPGA BASED DESIGNED MULTIPLIERS
Primitives  LUTs  DSP  RAM  Mux  Max. delay (ns)  Freq. (MHz)
BRAM        3     0    6    0    4.372            228.72
DSP48       0     1    0    0    9.173            109.01
Proposed    9     0    0    85   3.648            274.12

Table 3 presents the simulation results of the designs. It is clear from the results that the proposed algorithm performs well in contrast to the block RAM based design, since in that approach more memory elements are needed even for small bit-width multipliers. In our 3×3 multiplier, for instance, six block RAMs are required to store all possible product results, while the proposed design consumes no block RAM. Furthermore, the block RAM based design also achieves a lower clock frequency (228.72 MHz against 274.12 MHz).

The case of the DSP48 based design is similar. Although this implementation consumes only one block, it results in resource wastage: the DSP48 can work on large operands (18 bits long), but the same resources are used even for small operands (3 bits long in our case).

Another important parameter is the delay observed in the DSP48 based design, which is much larger than that of the proposed design (9.173 ns against 3.648 ns).

The proposed design, on the other hand, consumes 9 LUTs and a significant number of multiplexers (85), yet it still offers a higher clock frequency and a lower delay than the other two designs.

V. CONCLUSION
This work presents three different implementations of a three-bit (3×3) constant-coefficient unsigned integer multiplier for Short Word Length DSP systems. All three designs were carried out using the built-in primitives of the Xilinx Spartan 6 FPGA. The block RAM based design has an issue with word length: as the number of bits of the multiplier and multiplicand increases, the total memory increases.

In short word length systems, the concern with memory based design is the minimum amount of memory that can be configured in a given FPGA. For example, the 32 bits of memory needed to store the product values of constant multiplier 2 (considering eight multiplicands, i.e., 0-7) would occupy no less than a 9K block RAM (the minimum configurable block memory in the Spartan 6 FPGA), hence leaving most of that memory unused.

Similarly, as mentioned before, the DSP48 can be used effectively for larger multiplications; using it in a Short Word Length based system therefore results in inefficient resource utilization.

So, using memory based multiplication or the DSP48 for Short Word Length systems is considered less feasible. Consequently, the choice is to use customized (LUT based) implementations such as the one proposed here.

The proposed multiplication algorithm may be used in any area of DSP, besides Short Word Length processing.

Future work is to evaluate this multiplier by incorporating it in FIR filter and adaptive filter designs.
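The figures reported in Table III and the 32-bit storage requirement for the multiplier-2 products can be cross-checked with simple arithmetic. The sketch below is illustrative only and rests on two assumptions that the text does not state: that the reported frequency is the reciprocal of the maximum delay, and that a 9K block RAM provides 9 × 1024 bits. The variable names are hypothetical.

```python
# Values copied from Table III: (maximum delay in ns, reported frequency in MHz).
table3 = {
    "BRAM": (4.372, 228.72),
    "DSP48": (9.173, 109.01),
    "Proposed": (3.648, 274.12),
}

for name, (delay_ns, freq_mhz) in table3.items():
    # Assumption: f_max = 1 / t_delay; the factor 1000 converts 1/ns to MHz.
    derived_mhz = 1000.0 / delay_ns
    print(f"{name}: derived {derived_mhz:.2f} MHz, reported {freq_mhz:.2f} MHz")

# Storage for the multiplier-2 products: 8 multiplicands, 4-bit words.
needed_bits = 8 * 4
# Assumption: a 9K block RAM provides 9 * 1024 bits.
bram_bits = 9 * 1024
utilisation = 100.0 * needed_bits / bram_bits
print(f"{needed_bits} of {bram_bits} bits used ({utilisation:.2f} %)")
```

Under these assumptions the derived frequencies agree with the reported ones to within about 0.01 MHz, and the multiplier-2 table would fill well under one percent of a single block RAM, which is consistent with the wastage argument made for the block RAM based design.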
References
[1] C. Shi, J. Hwang, S. McMillan, A. Root, and V. Singh, "A system level resource estimation tool for FPGAs," Field Programmable Logic and Application, pp. 424-433, 2004.
[2] M. R. Singh and A. Rajawat, "A Review of FPGA-based design methodologies for efficient hardware Area estimation," IOSR Journals (IOSR Journal of Computer Engineering), vol. 1, pp. 1-6.
[3] C. Lavin, M. Padilla, S. Ghosh, B. Nelson, B. Hutchings, and M. Wirthlin, "Using hard macros to reduce FPGA compilation time," 2010, pp. 438-441.
[4] A. Palchaudhuri and R. S. Chakraborty, "A Fabric Component Based Approach to the Architecture and Design Automation of High-Performance Integer Arithmetic Circuits on FPGA," in Computational Intelligence in Digital and Network Designs and Applications, ed: Springer, 2015, pp. 33-68.
[5] Altera Corporation, "AN 584: Timing Closure Methodology for Advanced FPGA Designs," 2014.12.19.
[6] N. Benvenuto, L. Franks, and F. Hill Jr, "Dynamic programming methods for designing FIR filters using coefficients −1, 0 and +1," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 34, pp. 785-792, 1986.
[7] A. Z. Sadik and Z. M. Hussain, "Short word-length LMS filtering," 2007, pp. 1-4.
[8] T. D. Memon, P. Beckett, and A. Z. Sadik, "Power-area-performance characteristics of FPGA-based sigma-delta FIR filters," Journal of Signal Processing Systems, vol. 70, pp. 275-288, 2013.
[9] A. C. Thompson, Techniques in Single-Bit Digital Filtering: RMIT University, 2004.
[10] M. K. Jaiswal and H. K.-H. So, "DSP48E efficient floating point multiplier architectures on FPGA," in VLSI Design and 2017 16th International Conference on Embedded Systems (VLSID), 2017 30th International Conference on, 2017, pp. 1-6.
[11] M. K. Jaiswal and H. K.-H. So, "Architecture for quadruple precision floating point division with multi-precision support," in Application-specific Systems, Architectures and Processors (ASAP), 2016 IEEE 27th International Conference on, 2016, pp. 239-240.
[12] L. Rodriguez-Flores, M. Morales-Sandoval, R. Cumplido, C. Feregrino-Uribe, and I. Algredo-Badillo, "A compact FPGA-based microcoded coprocessor for exponentiation in asymmetric encryption," in Circuits & Systems (LASCAS), 2017 IEEE 8th Latin American Symposium on, 2017, pp. 1-4.
[13] K. Shanthi and N. Nagarajan, "High speed and area efficient FPGA implementation of FIR filter using distributed arithmetic," Journal of Theoretical and Applied Information Technology, vol. 62, 2014.
[14] H. P. Singh, R. Sarin, and S. Singh, "Implementation of high speed FIR filter using serial and parallel distributed arithmetic algorithm," International Journal of Computer Applications, vol. 25, pp. 26-32, 2011.
[15] K. Chapman, "Fast integer multipliers fit in FPGAs," EDN magazine's Design Ideas, www.ednmag.com, 1993.
[16] A. Srinivasalu and G. R. Reddy, "Optimization of memory based LUT Multiplier," Research and Development (IJECIERD), vol. 3, pp. 125-132, 2013.
[17] B. Jeevanarani and T. Sreenivas, "Memory-Based Realization of FIR Digital Filter by Look-Up Table Optimization," International Journal of Engineering Research and Applications, vol. 2, pp. 1003-1009, 2012.
[18] R. H. Turner and R. F. Woods, "Highly efficient, limited range multipliers for LUT-based FPGA architectures," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, pp. 1113-1118, 2004.