
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 5, June 2018

An Approach to LUT Based Multiplier for Short Word Length DSP Systems
Tayab D Memon, Aneela Pathan

1 Department of Electronic Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan, [email protected]
2 Department of Electronic Engineering, Quaid-e-Awam University College of Engineering, Science and Technology (QUCEST), Larkana, Pakistan, [email protected]
Institute of Information and Communication Technology, Mehran University of Engineering and Technology, Jamshoro, Pakistan

Abstract: Short Word Length (SWL) DSP systems offer good performance because they process little data per sample, typically up to three bits. SWL systems may be designed using FPGAs, which come with many built-in primitives such as look-up tables, flip-flops, additional carry logic, memories and DSP elements. All these primitives give alternative approaches to FPGA-based system design. This paper presents a way to use the look-up tables to design a three-bit (3×3) constant coefficient unsigned integral multiplier for Short Word Length DSP systems. In addition, the feasibility of using block RAM and DSP elements for the Short Word Length DSP system (multiplier) is examined as an alternative implementation approach. The results suggest that the proposed approach is the better one when compared with the other two implementations.

Keywords: Built-in Core; Combinational Logic Blocks; Multiplier; FPGA.

I. INTRODUCTION
FPGAs consist of basic building blocks, including look-up tables (LUTs), flip-flops (FFs) and additional carry logic, as well as more complex on-chip blocks such as memories (e.g., block RAMs), DSP elements (e.g., multipliers) and high-speed transceivers [1].
Along with the power that an FPGA consumes and the path delay observed, the overall resource usage is also an important measure of hardware cost [2].
Built-in primitives generally give better-optimized performance than implementations on the LUTs of configurable logic blocks (CLBs) [3, 4]. Nevertheless, CLBs play an important role in many implementations when an application needs more built-in resources than the FPGA provides. In system-on-chip (SoC), network-on-chip (NoC), or even large DSP system designs, constraints arise if the required resources exceed the available limit [5].
Moreover, it is convenient to use LUTs for Short Word Length (SWL) DSP systems, where the data is represented in three bits, because a design based on the more complex primitives may result in resource wastage.
In SWL systems, the primary component is the sigma-delta modulator (SDM), which operates above the Nyquist rate at a high oversampling ratio (OSR) and was first reported in [6] for FIR filter coefficient optimization. Recently, much work has been reported on reducing logic utilization in general-purpose DSP algorithms through a reduction in multiplier complexity, known as short word length (SWL) DSP systems, including LMS-like algorithms [7-9].

Besides SWL, the literature reveals various implementations of conventional DSP systems, especially the multiplier, using the built-in DSP elements (DSP48) and block RAM.

For example, in [10] the authors present an FPGA-based hardware architecture for a floating-point multiplier (single precision (SP), double precision (DP), double-extended precision (DEP) and quadruple precision (QP) implementations). In that design the authors use the 25x18 DSP48E blocks available on the Xilinx Virtex-5 for the mantissa multiplication.
Similarly, in [11] the authors propose an FPGA-based hardware architecture for quadruple precision (QP) division arithmetic. Here the mantissa division uses a series expansion methodology integrated with a wide integer multiplier (further optimized for FPGA implementations) that makes efficient use of the built-in DSP blocks.
In [12] the authors use DSP blocks in a compact FPGA-based microcoded coprocessor for exponentiation in asymmetric encryption.
The memory-based approach, which yields faster output, uses memories (RAMs, ROMs) or look-up tables (LUTs) that store pre-computed values, which are read out to perform the multiplication [13]. The constant coefficient multiplier method and the well-known distributed arithmetic are the prime examples of memory-based multiplier methods [14]. For example, Xilinx in [15] has shown an efficient technique to implement fixed coefficient multipliers using the look-up tables (LUTs) of FPGAs.
This paper presents a way to use the look-up tables (configured as memory blocks) to design a three-bit (3×3)

978-1-5386-5689-1/18/$31.00 ©2018 IEEE 276


The 2018 International Conference on Signals and Systems (ICSigSys)

constant coefficient unsigned integral multiplier for Short Word Length DSP systems. In addition, the feasibility of using block RAM and DSP elements for the Short Word Length DSP system (multiplier) is examined as an alternative implementation approach.
As the design is carried out for Short Word Length DSP systems, the length of the multiplier and the multiplicand is set to three bits (3×3), which fixes them in the range 0-7 ($2^3 = 8$ values). Furthermore, the coefficients are set to be unsigned integers, hence resulting in positive product values.

The major difference between the conventional constant coefficient memory-based multiplier and the proposed one is the amount of memory consumed. In the conventional (3×3) design, for example, the product values for each constant multiplier (0-7) would be pre-calculated and stored in eight block memories, whereas the proposed design consumes only one LUT-based memory module. This LUT-based memory holds only the product values of the fixed coefficient multiplier 2; for other coefficients, the same product values are modified at the output as per the proposed algorithm steps.

II. CONVENTIONAL AND PROPOSED ALGORITHMS
In the subsequent paragraphs, the conventional and proposed algorithms are described.

A. Conventional Memory Based Multiplier Operation
Let X be a positive binary number with word-length L; there can be $2^L$ possible values of X and, accordingly, $2^L$ possible values of the product $C = A \cdot X$. Therefore, for memory-based multiplication, a LUT of $2^L$ words, consisting of pre-computed product values corresponding to all possible values of X, is conventionally used.

Fig.1. Typical approach to memory based multiplier

The product-word $A \cdot X_i$ is stored at the location whose address is the same as $X_i$ for $0 \le X_i \le 2^L - 1$, such that if the L-bit binary value of $X_i$ is used as the address for the LUT, then the corresponding product value $A \cdot X_i$ is available as its output [16-18].

TABLE I. DECIMAL AND BINARY REPRESENTATION OF THE DATA

Decimal Value   BR4BD     Decimal Value   BR4BD
0               0000      8               1000
1               0001      9               1001
2               0010      10              1010
3               0011      11              1011
4               0100      12              1100
5               0101      13              1101
6               0110      14              1110
7               0111      15              1111

BR4BD: Binary representation of 4 bit data

In the approach described above, the memory holds the product values of a constant multiplier X. If X is kept changeable or adaptive, the total memory needed depends on the values that the variable X may take. For example, in our design X can run from 0 to 7, hence needing eight different memories.

Furthermore, in binary representation, new data may be obtained easily by just shifting the bits either left or right. Another advantage of base-2 systems is that any value is doubled simply by appending a zero as the LSB, as shown in Table I: moving from 2 to 4 in decimal is achieved in binary by appending a zero to the end of the base-2 representation of 2; the same holds for moving from 3 to 6, 6 to 12, 5 to 10, and so on.

This aspect gives us the opportunity to design memory-based, area-optimized systems, especially multipliers, because storing the pre-calculated product values of a constant multiplier such as 2 or 3 makes it possible to obtain the product values of higher multiples of that constant multiplier. This scheme works very well for some data, but the issue is with numbers for which no such common factor is available, for example 5, 7, 9 and 11.

B. Proposed Algorithm
In the proposed algorithm, decimal 2 is taken as the least multiple of all the data.

TABLE II. PRODUCT VALUE FOR MULTIPLIER 2 AND MULTIPLICANDS 0-7

Multiplier   Multiplicand   Product Value
2            0              0
2            1              2
2            2              4
2            3              6
2            4              8
2            5              10
2            6              12
2            7              14

Hence, the pre-calculated product values for the constant multiplier 2 are stored in memory, and the products of all other multipliers (0, 1, 3, 4, 5, 6 and 7) are obtained by modifying the output using two combinational functions: NOT and concatenation.

Fig.2. Block diagram of LUT based memory to store the product value
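For comparison, the conventional scheme of Section II-A, with one table of $2^L$ pre-computed words per constant multiplier, might be modelled in software roughly as follows. This is only an illustrative sketch: the paper gives no code, and the names `build_lut`, `MEMORIES`, `conventional_multiply` and `double_by_appending_zero` are hypothetical. The last helper simply mirrors the observation that appending a zero LSB doubles a value.

```python
# Illustrative sketch only; all names here are hypothetical, not from the paper.
WORD_LENGTH = 3  # L: word length of the input X (3 bits in the 3x3 case)


def build_lut(constant, word_length=WORD_LENGTH):
    """Pre-compute constant * x for every address x in 0 .. 2**word_length - 1."""
    return [constant * x for x in range(2 ** word_length)]


# One memory per constant multiplier 0-7: eight memories of eight words each.
MEMORIES = {a: build_lut(a) for a in range(2 ** WORD_LENGTH)}


def conventional_multiply(constant, x):
    """Read the product from the memory selected by the constant, at address x."""
    return MEMORIES[constant][x]


def double_by_appending_zero(value):
    """Double a value by appending a zero LSB to its binary representation."""
    return int(format(value, "b") + "0", 2)


if __name__ == "__main__":
    print(conventional_multiply(5, 7))
    print(double_by_appending_zero(6))
    print(sum(len(memory) for memory in MEMORIES.values()))
```

In this model the conventional approach stores 64 words in total, against the 8 words of the single multiplier-2 table used by the proposed design.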


In Fig.2, the LUT-based memory storing the pre-calculated product values for constant multiplier 2 is shown. The memory address, represented by W, is 3 bits wide, giving a maximum of 8 address locations (the $2^n$ approach). Each stored product value is four bits wide, represented by L, making a total of 32 bits for the 8 addresses.

For 3x3 multiplication the output should be 6 bits wide (W+L). In our approach, however, only 4 bits are needed, as shown in Fig.2. Therefore, up to 2 bits can be appended at the end to obtain the required output. The proposed algorithm for this type of multiplier is given below.

Algorithm

This algorithm starts by storing in the LUT memory the product values for multiplier 2 and multiplicands 0, 1, 2, 3, 4, 5, 6 and 7. The address is the value of the multiplicand.
Step 1: If the multiplier is 2 and the multiplicand is any value in the range 0-7, then the product value stored at the memory location defined by the multiplicand is the net output. If the multiplier is other than 2 and the multiplicand is in the range 0-7, look in the memory again to see whether the required output is already stored. If so, take the output from that address.
Example: suppose the multiplier is 2 and the multiplicand is 5; then 2x5=10, and 10 $(1010)_2$ is already stored at memory location 5. If the multiplier is not 2 but the product value is still available in memory (e.g., 3x4=12 $(1100)_2$, already stored at location 6), then simply output the product value stored at that location.
Step 2: If the multiplier is other than 2 and the value is not available in memory, look for a nearby value (one less or one greater than the expected product) and flip the last bit.
Example: 3x5=15, and this value is not available in memory, so take the nearby value 14 $(1110)_2$ and flip the last bit to turn 14 $(1110)_2$ into 15 $(1111)_2$.
Step 3: If the nearby value is not present, look for any least factor of the product value and append bit(s) at the end.
For example, 4x5=20. Here 10 $(1010)_2$ is the least factor of 20 $(10100)_2$. So take the output 10 available at memory location 5 and append a zero at the end to get the double of 10, that is, 20.
Similarly, suppose the multiplier is 3 and the multiplicand is 7. Here the product would be 21 $(10101)_2$, but 21 is not available in memory. It can be obtained by taking 10 $(1010)_2$ at the output and then appending a 1 at the end. This transforms 10 $(1010)_2$ into 21 $(10101)_2$.
Step 4: If the above steps do not give the required output, append two bits at the end to get the required data. For example, in the case of 7x5 the resulting product value is 35 $(100011)_2$, and when we append $(11)_2$ to the end of 8 $(1000)_2$, we easily get the output 35 $(100011)_2$.

Flow diagram

The flow diagram for the addressing scheme and combinational logic is given below:

Fig.3. Flow diagram of proposed algorithm

III. SYSTEM DESIGN IN FPGA

Three FPGA-based designs and implementations are discussed as follows:

Proposed LUT Based Design

In Fig.4, the FPGA-based system design is shown. It can be observed that the circuit consists of 4 functional elements: memory module, multiplexer, NOT gate and concatenation operator.

Fig.4. FPGA based system design

A. The Memory Module
The look-up table based memory has a capacity of 32 bits. The 8 product values of constant multiplier 2 are stored at the 8 memory locations of the look-up table. The input to this memory module is a 3-bit address. The address is the multiplicand, selected from the range 0 to 7 in the case of the 3×3 multiplier.
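As a rough behavioural illustration, the memory module and the four algorithm steps of Section II-B could be modelled in software along the following lines. This is a sketch under stated assumptions, not the authors' implementation: the names `LUT` and `lut_multiply` are hypothetical, and the expected product is computed with ordinary multiplication only to decide which step applies (in hardware that choice would be fixed selection logic). Under this literal reading, operand pairs for which none of the four steps applies return `None` rather than a guessed value.

```python
# Behavioural sketch only; names are hypothetical and the step selection
# below is one possible reading of the algorithm text.
LUT = [2 * m for m in range(8)]  # eight 4-bit words: products of multiplier 2


def lut_multiply(multiplier, multiplicand):
    """Return (product, step) built from a LUT word, or None if no step applies."""
    if not (0 <= multiplier <= 7 and 0 <= multiplicand <= 7):
        raise ValueError("operands must be 3-bit unsigned values (0-7)")
    expected = multiplier * multiplicand  # assumption: used only to pick a step

    # Step 1: the product itself is already stored in the memory.
    if expected in LUT:
        return LUT[LUT.index(expected)], 1

    # Step 2: a neighbouring value is stored; flip its last bit (NOT gate).
    if (expected ^ 1) in LUT:
        return LUT[LUT.index(expected ^ 1)] ^ 1, 2

    # Step 3: a stored word followed by one appended bit gives the product.
    if (expected >> 1) in LUT:
        word = LUT[LUT.index(expected >> 1)]
        return (word << 1) | (expected & 1), 3

    # Step 4: a stored word followed by two appended bits gives the product.
    if (expected >> 2) in LUT:
        word = LUT[LUT.index(expected >> 2)]
        return (word << 2) | (expected & 3), 4

    return None  # not reachable with the four steps as read here


if __name__ == "__main__":
    for a, b in [(2, 5), (3, 4), (3, 5), (4, 5), (3, 7), (7, 5)]:
        print(a, b, lut_multiply(a, b))
```

For instance, `lut_multiply(3, 7)` reads the stored word 10 and appends a 1 to give 21, and `lut_multiply(7, 5)` reads the stored word 8 and appends the two bits 11 to give 35, matching the worked examples in Steps 3 and 4.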
B. The Multiplexer
In the design, two multiplexers are used. The function of the first multiplexer is to select between the pre-calculated product and


the bit-flipped value (when the nearby value of the product is present). The second multiplexer is used if the algorithm reaches its third step, i.e., when bit flipping does not work and bit concatenation is required in order to double the data.

C. The Not Gate
Of the two combinational logic functions, one is the NOT gate. Its function in this design is simply to flip the last bit of the data taken from the LUT-based memory. Flipping the last bit has the effect of adding one to the memory output data.

D. The Concatenation Operator
The second combinational operation is concatenation. As discussed above, for the 3x3 multiplier the final output is 6 bits wide, while the word length of the data stored in the memory is 4. So, to bring the final count to 6 bits, at most 2 bits can be appended to the end of the data taken from the memory, making the data double or even triple when required.

Block RAM based design
In the block RAM based design, a total of six block RAMs were instantiated to store the product values for each constant multiplier 2, 3, 4, 5, 6 and 7. For a zero multiplier the output is zero for any multiplicand, and for a multiplier of one the product equals the multiplicand. Hence, these two conditions were implemented using logic.

DSP48 based design
In this design strategy, the DSP48 block was instantiated to perform the required product. Here, contrary to the above two implementations, the DSP48 was not restricted to carrying out three-bit multiplication.

IV. SIMULATION RESULTS
In this work, three different three-bit (3×3) constant coefficient unsigned integral multipliers were designed: two using FPGA built-in primitives, (a) block RAM and (b) a DSP48-based multiplier, and a third with the proposed LUT-based implementation.
The designs were carried out on a Xilinx Spartan 6 FPGA using Xilinx ISE 13.2. The implementation results of the three strategies are given in Table III.
In the design of the block RAM multiplier, a total of 8 memory modules are logically needed for the 3x3 multiplier (considering 0, 1, 2, 3, 4, 5, 6, 7 as constant multipliers), but the design consists of only 6 memory modules because, for the cases of 0 and 1, the product equals the multiplier and the multiplicand, respectively.

TABLE III. RESULTS OF FPGA BASED DESIGNED MULTIPLIERS

Primitives   LUTs   DSP   RAM   Mux   Max. delay (ns)   Freq. (MHz)
BRAM         3      0     6     0     4.372             228.72
DSP48        0      1     0     0     9.173             109.01
Proposed     9      0     0     85    3.648             274.12

Table III presents the simulation results of the designs. It is clear from the results that the proposed algorithm works very well in contrast to the block RAM based design, because in that approach more memory elements are needed even for small bit-width multipliers. For the 3x3 multiplier, six block RAMs are required to store all possible product results, while our design consumes no block RAM. Furthermore, the block RAM based design is also less efficient in terms of the frequency achieved (228.72 MHz against 274.12 MHz).
The same holds for the DSP48 based design. Although this implementation consumes only one block, it results in resource wastage: the DSP48 can work on large operands (18 bits long), but for small operands (as in our case of 3 bits) the same resources are still utilized.
Another important parameter is the delay observed in the DSP48 based design (9.173 ns), which is much larger than that of the proposed design (3.648 ns).
On the other hand, the proposed design consumes 9 LUTs and a significant number of multiplexers (85), yet it still offers a higher clock frequency and lower delay than the other two designs.

V. CONCLUSION
This work presents three different implementations of a three-bit (3×3) constant coefficient unsigned integral multiplier for Short Word Length DSP systems. All three designs were carried out using the built-in primitives of the Xilinx Spartan 6 FPGA. The block RAM based design has an issue with word length: as the number of bits of the multiplier and multiplicand increases, the total memory increases.

In short word length systems, the concern with a memory-based design is the minimum amount of memory that can be configured in a given FPGA. For example, the 32-bit memory needed to store the product values of constant multiplier 2 (considering eight multiplicands, i.e., 0-7) would occupy no less than a 9K block RAM (the minimum configurable block memory in the Spartan 6 FPGA), hence resulting in wastage of unused memory.
Similarly, as mentioned before, the DSP48 can be used effectively for larger multiplications; hence, using it in a Short Word Length based system would result in inefficient resource utilization.
So, using memory-based multiplication or the DSP48 for Short Word Length systems is considered less feasible. Consequently, the choice is to use customized (LUT-based) implementations, such as the one proposed here.
The proposed multiplication algorithm may be used in any area of DSP, besides Short Word Length processing.
Future work is to evaluate this multiplier by incorporating it in FIR filter and adaptive filter designs.


References
[1] C. Shi, J. Hwang, S. McMillan, A. Root, and V. Singh, "A system level resource estimation tool for FPGAs," Field Programmable Logic and Application, pp. 424-433, 2004.
[2] M. R. Singh and A. Rajawat, "A Review of FPGA-based design methodologies for efficient hardware Area estimation," IOSR Journals (IOSR Journal of Computer Engineering), vol. 1, pp. 1-6.
[3] C. Lavin, M. Padilla, S. Ghosh, B. Nelson, B. Hutchings, and M. Wirthlin, "Using hard macros to reduce FPGA compilation time," 2010, pp. 438-441.
[4] A. Palchaudhuri and R. S. Chakraborty, "A Fabric Component Based Approach to the Architecture and Design Automation of High-Performance Integer Arithmetic Circuits on FPGA," in Computational Intelligence in Digital and Network Designs and Applications, ed: Springer, 2015, pp. 33-68.
[5] A. Corporation, "AN 584: Timing Closure Methodology for Advanced FPGA Designs," 2014.12.19.
[6] N. Benvenuto, L. Franks, and F. Hill Jr, "Dynamic programming methods for designing FIR filters using coefficients -1, 0 and +1," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 34, pp. 785-792, 1986.
[7] A. Z. Sadik and Z. M. Hussain, "Short word-length LMS filtering," 2007, pp. 1-4.
[8] T. D. Memon, P. Beckett, and A. Z. Sadik, "Power-area-performance characteristics of FPGA-based sigma-delta fir filters," Journal of Signal Processing Systems, vol. 70, pp. 275-288, 2013.
[9] A. C. Thompson, Techniques in Single-Bit Digital Filtering: RMIT University, 2004.
[10] M. K. Jaiswal and H. K.-H. So, "DSP48E efficient floating point multiplier architectures on FPGA," in VLSI Design and 2017 16th International Conference on Embedded Systems (VLSID), 2017 30th International Conference on, 2017, pp. 1-6.
[11] M. K. Jaiswal and H. K.-H. So, "Architecture for quadruple precision floating point division with multi-precision support," in Application-specific Systems, Architectures and Processors (ASAP), 2016 IEEE 27th International Conference on, 2016, pp. 239-240.
[12] L. Rodriguez-Flores, M. Morales-Sandoval, R. Cumplido, C. Feregrino-Uribe, and I. Algredo-Badillo, "A compact FPGA-based microcoded coprocessor for exponentiation in asymmetric encryption," in Circuits & Systems (LASCAS), 2017 IEEE 8th Latin American Symposium on, 2017, pp. 1-4.
[13] K. Shanthi and N. Nagarajan, "High speed and area efficient FPGA implementation of FIR filter using distributed arithmetic," Journal of Theoretical and Applied Information Technology, vol. 62, 2014.
[14] H. P. Singh, R. Sarin, and S. Singh, "Implementation of high speed FIR filter using serial and parallel distributed arithmetic algorithm," International Journal of Computer Applications, vol. 25, pp. 26-32, 2011.
[15] K. Chapman, "Fast integer multipliers fit in FPGAs," EDN magazine's Design Ideas, www.ednmag.com, 1993.
[16] A. Srinivasalu and G. R. Reddy, "Optimization of memory based LUT Multiplier," Research and Development (IJECIERD), vol. 3, pp. 125-132, 2013.
[17] B. Jeevanarani and T. Sreenivas, "Memory-Based Realization of FIR Digital Filter by Look-Up Table Optimization," International Journal of Engineering Research and Applications, vol. 2, pp. 1003-1009, 2012.
[18] R. H. Turner and R. F. Woods, "Highly efficient, limited range multipliers for LUT-based FPGA architectures," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, pp. 1113-1118, 2004.
