Design and Implementation of LUT Optimization Using APC-OMS System
Abstract: Multiplication is a major arithmetic operation in signal processing and in ALUs. Memory-based multipliers use a look-up table (LUT) as memory for their computations. However, we do not find any significant work on LUT optimization for memory-based multiplication. A new approach to LUT design is presented in which only the odd multiples of the fixed coefficient are stored, the odd multiple storage (OMS) scheme. In addition, the antisymmetric product coding (APC) approach reduces the LUT size to half. When the APC approach is combined with the OMS technique, the two's complement operations can be simplified, since the input address and LUT output can always be transformed into odd integers, and the LUT size is thus reduced to one fourth of the conventional LUT. The proposed LUT multipliers for word size L = W = 5 bits are coded in VHDL and synthesized in Xilinx 14.2. It is found that the proposed LUT-based multiplier involves comparable area and time complexity for a word size of 5 bits.
Index Terms: Digital signal processing (DSP) chip, lookup table (LUT)-based computing, memory-based computing.
1. INTRODUCTION
Digital signal processing algorithms typically require a large number of mathematical operations to be performed quickly and repetitively on a set of data. Signals are constantly converted from analog to digital, manipulated digitally, and then converted again to analog form, as diagrammed below. Many DSP applications have constraints on latency; that is, for the system to work, the DSP operation must be completed within some fixed time, and deferred processing is not viable.
Digital signal processing: In order to meet such criteria, memory-based computation plays a vital role in DSP (digital signal processing) applications.
IJCERT2014 470
www.ijcert.org
ISSN (Online): 2349-7084
GLOBAL IMPACT FACTOR 0.238
ISRA JIF 0.351
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING IN RESEARCH TRENDS
VOLUME 1, ISSUE 6, DECEMBER 2014, PP 470-479
FILTER DESIGNING:
Finite impulse response (FIR) digital filters are widely used as a basic tool in various signal processing and image processing applications. The order of an FIR filter primarily determines the width of the transition band, such that the higher the filter order, the sharper the transition between a pass-band and the adjacent stop-band. Many applications in digital communication (channel equalization, frequency channelization), speech processing (adaptive noise cancelation), seismic signal processing (noise elimination), and several other areas of signal processing require large-order FIR filters. Since the [...] which may exceed 90% of total SoC content. It has also been found that the transistor packing density of SRAM is not only high, but also increasing much faster than the transistor density of logic devices.

1.1 BINARY MULTIPLICATION:
Multiplication in binary is similar to its decimal counterpart. Two numbers A and B can be multiplied by partial products: for each digit in B, the product of that digit and A is calculated and written on a new line, shifted leftward so that its rightmost digit lines up with the digit in B that was used. The sum of all these partial products gives the final result.
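The partial-product procedure can be illustrated with a minimal Python sketch. The function name `multiply_by_partial_products` is hypothetical and is not part of the paper's VHDL design; the sketch assumes non-negative integer operands.

```python
def multiply_by_partial_products(a: int, b: int) -> int:
    """Multiply two non-negative integers by summing shifted partial products."""
    if a < 0 or b < 0:
        raise ValueError("operands are assumed to be non-negative")
    result = 0
    shift = 0
    while b:
        if b & 1:
            # This bit of B is 1, so the partial product is A aligned to the bit position.
            result += a << shift
        # A 0 bit contributes an all-zero partial product, so nothing is added.
        b >>= 1
        shift += 1
    return result
```

For example, `multiply_by_partial_products(13, 11)` adds 13, 13 << 1 and 13 << 3, one term for each set bit of 11 (binary 1011).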
[...] and a 50% saving in area-delay product over the corresponding CSD multipliers.
1.3 ANTI-SYMMETRIC PRODUCT CODING:
Anti-symmetric product coding (APC) is a technique for LUT-based multiplication that reduces the size of the conventional LUT by 50%. It relies on the antisymmetry of the product values under two's complement of the input. For simplicity of presentation, we assume both X and A to be positive integers. The product words for different values of X for L = 5 are shown in Table I. It may be observed in this table that the input word X on the first column of each row is the two's complement of that on the third column of the same row. In addition, the sum of product values corresponding to these two input values on the same row is 32A. Let the product values on the second and fourth columns of a row be u and v, respectively. Since one can write u = [(u + v)/2 - (v - u)/2] and v = [(u + v)/2 + (v - u)/2], for (u + v) = 32A we have

u = 16A - (v - u)/2 and v = 16A + (v - u)/2.

The product values on the second and fourth columns of Table I therefore have a negative mirror symmetry. This behavior of the product words can be used to reduce the LUT size, where, instead of storing u and v, only [(v - u)/2] is stored for a pair of inputs on a given row. The 4-bit LUT addresses and corresponding coded words are listed on the fifth and sixth columns of the table, respectively. Since the representation of the product is derived from the anti-symmetric behavior of the products, we can name it the anti-symmetric product code.

The APC approach, although providing a reduction in LUT size by a factor of two, incorporates substantial overhead of area and time to perform the two's complement operation of LUT output for sign modification and that of the input operand for input mapping. However, we find that when the APC approach is combined with the OMS technique, the two's complement operations could be very much simplified since the input address and LUT output could always be transformed into odd integers. However, the OMS technique in [9] cannot be combined with the APC scheme in [10], since the APC words generated according to [10] are odd numbers. Moreover, the OMS scheme in [9] does not provide an efficient implementation when combined with the APC technique. In this brief, we therefore present a different form of APC and combine it with a modified form of the OMS scheme for efficient memory-based multiplication.
The 4-bit address X' = (x3'x2'x1'x0') of the APC word is given by
X' = XL, if x4 = 1
X' = XL', if x4 = 0
where XL = (x3x2x1x0) is the four less significant bits of X, and XL' is the two's complement of XL.
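The address mapping and the split of each product around 16A can be checked with a short Python sketch for L = 5. The names `build_apc_lut`, `apc_address` and `apc_multiply` are hypothetical. The sketch also assumes that X = (00000) is handled as a special case that returns 0, since the plain 16A - word rule would otherwise give 16A for that input.

```python
def build_apc_lut(a: int) -> list[int]:
    """Store (v - u)/2 for each 4-bit address; at address k this equals k*A."""
    return [a * k for k in range(16)]


def apc_address(x: int) -> int:
    """X' = XL if x4 = 1, else the 4-bit two's complement of XL."""
    x4 = (x >> 4) & 1
    xl = x & 0xF
    return xl if x4 else (-xl) & 0xF


def apc_multiply(x: int, a: int, lut: list[int]) -> int:
    """Recover A*X from a 16-word APC table for a 5-bit unsigned X."""
    if not 0 <= x < 32:
        raise ValueError("X must be a 5-bit unsigned value")
    if x == 0:
        return 0  # assumption: X = 00000 is cleared separately
    word = lut[apc_address(x)]
    # x4 selects the sign: v = 16A + word when x4 = 1, u = 16A - word when x4 = 0.
    return 16 * a + word if (x >> 4) & 1 else 16 * a - word
```

The table holds 16 words instead of 32, at the cost of one two's complement on the address and one add or subtract on the output.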
Fig 3. Optimized implementation of the sign modification of the odd LUT output.
1.4 LUT-BASED MULTIPLICATION USING APC-OMS MODIFIED OPTIMIZATION TECHNIQUE

The APC approach, although providing a reduction in LUT size by a factor of two, incorporates substantial overhead of area and time to perform the two's complement operation of LUT output for sign modification and that of the input operand for input mapping. However, we find that when the APC approach is combined with the OMS technique, the two's complement operations could be very much simplified since the input address and LUT output could always be transformed into odd integers.
1.5 LUT COMBINED APC-OMS BASED MULTIPLICATION TECHNIQUE
Input X'   Product value   No. of shifts   Shifted input X''   Stored APC word   Address
1011       11A             0               1011                P5 = 11A          0101
1101       13A             0               1101                P6 = 13A          0110
1111       15A             0               1111                P7 = 15A          0111
The proposed APC-OMS combined design of the LUT for L = 5 and for any coefficient width W is shown in Fig. 2.4. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line address decoder, a barrel shifter, an address generation circuit, and a control circuit for generating the RESET signal and control word (s1s0) for the barrel shifter. The precomputed values of A × (2i + 1) are stored as Pi, for i = 0, 1, 2, ..., 7, at the eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for input X = (00000) at LUT address "1000," as specified in Table III. The decoder takes the 4-bit address from the address generator and generates nine word-select signals, i.e., {wi, for 0 ≤ i ≤ 8}, to select the referenced word from the LUT. The 4-to-9-line decoder is a simple modification of a 3-to-8-line decoder. The control bits s0 and s1 to be used by the barrel shifter to produce the desired number of shifts of the LUT output are generated by the control circuit, according to the relations [...]

2.1 Basic Components of LUT Optimization:
The modules contributed for the combined APC-OMS based LUT optimization technique are
[...]
6. Add/Subtractor (sign determination) module

Xin generation module (based on the antisymmetric process): A 5-bit input is given to this module. It generates the two's complement of the last 4 bits (Xin(3 to 0)) when the MSB of Xin, i.e. Xin(4), is 0, and passes the same 4 bits through unchanged when the MSB of Xin is 1. Hence only 16 combinations are produced for the 5-bit input, as in Table I.

3. IMPLEMENTATION

A barrel shifter is often implemented as a cascade of parallel 2×1 multiplexers. For a 4-bit barrel shifter, an intermediate signal is used which shifts by two bits, or passes the same data, based on the value of S[1]. This signal is then shifted by another multiplexer, which is controlled by S[0]:

im = IN, if S[1] == 0
im = IN << 2, if S[1] == 1

OUT = im, if S[0] == 0
OUT = im << 1, if S[0] == 1

The add/subtractor is used to add the intermediate results to 16A to get the final output. It may force the output to 0 when clr is high. This follows from u = [(u + v)/2 - (v - u)/2] and v = [(u + v)/2 + (v - u)/2] with (u + v)/2 = 16A.
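The two-stage barrel shifter described in the implementation section can be modeled in a few lines of Python. This is a behavioral sketch only; the function name `barrel_shift_left` and the `width` parameter (used to truncate the result to the bus width) are assumptions, not names from the paper's VHDL code.

```python
def barrel_shift_left(data: int, s1: int, s0: int, width: int = 4) -> int:
    """Left-shift `data` by 2*s1 + s0 positions using two multiplexer stages."""
    mask = (1 << width) - 1
    # Stage 1: shift by two bits or pass the data through, selected by S[1].
    im = (data << 2) if s1 else data
    # Stage 2: shift by one more bit or pass through, selected by S[0].
    out = (im << 1) if s0 else im
    # Bits shifted beyond the bus width are dropped.
    return out & mask
```

With the control word (s1s0) the two stages cover shifts of 0, 1, 2 and 3 positions, which is the range needed to rebuild an even multiple from a stored odd multiple when L = 5.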
Fig 4. 4-bit ripple-carry adder-subtracter
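Since the figure itself is not reproduced in the text, the following Python sketch assumes the standard ripple-carry adder-subtracter structure: each bit of B is XORed with a control bit, and the same control bit feeds the first carry-in, so that `sub = 1` computes A - B in two's complement. The name `ripple_add_sub` is hypothetical.

```python
def ripple_add_sub(a: int, b: int, sub: int, width: int = 4) -> tuple[int, int]:
    """Bit-serial model of an adder-subtracter; returns (result, carry_out)."""
    carry = sub  # carry-in of 1 completes the two's complement of B when subtracting
    result = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = ((b >> i) & 1) ^ sub  # invert B when subtracting
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        result |= sum_bit << i
    return result, carry
```

In the multiplier, the same structure applied at the full output width would add the shifted LUT word to 16A or subtract it, depending on the sign bit of the input.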
4. LUT APC-OMS Optimization Top Model Output

The APC approach, although providing a reduction in LUT size by a factor of two, incorporates substantial overhead of area and time to perform the two's complement operation of LUT output for sign modification and that of the input operand for input mapping. The proposed APC-OMS combined design of the LUT for L = 5 and for any coefficient width W is shown in Fig. 2.4. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line address decoder, a barrel shifter, an address generation circuit, and a control circuit for generating the RESET signal and control word (s1s0) for the barrel shifter. The precomputed values of A × (2i + 1) are stored as Pi, for i = 0, 1, 2, ..., 7, at the eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for input X = (00000) at LUT address "1000," as specified in Table III. The decoder takes the 4-bit address from the address generator and generates nine word-select signals, i.e., {wi, for 0 ≤ i ≤ 8}, to select the referenced word from the LUT.
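The data path of the top model can be sketched end to end in Python for L = 5. The function names are hypothetical. Two details are assumptions inferred from the description rather than taken from the VHDL source: the shift count is taken to be the number of trailing zeros of the APC address, and RESET is taken to force the LUT output to 0 for X = (10000).

```python
def build_oms_lut(a: int) -> list[int]:
    """Nine words: P_i = A*(2i + 1) at addresses 0..7, and 2A at address 8 ("1000")."""
    return [a * (2 * i + 1) for i in range(8)] + [2 * a]


def apc_oms_multiply(x: int, a: int, lut: list[int]) -> int:
    """Return A*X for a 5-bit X using the nine-word LUT, a shift, and one add/subtract."""
    if not 0 <= x < 32:
        raise ValueError("X must be a 5-bit unsigned value")
    x4 = (x >> 4) & 1
    xl = x & 0xF
    addr = xl if x4 else (-xl) & 0xF  # APC address X'
    if addr == 0:
        # X = 10000: RESET clears the LUT output (assumed behavior).
        # X = 00000: 2A at address "1000", shifted three times, gives 16A.
        word = 0 if x4 else lut[8] << 3
    else:
        shifts = (addr & -addr).bit_length() - 1  # trailing zeros; s1s0 in binary
        odd = addr >> shifts                      # odd address X''
        word = lut[odd >> 1] << shifts            # barrel shifter restores the even multiple
    sixteen_a = a << 4
    return sixteen_a + word if x4 else sixteen_a - word
```

A usage example: with `lut = build_oms_lut(5)`, the call `apc_oms_multiply(8, 5, lut)` reads P0 = 5, shifts it left three times to 40, and subtracts it from 80.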
Here we observe antisymmetry in the address formed by the four least significant bits: the 32 inputs 0 to 31 map onto the 16 addresses 0 to 15. Thus we halve the number of memory locations required to store coefficients. Then we store only the odd multiples in the look-up table, which halves the number of stored coefficients again. In total, the number of stored coefficients is reduced to one quarter.
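The two halving steps can be counted directly. The sketch below is illustrative only; `storage_counts` is a hypothetical helper, and the count ignores the one extra word that the nine-word design keeps for X = (00000).

```python
def storage_counts() -> tuple[int, int, int]:
    """Count table entries: conventional, after APC, and after APC plus OMS."""
    inputs = range(32)  # all 5-bit input words
    # APC: an input and its two's complement share one 4-bit address.
    addresses = {(x & 0xF) if (x >> 4) & 1 else (-(x & 0xF)) & 0xF for x in inputs}
    # OMS: strip trailing zeros so that only odd multiples need to be stored.
    odd_parts = {k >> ((k & -k).bit_length() - 1) for k in addresses if k != 0}
    return len(inputs), len(addresses), len(odd_parts)
```

Calling `storage_counts()` gives 32 conventional entries, 16 APC addresses, and 8 stored odd multiples.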
5. RTL SCHEMATIC:
6. SIMULATION RESULTS:
[...] multiplying coefficients. The proposed LUT-[...]
[...] Process., vol. 39, no. 10, pp. 723-733, Oct. 1992.
[4] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "A systolic array architecture for the discrete sine transform," IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2347-2354, Sep. 2002.
[5] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, "A memory-efficient realization of cyclic convolution and its application to discrete cosine transform," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 445-453, Mar. 2005.
[6] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1125-1137, Jun. 2005.
[7] P. K. Meher, "Systolic designs for DCT using a low-complexity concurrent convolutional formulation," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 9, pp. 1041-1050, Sep. 2006.
[8] P. K. Meher, "Memory-based hardware for resource-constrained digital signal processing systems," in Proc. 6th Int. Conf. ICICS, Dec. 2007, pp. 1-4.
[9] P. K. Meher, "New approach to LUT implementation and accumulation for memory-based multiplication," in Proc. IEEE ISCAS, May 2009, pp. 453-456.
[10] P. K. Meher, "New look-up-table optimizations for memory-based multiplication," in Proc. ISIC, Dec. 2009, pp. 663-666.