
Da Ramsow

The document discusses the implementation of Distributed Arithmetic (DA) for designing FIR filters, particularly in the context of FPGA architectures. It highlights the advantages of DA in reducing hardware requirements by using lookup tables to perform multiply-accumulate operations, thus enhancing efficiency in discrete wavelet transform (DWT) applications. Additionally, it addresses challenges such as increased memory requirements with larger filters and proposes a modified architecture to optimize latency and throughput.

Uploaded by

sowmyakb

DA Scheme for DWT:

Distributed arithmetic (DA) is a signal processing technique used to design and implement FIR filters.
It is a bit-level rearrangement of the multiply-accumulate operation that hides the multiplications. DA
reduces the total hardware required for multiply-accumulate operations, which makes it suitable for FPGA
designs [21]. DA computes a sum of products using only shift and add operations, thus avoiding multipliers.
Equation 1 describes an FIR filter of length N.

$$Y = \sum_{k=0}^{N-1} A_k X_k \qquad \text{Eq. (1)}$$

where Y is the response of the network, A_k is the kth filter coefficient and X_k is the kth input variable.
DA performs multiplication using lookup-table-based schemes. Distributed arithmetic is an efficient method for
computing the inner product operation, which constitutes the core of the discrete wavelet transform. In this
section we briefly describe the mathematical derivation of the distributed arithmetic algorithm.

The mathematical derivation of distributed arithmetic is simple, a mix of Boolean and ordinary
algebra. Let the variable Y hold the result of an inner product operation between a data vector x and a
coefficient vector a. The conventional representation of the inner product operation is given as follows:

where the input data words x_i are represented in 2's complement in
order to bound number growth under multiplication. The variable x_ij is the jth bit of the word x_i, which is
Boolean, B is the number of bits of each input data word and x_i0 is the sign bit. Interchanging the order of
summation in Eq. (4), we get:

Distributed arithmetic is based on the observation that the function F_j can take only 2^N different values,
which can be pre-computed offline and stored in a look-up table. Bit j of each data word (x_ij) is then used to
address this look-up table. Eq. (5) shows that only three operations are required for
calculating the inner product: first, a look-up to obtain the value of F_j, then an addition or subtraction, and
finally a division by two, which can be realized by a shift. In its most direct form, a distributed
arithmetic computation is bit-serial in nature, i.e., each bit of the input samples must be indexed in
turn before a new output sample becomes available. When the input samples are represented with B
bits of precision, B clock cycles are required to complete an inner-product calculation. An example of a
distributed arithmetic implementation of a 4-element inner product operation is shown in Figure 1
along with the conventional implementation of the same product operation.
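The look-up, add/subtract and shift steps described above can be sketched in software. This is a minimal illustration, not the hardware of Figure 1: the function names, the integer coefficients and the 4-bit sample width are assumptions, and integer weights with left shifts stand in for the fractional scaling and right shift of the text, which changes the result only by a constant scale factor.

```python
def build_lut(coeffs):
    """Precompute the sum of the coefficients selected by each N-bit address."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_inner_product(coeffs, samples, bits):
    """Bit-serial DA: one LUT look-up and one add/subtract per input bit.

    Each sample is assumed to be a signed integer that fits in `bits` bits
    of two's complement.
    """
    lut = build_lut(coeffs)
    acc = 0
    for j in range(bits):
        # Gather bit j of every sample into a single LUT address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> j) & 1) << i
        term = lut[addr] << j
        # The top bit is the sign bit, so its partial sum is subtracted.
        acc += -term if j == bits - 1 else term
    return acc


# Hypothetical 4-element example with 4-bit samples.
coeffs = [3, -1, 4, 2]
samples = [5, -3, 7, -8]
result = da_inner_product(coeffs, samples, bits=4)
direct = sum(c * x for c, x in zip(coeffs, samples))
```

With B = 4 the loop runs four times, matching the B clock cycles the bit-serial form needs, and `result` agrees with the directly computed inner product `direct`.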
DA implementation of an FIR filter

A distributed arithmetic (DA) implementation of an FIR filter consists of a look-up table (LUT), a
cascade of shift registers and a scaling accumulator.

LUT based DA implementation

The LUT stores all possible partial products over the FIR filter coefficients. Input samples are presented
to the input parallel-to-serial shift register at the input signal sample rate. As the input sample is
serialized, the bit-wide output is presented to the bit-serial shift register cascade, 1 bit at a time. The
cascade stores the input sample history in a bit-serial format and is used in forming the required
inner-product computation. The bit outputs of the shift register cascade are used as address inputs to the
look-up table. Partial results from the look-up table are summed by the scaling accumulator to form the final
result at the filter output port. Since the LUT size in a distributed arithmetic implementation increases
exponentially with the number of coefficients, the LUT access time can become a bottleneck for the speed of
the whole system when the LUT is large. Hence we decomposed the LUT with an 8-bit address shown in
Figure 6 into two LUTs with 4-bit addresses, and added their outputs using a two-input accumulator. The modified
partitioned-LUT architecture is shown in Figure 7.
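The partitioning works because each LUT entry is a plain sum of coefficients, so the sum selected by an 8-bit address splits into the sum selected by its low nibble plus the sum selected by its high nibble. A minimal sketch, assuming hypothetical integer coefficients (the names `COEFFS`, `FULL`, `LOW` and `HIGH` are illustrative and not taken from Figures 6 and 7):

```python
def build_lut(coeffs):
    """Sum of the coefficients selected by each address bit pattern."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]


# Hypothetical 8-tap coefficient set, chosen only for illustration.
COEFFS = [3, -1, 4, 2, -5, 6, 1, -2]

FULL = build_lut(COEFFS)       # single LUT, 8-bit address, 256 entries
LOW = build_lut(COEFFS[:4])    # first partition, 4-bit address, 16 entries
HIGH = build_lut(COEFFS[4:])   # second partition, 4-bit address, 16 entries


def partitioned_lookup(addr):
    """Add the outputs of the two small LUTs for one 8-bit address."""
    return LOW[addr & 0xF] + HIGH[addr >> 4]


# The partitioned result matches the full LUT for every address.
all_match = all(partitioned_lookup(a) == FULL[a] for a in range(256))
```

Here the two small tables hold 32 entries in total instead of 256, at the cost of one extra addition per look-up.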

DA is used for computing the MAC operations that are common in DSP algorithms such as
convolution and correlation. Because it is bit-serial in nature, DA is a slow process; it is fast only when
the number of vector elements is about the same as the word size. The process involves precomputing the
values and storing the results in the LUT, with the input bits as the address. Reducing the LUT size saves
area and also improves system performance.
Two common optimizations reduce the LUT size, and with it the otherwise unreasonable amount of memory:
breaking up the filter into smaller units and offset binary coding.

Breaking up the Filter


The memory requirement of the DA scheme described above grows with the size of the filter. For
example, a 64-tap DA FIR filter requires 2^64 entries in the DA LUT. This is overcome by breaking
up the filter into smaller base DA filtering units that use tractable memory sizes and then summing
the outputs of these units.
As the diagram shows, the unit outputs are summed and the sum is passed to the scaling process. After
scaling, accumulation is carried out, and the accumulator output is fed back to the scaling
process.
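The memory saving from breaking up the filter can be estimated with a short calculation. The helper below is a hypothetical sketch: it assumes each base unit holds one LUT with one entry per address pattern, and the unit sizes of 4 and 8 taps are illustrative choices, not values from the text.

```python
def da_lut_entries(taps, unit_taps=None):
    """Total LUT entries for a DA filter, optionally split into base units.

    Without splitting, one LUT needs 2**taps entries. With splitting, each
    base unit of `unit_taps` taps needs 2**unit_taps entries, plus one
    smaller unit for any leftover taps.
    """
    if unit_taps is None:
        return 1 << taps
    full_units, rest = divmod(taps, unit_taps)
    entries = full_units * (1 << unit_taps)
    if rest:
        entries += 1 << rest
    return entries


single_lut = da_lut_entries(64)        # one LUT for all 64 taps
four_tap_units = da_lut_entries(64, 4)  # sixteen 4-tap base units
eight_tap_units = da_lut_entries(64, 8)  # eight 8-tap base units
```

Under these assumptions the 64-tap filter drops from 2^64 entries to 256 entries with 4-tap units, or 2048 entries with 8-tap units, in exchange for the adders that sum the unit outputs.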
The relation between the input x and the output y of an FIR filter can be expressed as a sum of products:

$$y = \sum_{k=1}^{K} A_k x_k \qquad (2)$$
where x_k is a 2's-complement binary number scaled such that |x_k| < 1, A_k are the fixed filter coefficients and
y is the filter output, also a 2's-complement binary number. The input x_k: {b_k0, b_k1, b_k2, ..., b_k(N-1)} is
represented using a word length of N, and b_k0 is the sign bit. Thus the input can be expressed as

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad (3)$$
Substituting Eq. (3) into Eq. (2),

$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \qquad (4)$$

Simplifying Eq. (4) gives

$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0}) \qquad (5)$$

where K is the number of taps (inputs) and N is the word length of the data.
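The rearrangement in Eq. (5) can be checked numerically with exact rational arithmetic. This is a small sketch under stated assumptions: the helper names, the three coefficients, the three inputs and the word length N = 4 are all hypothetical, and each input is assumed to be exactly representable in N bits.

```python
from fractions import Fraction


def to_bits(x, n):
    """Bits [b_0, ..., b_{N-1}] of x in [-1, 1), with b_0 the sign bit,
    so that x = -b_0 + sum(b_n * 2**-n for n = 1..N-1)."""
    scaled = int(x * (1 << (n - 1)))   # x is assumed exactly representable
    word = scaled & ((1 << n) - 1)     # N-bit two's-complement pattern
    return [(word >> (n - 1 - i)) & 1 for i in range(n)]


def sum_of_products(A, xs):
    """Direct form: y = sum over k of A_k * x_k."""
    return sum(a * x for a, x in zip(A, xs))


def bit_level_form(A, xs, n):
    """Bit-level form: per-bit sums weighted by 2**-n, plus the sign term."""
    bits = [to_bits(x, n) for x in xs]
    weighted = sum(
        sum(a * b[j] for a, b in zip(A, bits)) * Fraction(1, 1 << j)
        for j in range(1, n)
    )
    sign_term = sum(a * -b[0] for a, b in zip(A, bits))
    return weighted + sign_term


# Hypothetical coefficients and inputs with word length N = 4.
A = [Fraction(1, 2), Fraction(-1, 4), Fraction(3, 4)]
xs = [Fraction(3, 8), Fraction(-5, 8), Fraction(1, 4)]
y_direct = sum_of_products(A, xs)
y_bits = bit_level_form(A, xs, 4)
```

Both forms give the same value for these inputs. The inner sum over k depends only on the bit pattern (b_1n, ..., b_Kn), which is why it can be precomputed and stored in a table.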
Figure 4 shows the hardware architecture for the DA-based filter design. The bits of the inputs x are used as
the ROM address, and the precomputed partial products are read out and accumulated at the output. The partial
products stored in the memory are shown in Table 2.
Figure 4 Hardware for DA based filter
FPGA architectures provide LUTs for implementing complex logic applications. Since 75% of the
resources on FPGAs are LUTs, the LUTs must be used efficiently to realize the DWT
architecture. DA logic is adopted for realizing FIR filters that occupy LUTs on the FPGA, and DWT based on
the DA approach has been extensively adopted for FPGA implementation.
The basic DA architecture is shown in Figure 5. With 8 input registers forming the address of the
memory, 256 partial products are precomputed and stored in ROM. The data held in the input registers [W, V,
U, T, S, R, Q, and P], each 16 bits wide, is loaded serially. During this phase the input registers are
configured as SISO registers, and loading the set of 8 registers requires 16x8 clock cycles. Once
the data is loaded, the LSBs of all 8 registers are connected to the address bus of the
LUT, where they select the corresponding memory location. The data available
at that location is read out and accumulated in the adder/subtractor unit. The result obtained at every
clock cycle is shifted right and stored in the accumulator. The contents of the input registers are shifted
out serially, which requires 16 clock cycles. After 16 clock cycles the accumulator holds
the final output Y(n) and the SISO registers are reloaded. To compute the output sample
Y(n+1), a new set of inputs is loaded into the SISO registers, which requires another 16 clock cycles. Once the
new data is loaded, the output sample Y(n+1) is computed in 16 clock cycles. Thus the latency of the
network is (16*8 + 16) clock cycles and the throughput is one output every 32 clock cycles.

The basic FPGA architecture consists of configurable logic blocks (CLBs). Each CLB consists of 4 LUTs and
can be configured as a 16x4 ROM; to store data of size 256x8, 32 LUTs or 8 CLBs must be configured. The basic
DA architecture thus eliminates the multipliers required to compute the filter outputs, replacing them with ROM.

Figure 5 Basic DA architecture

The DA-DWT architecture is built using the structure shown in Figure 5. As there are 9
filter coefficients in the low pass filter, it requires a ROM of size 512x8 (filter coefficients
represented by 8 bits), and the high pass filter requires a memory of size 128x8. With a 16-bit
input register, the latency in computing the low pass filter output is 160 clock cycles with a
throughput of one output every 32 clock cycles; for the high pass output, the latency is 128
clock cycles with the same throughput. The limitations of this basic architecture are its
high latency and the large memory space (LUT) it occupies. In order to reduce the latency
and increase throughput, a modified architecture is proposed.
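The ROM sizes and cycle counts quoted for the basic architecture follow from the number of taps and the register width. A minimal sketch of that arithmetic, assuming the serial load takes one clock cycle per bit per register as described; the function names are hypothetical:

```python
def da_rom_entries(taps):
    """One ROM entry per address pattern formed by one bit from each tap."""
    return 1 << taps


def da_latency_cycles(taps, word_bits=16):
    """Serial load of all input registers, then one cycle per bit to compute."""
    return word_bits * taps + word_bits


def da_output_period_cycles(word_bits=16):
    """Cycles between outputs: load one new word, then compute."""
    return 2 * word_bits


# Tap counts taken from the text: 8 (basic), 9 (low pass), 7 (high pass).
summary = {
    taps: (da_rom_entries(taps), da_latency_cycles(taps))
    for taps in (8, 9, 7)
}
```

For 8, 9 and 7 taps this gives ROM depths of 256, 512 and 128 entries and latencies of 144, 160 and 128 clock cycles, with one output every 32 clock cycles in each case.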
MODIFIED DA BASED DWT
There are 9 filter coefficients for the low pass filter and 7 filter coefficients for the high pass
filter during the analysis phase. For reconstruction, there are 7 coefficients for the low pass filter
and 9 coefficients for the high pass filter; thus the 9/7 filter is bi-orthogonal and symmetric.
Representing the fractional numbers shown in Table 2 requires 14-bit numbers, so for FPGA
implementation the filter coefficients must be represented as fixed-point or floating-point numbers.
In this work, we have used fixed-point number representation. The filter coefficients are first scaled
using a scaling factor of 1024 for the low pass filter and a scaling factor of 256 for the high
pass filter. The scaled values are rounded to the nearest integer. Each scaled and
rounded number is represented in two's complement, which covers both positive and
negative coefficients. Table 3 presents the modified filter
coefficients.
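The scale, round and two's-complement steps can be sketched as follows. The coefficient values and the 12-bit width below are illustrative assumptions, not entries from Table 2 or Table 3; only the scaling factor of 1024 comes from the text.

```python
def quantize(coeff, scale):
    """Scale a fractional coefficient and round to the nearest integer."""
    return round(coeff * scale)


def twos_complement(value, width):
    """Bit pattern of a signed integer in `width`-bit two's complement."""
    low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {width} bits")
    return value & ((1 << width) - 1)


# Hypothetical coefficients, chosen only to show one positive and one
# negative case; the low pass scaling factor of 1024 is from the text.
LOW_PASS_SCALE = 1024
WIDTH = 12  # assumed word width for this illustration

positive = quantize(0.6029, LOW_PASS_SCALE)
negative = quantize(-0.0782, LOW_PASS_SCALE)
negative_bits = format(twos_complement(negative, WIDTH), "012b")
```

Dividing a filter output by the same scaling factor (a right shift by 10 bits for 1024, or 8 bits for 256) would return it to the original coefficient scale.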
