Da Ramsow
Distributed arithmetic (DA) is a signal processing technique used to design and implement FIR filters.
It is a bit-level rearrangement of the multiply-accumulate operation that hides the multiplications. DA
reduces the total hardware required for multiply-accumulate operations, which makes it well suited to
FPGA designs [21]. It computes a sum of products using only shift and add operations, thus avoiding
multipliers. Equation 1 describes an FIR filter of length N.
$$Y = \sum_{k=0}^{N-1} A_k X_k \qquad \text{Eq. (1)}$$
where Y is the response of the network, A_k is the k-th filter coefficient, and X_k is the k-th input
variable. DA performs multiplication using lookup-table-based schemes. Distributed arithmetic is an
efficient method for computing the inner product operation, which constitutes the core of the discrete
wavelet transform. In this section we briefly describe the mathematical derivation of the distributed
arithmetic algorithm.
The mathematical derivation of distributed arithmetic is simple: a mix of Boolean and ordinary
algebra. Let the variable Y hold the result of an inner product operation between a data vector x and a
coefficient vector a. The conventional representation of the inner product operation is given as follows:
where the input data words x_i are represented in 2's complement in order to bound number growth
under multiplication. The variable x_ij is the j-th bit of the word x_i and is Boolean, B is the number
of bits in each input data word, and x_i0 is the sign bit. Interchanging the order of summation in
Eq. (4), we get:
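A sketch of the standard derivation in the notation defined above, assuming N data words scaled so that |x_i| < 1 and taking F_j to be the sum of the coefficients selected by bit j of every word:

$$Y = \sum_{i=1}^{N} a_i x_i, \qquad x_i = -x_{i0} + \sum_{j=1}^{B-1} x_{ij} 2^{-j}$$

$$Y = \sum_{i=1}^{N} a_i \left[ -x_{i0} + \sum_{j=1}^{B-1} x_{ij} 2^{-j} \right]$$

$$Y = -F_0 + \sum_{j=1}^{B-1} F_j 2^{-j}, \qquad F_j = \sum_{i=1}^{N} a_i x_{ij}$$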
Distributed arithmetic is based on the observation that the function F_j can take only 2^N different
values, which can be pre-computed offline and stored in a look-up table. Bit j of each data word x_i is
then used to address this look-up table. Eq. (5) shows that only three different operations are required
to calculate the inner product: first, a look-up to obtain the value of F_j; then an addition or
subtraction; and finally a division by two, which can be realized by a shift. In its most direct form, a
distributed arithmetic computation is bit-serial in nature, i.e., each bit of the input samples must be
indexed in turn before a new output sample becomes available. When the input samples are represented
with B bits of precision, B clock cycles are required to complete an inner-product calculation. An
example of a distributed arithmetic implementation of a 4-element inner product operation is shown in
Figure 1, along with the conventional implementation of the same operation.
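The look-up, add/subtract and shift steps described above can be sketched in software as follows. This is a minimal illustration, not the hardware design of Figure 1: the function names are hypothetical, the coefficients and samples are assumed to be integers, and the bit slices are weighted by left shifts (LSB first) instead of halving the accumulator, so that the integer result stays exact.

```python
def build_lut(coeffs):
    """Precompute F for every N-bit address; bit i of the address selects coeffs[i]."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_inner_product(coeffs, samples, bits):
    """Bit-serial DA inner product of integer coeffs with B-bit two's-complement samples."""
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]        # two's-complement bit patterns
    acc = 0
    for j in range(bits):                      # one LUT access per "clock cycle"
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> j) & 1) << i        # bit j of every sample forms the address
        term = lut[addr] << j                  # weight the slice by 2^j
        if j == bits - 1:
            acc -= term                        # the sign-bit slice has negative weight
        else:
            acc += term
    return acc


# 4-element example with 4-bit samples (range -8..7)
coeffs = [3, -5, 7, 2]
samples = [5, -3, -8, 7]
result = da_inner_product(coeffs, samples, bits=4)
direct = sum(c * s for c, s in zip(coeffs, samples))
```

Here `result` and `direct` agree, and the loop runs exactly `bits` times, matching the B clock cycles mentioned above.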
DA implementation of an FIR filter
The LUT stores all possible partial sums over the FIR filter coefficients. Input samples are presented
to the input parallel-to-serial shift register at the input signal sample rate. As the input sample is
serialized, the bit-wide output is presented to the bit-serial shift register cascade, 1 bit at a time. The
cascade stores the input sample history in a bit-serial format and is used in forming the required
inner-product computation. The bit outputs of the shift register cascade are used as address inputs to
the look-up table. Partial results from the look-up table are summed by the scaling accumulator to form
the final result at the filter output port. Since the LUT size in a distributed arithmetic implementation
increases exponentially with the number of coefficients, the LUT access time can become a bottleneck
for the speed of the whole system when the LUT is large. Hence we decomposed the 8-bit LUT shown in
Figure 6 into two 4-bit LUTs and added their outputs using a two-input accumulator. The modified
partitioned-LUT architecture is shown in Figure 7.
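The effect of the partitioning can be checked with a short sketch. The coefficient values below are hypothetical placeholders, and the helper names are not taken from the original design; the point is only that two 16-entry tables plus one addition reproduce every entry of the 256-entry table.

```python
def build_lut(coeffs):
    """Table of coefficient sums; bit i of the address selects coeffs[i]."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


coeffs = [12, -7, 33, 5, -20, 9, 4, -1]   # hypothetical 8-tap coefficients

full_lut = build_lut(coeffs)              # one table with 8 address bits: 256 entries
low_lut = build_lut(coeffs[:4])           # taps 0..3: 16 entries
high_lut = build_lut(coeffs[4:])          # taps 4..7: 16 entries


def partitioned_lookup(addr):
    """Split the 8-bit address into two 4-bit halves and add the two partial sums."""
    return low_lut[addr & 0xF] + high_lut[addr >> 4]


matches = all(partitioned_lookup(a) == full_lut[a] for a in range(256))
```

The storage drops from 256 entries to 32, at the cost of one extra adder in the data path.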
DA is used to compute the MAC operations that are common in DSP algorithms such as convolution
and correlation. Because it is bit-serial in nature, DA is a slow process in its basic form; it is
considered fast when the number of vector elements is comparable to the word size. The method
precomputes the partial sums and stores them in the LUT, with the input bits serving as the address.
Reducing the LUT size saves area and is also reported to improve system performance.
Because a direct LUT can require an unreasonable amount of memory, two optimizations are commonly
used to reduce its size: breaking the filter up into smaller units, and offset binary coding.
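Offset binary coding is only named here, so the following is a sketch of its textbook form rather than the scheme used in this work. It assumes integer coefficients and samples, and the function names are hypothetical. Each input bit is recoded as +1 or -1, which makes the table antisymmetric (complementing the address negates the entry), so only half of the 2^K entries need to be stored.

```python
def build_obc_lut(coeffs):
    """Half-size table: entry = sum(+A_k if bit k of addr is 1 else -A_k),
    stored only for addresses whose top bit is 0."""
    k = len(coeffs)
    return [sum(c if (addr >> i) & 1 else -c for i, c in enumerate(coeffs))
            for addr in range(1 << (k - 1))]


def obc_lookup(lut, addr, k):
    """Recover the full-table value from the half table using Q(~addr) = -Q(addr)."""
    if addr >> (k - 1):                          # top address bit set
        return -lut[~addr & ((1 << k) - 1)]
    return lut[addr]


def obc_inner_product(coeffs, samples, bits):
    """Bit-serial inner product with B-bit two's-complement samples and an OBC table."""
    k = len(coeffs)
    lut = build_obc_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]
    acc = -sum(coeffs)                           # constant offset term of the recoding
    for j in range(bits):
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> j) & 1) << i
        q = obc_lookup(lut, addr, k) << j
        acc += -q if j == bits - 1 else q        # the sign-bit slice is subtracted
    return acc // 2                              # acc holds exactly twice the result


coeffs = [3, -5, 7, 2]
samples = [5, -3, -8, 7]
obc_result = obc_inner_product(coeffs, samples, bits=4)
```

For this 4-tap example the table holds 8 entries instead of 16, and `obc_result` equals the directly computed inner product.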
$$y = \sum_{k=1}^{K} A_k x_k \qquad (2)$$
where x_k is a 2's-complement binary number scaled such that |x_k| < 1, A_k are the fixed filter
coefficients, and y is the filter output, also a 2's-complement binary number. The input
x_k: {b_k0, b_k1, b_k2, ..., b_k(N-1)} is represented with a word length of N, where b_k0 is the sign
bit. The input can thus be expressed as
$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad (3)$$
Substituting Eq. (3) into Eq. (2),
$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \qquad (4)$$
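Equations (3) and (4) can be checked numerically with a small sketch. The helper names and the example values are hypothetical; exact fractions are used so that no rounding enters the comparison, and each input is assumed to be a multiple of 2^-(N-1) with |x_k| < 1.

```python
from fractions import Fraction


def to_bits(x, n):
    """Bits [b_k0, ..., b_k(N-1)] of x in N-bit two's complement; b_k0 is the sign bit."""
    code = int(x * (1 << (n - 1))) & ((1 << n) - 1)
    return [(code >> (n - 1 - i)) & 1 for i in range(n)]


def from_bits(bits):
    """Eq. (3): x_k = -b_k0 + sum over n >= 1 of b_kn * 2^-n."""
    return -bits[0] + sum(Fraction(b, 1 << n) for n, b in enumerate(bits) if n >= 1)


def eq4(coeffs, bit_rows):
    """Eq. (4): y = sum over k of A_k * (-b_k0 + sum over n of b_kn * 2^-n)."""
    return sum(a * from_bits(bits) for a, bits in zip(coeffs, bit_rows))


coeffs = [2, -3]                               # hypothetical A_k
inputs = [Fraction(-3, 8), Fraction(5, 8)]     # hypothetical x_k, word length N = 4
bit_rows = [to_bits(x, 4) for x in inputs]
y = eq4(coeffs, bit_rows)
```

With these values `bit_rows` is `[[1, 1, 0, 1], [0, 1, 0, 1]]`, and `y` equals the direct sum of products.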
The DA-DWT architecture is built using the structure shown in Figure 5. As the low-pass
filter has 9 coefficients, it requires a ROM of size 512x8 (filter coefficients represented
by 8 bits); the high-pass filter, with 7 coefficients, requires a memory of size 128x8. With
a 16-bit input register, the low-pass filter output has a latency of 160 clock cycles and a
throughput of one output every 32 clock cycles; the high-pass output has a latency of 128
clock cycles and a throughput of one output every 32 clock cycles. The limitations of this
basic architecture are its high latency and the large memory space (LUT) it occupies. In
order to reduce the latency and increase throughput, a modified architecture is proposed.
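The ROM sizes quoted above follow from one address bit per filter tap. A minimal check, with hypothetical helper names:

```python
def rom_words(taps):
    """Number of LUT words when the ROM is addressed by one bit from each tap."""
    return 1 << taps


def rom_bits(taps, word_width):
    """Total storage in bits for a ROM with the given word width."""
    return rom_words(taps) * word_width


low_pass_words = rom_words(9)     # 9 low-pass coefficients
high_pass_words = rom_words(7)    # 7 high-pass coefficients
low_pass_bits = rom_bits(9, 8)    # 8-bit words, as stated in the text
```

This gives 512 and 128 words respectively, matching the 512x8 and 128x8 memories.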
MODIFIED DA BASED DWT
There are 9 filter coefficients for the low-pass filter and 7 for the high-pass filter
during the analysis phase. For reconstruction, there are 7 coefficients for the low-pass
filter and 9 for the high-pass filter; the 9/7 filter is thus biorthogonal and symmetric.
Representing the fractional numbers shown in Table 2 requires 14-bit numbers, so for FPGA
implementation the filter coefficients must be represented in either fixed-point or
floating-point format. In this work, we have used fixed-point representation. The filter
coefficients are first scaled by a factor of 1024 for the low-pass filter and by a factor
of 256 for the high-pass filter. The scaled values are rounded to the nearest integer. The
scaled and rounded numbers are represented in two's complement, which accommodates both
positive and negative coefficients. Table 3 presents the modified filter coefficients.
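The scale, round and two's-complement steps can be sketched as follows. The coefficient values are illustrative placeholders and are not taken from Table 2 or Table 3; the 12-bit word width and the function names are likewise assumptions made for this example.

```python
import math


def quantize(coeff, scale):
    """Scale a fractional coefficient and round to the nearest integer (halves round up)."""
    return math.floor(coeff * scale + 0.5)


def to_twos_complement(value, width):
    """Encode a signed integer as a width-bit two's-complement code."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    return value & ((1 << width) - 1)


def from_twos_complement(code, width):
    """Decode a width-bit two's-complement code back to a signed integer."""
    return code - (1 << width) if code >> (width - 1) else code


sample_coeffs = [0.6029, 0.2669, -0.0782]                  # illustrative values only
scaled = [quantize(c, 1024) for c in sample_coeffs]        # low-pass scaling factor
codes = [to_twos_complement(v, 12) for v in scaled]        # assumed 12-bit words
```

Decoding each entry of `codes` with `from_twos_complement` returns the corresponding entry of `scaled`, so negative coefficients survive the encoding.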