4xDSP IC DA
DISTRIBUTED ARITHMETIC
Distributed arithmetic is an efficient procedure for computing inner products between a fixed and a variable data vector. The basic principle is owed to Croisier et al. (Patent), and Peled and Liu have independently presented a similar method. Consider the sum-of-products (inner products)
$$y = \mathbf{a}^T \mathbf{x} = \sum_{i=1}^{N} a_i x_i$$
where the coefficients $a_i$, $i = 1, 2, \ldots, N$, are fixed. A two's-complement representation is used for the data components, which are scaled so that $|x_i| \le 1$. The inner product can be rewritten as
$$y = \sum_{i=1}^{N} a_i \left[ -x_{i0} + \sum_{k=1}^{W_d-1} x_{ik} 2^{-k} \right]$$
where $x_{ik}$ is the kth bit in $x_i$. By interchanging the order of the two summations we get
$$y = -\sum_{i=1}^{N} a_i x_{i0} + \sum_{k=1}^{W_d-1} \left[ \sum_{i=1}^{N} a_i x_{ik} \right] 2^{-k}$$
which can be written
$$y = -F_0(x_{10}, x_{20}, \ldots, x_{N0}) + \sum_{k=1}^{W_d-1} F_k(x_{1k}, x_{2k}, \ldots, x_{Nk}) 2^{-k}$$
where
$$F_k(x_{1k}, x_{2k}, \ldots, x_{Nk}) = \sum_{i=1}^{N} a_i x_{ik}$$
$F_k$ is a function of N binary variables, the ith variable being the kth bit in the data $x_i$. Since $F_k$ can take on only a finite number of values, $2^N$, it can be computed and stored in a look-up table. This table can be implemented using a ROM (Read-Only Memory). Using Horner's method for evaluating a polynomial for x = 0.5, we can rewrite the inner product as
$$y = ((\cdots((0 + F_{W_d-1})2^{-1} + F_{W_d-2})2^{-1} + \cdots + F_2)2^{-1} + F_1)2^{-1} - F_0$$
DSP Integrated Circuits
Lars Wanhammar, Department of Electrical Engineering, Linköping University, http://www.es.isy.liu.se/
Inputs $x_1, x_2, \ldots, x_N$ are shifted bit-serially out from the shift registers with the least-significant bit first. The bits $x_{ik}$ are used as an address to the ROM storing the look-up table.
[Figure: distributed arithmetic unit. Shift registers holding x1, ..., xN address a ROM of 2^N words with word length WROM; the ROM output feeds an Add/Sub shift-accumulator.]
The computational time is $W_d$ clock cycles. The word length in the ROM, $W_{ROM}$, depends on the $F_k$ with the largest magnitude and the coefficient word length, $W_c$, and
$$W_{ROM} \le W_c + \log_2(N)$$
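The bit-serial procedure described above can be sketched in software. The following is a minimal model, not the hardware itself: it assumes the data are given as Wd-bit two's-complement integers that represent the fractions x_i = data_i / 2^(Wd-1), and it uses exact rational arithmetic so that the result can be compared with the directly computed inner product. The helper names (`bit`, `build_rom`, `address`, `da_inner_product`) are illustrative and do not come from the slides.

```python
from fractions import Fraction

def bit(x, k, wd):
    """Bit k of the wd-bit two's-complement integer x (k = 0 is the sign bit)."""
    return (x >> (wd - 1 - k)) & 1

def build_rom(coeffs):
    """Look-up table with 2**N words: F(b_1..b_N) = sum of a_i * b_i."""
    n = len(coeffs)
    rom = []
    for addr in range(2 ** n):
        # The most significant address bit belongs to the first coefficient.
        bits = [(addr >> (n - 1 - i)) & 1 for i in range(n)]
        rom.append(sum((a * b for a, b in zip(coeffs, bits)), Fraction(0)))
    return rom

def address(data, k, wd):
    """ROM address formed by bit k of every data word."""
    addr = 0
    for x in data:
        addr = (addr << 1) | bit(x, k, wd)
    return addr

def da_inner_product(coeffs, data, wd):
    """Shift-accumulate the ROM words, least-significant bit first."""
    rom = build_rom(coeffs)
    acc = Fraction(0)
    for k in range(wd - 1, 0, -1):             # cycles for bits Wd-1, ..., 1
        acc = (acc + rom[address(data, k, wd)]) / 2
    return acc - rom[address(data, 0, wd)]     # sign-bit cycle: subtract F_0
```

The loop body corresponds to one clock cycle, so the model needs Wd iterations in total, matching the computational time stated above.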
Example 11.11 Determine the values to be stored in the ROM for the inner product
$$y = a_1 x_1 + a_2 x_2 + a_3 x_3$$
where $a_1 = 33/128 = (0.0100001)_{2C}$, $a_2 = 85/128 = (0.1010101)_{2C}$, and $a_3 = -11/128 = (1.1110101)_{2C}$.

x1 x2 x3   Fk
0  0  0    0
0  0  1    a3
0  1  0    a2
0  1  1    a2 + a3
1  0  0    a1
1  0  1    a1 + a3
1  1  0    a1 + a2
1  1  1    a1 + a2 + a3

The shift-accumulator must be able to add correctly the largest possible value obtained in the accumulator register and in the ROM. The largest value in the accumulator register is obtained when the largest (magnitude) value stored in the ROM is repeatedly accumulated. Thus, at the last clock cycle, corresponding to the sign bit, the value in REG is
$$y = ((\cdots((0 + F_{max})2^{-1} + F_{max})2^{-1} + \cdots + F_{max})2^{-1} - F_{max}$$
Hence, the shift-accumulator must be able to add two numbers of magnitude $F_{max}$. The necessary number range is ±1. The word length in the shift-accumulator must be extended with one guard bit for overflow detection, which gives 1 + 8 = 9 bits.
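As a check on Example 11.11, the eight ROM words can be computed numerically. This is only an illustrative sketch: the function `to_twos_complement` and the dictionary layout are assumptions made here, and the bit strings use one sign bit followed by seven fractional bits, as in the coefficient values given in the example.

```python
from fractions import Fraction

# Coefficients from Example 11.11.
a = [Fraction(33, 128), Fraction(85, 128), Fraction(-11, 128)]

def to_twos_complement(value, frac_bits=7):
    """Format a value in [-1, 1) as a sign bit plus frac_bits fractional bits."""
    scaled = int(value * 2 ** frac_bits)          # exact for multiples of 2**-frac_bits
    word = scaled & (2 ** (frac_bits + 1) - 1)    # wrap negative values
    s = format(word, f"0{frac_bits + 1}b")
    return f"{s[0]}.{s[1:]}"

# ROM content: address bits (x1, x2, x3) -> F = a1*x1 + a2*x2 + a3*x3.
rom = {}
for addr in range(8):
    x1, x2, x3 = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
    rom[(x1, x2, x3)] = a[0] * x1 + a[1] * x2 + a[2] * x3

for bits, f in rom.items():
    print(bits, f, to_twos_complement(f))

f_max = max(abs(f) for f in rom.values())         # largest magnitude stored in the ROM
```

With these coefficients the largest magnitude is a1 + a2 = 118/128, which is below 1 and therefore fits in the 8-bit ROM word assumed in the example.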
Example 11.12 A second-order section in direct form I can be implemented by using only a single PE based on distributed arithmetic.
[Figure: second-order section in direct form I with coefficients a0, a1, a2, b1, b2 and delay elements T holding x(n-1), x(n-2), y(n-1), y(n-2), together with its distributed arithmetic implementation: shift registers for x(n), x(n-1), x(n-2), y(n-1), y(n-2) address a ROM, followed by pipelining D flip-flops and a shift-accumulator.]
A set of D flip-flops has been placed between the ROM and the shift-accumulator to allow the two operations to overlap in time, i.e., the two operations are pipelined. The number of words in the ROM is only 2^5 = 32.
In both cases, the same type of shift-accumulator can be used. Hence, the distributed arithmetic unit essentially consists of a serial/parallel multiplier augmented by a ROM.
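To make Example 11.12 concrete, the following sketch runs a second-order section through a single distributed arithmetic unit with a 32-word ROM. Several things are assumptions made for illustration and are not stated in the slides: the difference equation is taken as y(n) = a0 x(n) + a1 x(n-1) + a2 x(n-2) + b1 y(n-1) + b2 y(n-2), the output is quantized by truncation and saturated to Wd bits before it is fed back, and the coefficient values in the usage note are arbitrary.

```python
import math
from fractions import Fraction
from itertools import product

def build_rom(coeffs):
    """ROM indexed by a tuple of N address bits: F = sum of a_i * b_i."""
    return {bits: sum((c * b for c, b in zip(coeffs, bits)), Fraction(0))
            for bits in product((0, 1), repeat=len(coeffs))}

def da(rom, data, wd):
    """Distributed arithmetic inner product; data are wd-bit two's-complement ints."""
    acc = Fraction(0)
    for k in range(wd - 1, 0, -1):                       # LSB first
        bits = tuple((x >> (wd - 1 - k)) & 1 for x in data)
        acc = (acc + rom[bits]) / 2
    sign_bits = tuple((x >> (wd - 1)) & 1 for x in data)
    return acc - rom[sign_bits]                          # subtract F_0

def biquad(coeffs, samples, wd):
    """coeffs = (a0, a1, a2, b1, b2); returns the quantized outputs as integers."""
    rom = build_rom(coeffs)                              # 2**5 = 32 words
    scale = 2 ** (wd - 1)
    x1 = x2 = y1 = y2 = 0
    out = []
    for x0 in samples:
        y = da(rom, (x0, x1, x2, y1, y2), wd)
        y0 = math.floor(y * scale)                       # truncate to wd bits (assumed)
        y0 = max(-scale, min(scale - 1, y0))             # saturate (assumed)
        out.append(y0)
        x1, x2 = x0, x1
        y1, y2 = y0, y1
    return out
```

For instance, with the hypothetical coefficients (1/4, 1/2, 1/4, 1/2, -1/4) and Wd = 8, an input impulse of 64 (that is, 0.5) gives the first four outputs 16, 40, 32, and 6.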
Example 11.13 A linear-phase FIR structure can also be implemented using distributed arithmetic. Assume that N is even.
[Figure: linear-phase FIR filter. The input x(n) feeds a delay line of T elements; bit-serial adders (FA with D flip-flops) sum the symmetrically placed values; pipelining D flip-flops precede the ROM, and further D flip-flops separate the ROM from the shift-accumulator that produces y(n).]
N/2 bit-serial adders (subtractors) are used to sum the symmetrically placed values in the delay line. This reduces the number of terms in the inner product. Only 64 words are required, whereas 2^12 = 4096 words are required for the general case, e.g., a nonlinear-phase FIR filter. For higher-order FIR filters the reduction in the number of terms by 50% is essential.
The logic circuitry has been pipelined by introducing D flip-flops between the adders (subtractors) and the ROM, and between the ROM and the shift-accumulator.
The number of words in the ROM is 2^N, where N is the number of terms in the inner product. The chip area for the ROM is small for inner products with up to 5 to 6 terms. The basic approach is useful for up to 10 to 11 terms.
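The saving in Example 11.13 comes from folding the delay line before the table look-up. A small sketch, assuming a symmetric impulse response h(i) = h(N-1-i) and a window in which `window[i]` holds x(n-i), shows that the folded inner product has half as many terms and gives the same result; the function names are illustrative only.

```python
def fir_direct(h, window):
    """General case: one inner-product term per tap."""
    return sum(hi * xi for hi, xi in zip(h, window))

def fir_folded(h_half, window):
    """Linear-phase case: add symmetrically placed samples first."""
    n = len(window)
    folded = [window[i] + window[n - 1 - i] for i in range(n // 2)]
    return sum(hi * si for hi, si in zip(h_half, folded))

def rom_words(terms):
    """Number of ROM words needed for an inner product with the given number of terms."""
    return 2 ** terms
```

For N = 12 this gives `rom_words(12) == 4096` for the general case and `rom_words(6) == 64` after folding. Note that each folded sum needs one extra bit of word length compared with a single sample.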
[Figure: chain of full adders (FA) with D flip-flops; four ROMs feeding a bit-parallel or digit-serial adder tree; control labels Set, Sub, and Add.]
In the clock cycle corresponding to the sign bit of the data, F0 should be subtracted. This is done by adding -F0, i.e., by inverting all the bits in F0 using the XOR gates and the signal s, and adding one bit in the least-significant position.
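The subtraction by inversion can be illustrated with ordinary integer operations. This is a behavioral sketch only, assuming `width`-bit two's-complement words stored as non-negative integers; the function name `add_or_subtract` is hypothetical.

```python
def add_or_subtract(acc, f, s, width):
    """Add f to acc when s = 0, subtract it when s = 1 (width-bit two's complement)."""
    mask = (1 << width) - 1
    f_xor = f ^ (mask if s else 0)     # XOR every bit of f with the control signal s
    return (acc + f_xor + s) & mask    # s also supplies the extra bit in the LSB position
```

For example, `add_or_subtract(10, 3, 1, 8)` returns 7, and `add_or_subtract(0, 5, 1, 8)` returns 251, which is the 8-bit two's-complement code for -5.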
After F0 has been added, the most significant part of the inner product must be shifted out of the accumulator. This can be done by accumulating zeros. The number of clock cycles for one inner product is then Wd + WROM. A more efficient scheme is to free the carry-save adders in the accumulator by loading the sum and carry bits of the carry-save adders into two shift registers. The outputs from these can be added by a single carry-save adder.
This scheme effectively doubles the throughput since two inner products are computed concurrently for a small increase in chip area.
[Figure: shift-accumulator. ROM outputs f0 to f4 pass through XOR gates (=1) controlled by the signal s into carry-save full adders (FA) with D flip-flops; multiplexers (MUX) load the sum and carry bits into shift registers, which are added by a single FA to form the most-significant part (MSP) of the output, while the least-significant part (LSP) is shifted out directly.]
[Figure: ROM partitioning. Two ROMs of 2^(N/2) words each, addressed by x1, x2, ..., xN/2 and xN/2+1, xN/2+2, ..., xN; their outputs are added and passed via a register to the Add/Sub unit.]
Memory Coding
The second approach is based on a special coding of the ROM content. Memory size can be halved by using the ingenious scheme based on the identity
$$x = \frac{1}{2}[x - (-x)] = \frac{1}{2}\left[ -x_0 + \sum_{k=1}^{W_d-1} x_k 2^{-k} - \left( -\bar{x}_0 + \sum_{k=1}^{W_d-1} \bar{x}_k 2^{-k} + 2^{-W_d+1} \right) \right]$$
$$= -(x_0 - \bar{x}_0)2^{-1} + \sum_{k=1}^{W_d-1} (x_k - \bar{x}_k)2^{-k-1} - 2^{-W_d}$$
Notice that $(x_k - \bar{x}_k)$ can only take on the values -1 or +1. Inserting this expression into the inner product yields
$$y = \sum_{k=1}^{W_d-1} F_k(x_{1k}, \ldots, x_{Nk})2^{-k-1} - F_0(x_{10}, \ldots, x_{N0})2^{-1} + F(0, \ldots, 0)2^{-W_d}$$
where
$$F_k(x_{1k}, \ldots, x_{Nk}) = \sum_{i=1}^{N} a_i (x_{ik} - \bar{x}_{ik})$$
Notice that only half the values are needed, since the other half can be obtained by changing the signs. To exploit this redundancy we make the address modification shown to the right in the table below, with u1 = x1 ⊕ x2, u2 = x1 ⊕ x3, and A/S = x1 ⊕ x_sign-bit.

x1 x2 x3   Fk                u1 u2   A/S
0  0  0    -a1 - a2 - a3     0  0    A
0  0  1    -a1 - a2 + a3     0  1    A
0  1  0    -a1 + a2 - a3     1  0    A
0  1  1    -a1 + a2 + a3     1  1    A
1  0  0    +a1 - a2 - a3     1  1    S
1  0  1    +a1 - a2 + a3     1  0    S
1  1  0    +a1 + a2 - a3     0  1    S
1  1  1    +a1 + a2 + a3     0  0    S

Antisymmetry: complementing all address bits changes only the sign of the stored value.
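The halved-ROM scheme can be checked with a small software model. The sketch below is an illustration under stated assumptions: data are Wd-bit two's-complement integers representing data_i / 2^(Wd-1), the ROM stores only the entries with x1 = 0, and the accumulator is preloaded with F(0, ..., 0) so that this term ends up with weight 2^(-Wd). All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def bit(x, k, wd):
    """Bit k of the wd-bit two's-complement integer x (k = 0 is the sign bit)."""
    return (x >> (wd - 1 - k)) & 1

def build_half_rom(coeffs):
    """2**(N-1) words: F(0, u_1, ..., u_{N-1}) with F = sum of a_i * (b_i - not b_i)."""
    rest = coeffs[1:]
    return {u: -coeffs[0] + sum((a * (2 * b - 1) for a, b in zip(rest, u)), Fraction(0))
            for u in product((0, 1), repeat=len(rest))}

def da_halved(coeffs, data, wd):
    rom = build_half_rom(coeffs)
    acc = rom[(0,) * (len(coeffs) - 1)]          # preload F(0, ..., 0)
    for k in range(wd - 1, -1, -1):              # LSB first, sign bit last
        x1 = bit(data[0], k, wd)
        u = tuple(x1 ^ bit(x, k, wd) for x in data[1:])   # modified address
        subtract = x1 ^ (1 if k == 0 else 0)     # A/S = x1 XOR sign-bit flag
        f = rom[u]
        acc = (acc - f if subtract else acc + f) / 2
    return acc
```

For three coefficients the table has four words instead of eight, and the result agrees with the directly computed inner product for every combination of 4-bit inputs.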
[Figure: Distributed arithmetic with halved ROM. Only N-1 variables are used to address the memory. The inputs x1, x2, ..., xN pass through XOR gates (=1) to a ROM of 2^(N-1) words with word length WROM; a further XOR gate with x_sign-bit controls the Add/Sub unit.]
COMPLEX MULTIPLIERS
The classical solution requires 3 real multiplications and two adder networks. Let
$$X = a + jb$$
and
$$K = c + jd$$
where K is the fixed coefficient and X is the variable. Once again we use the identity
$$x = \frac{1}{2}[x - (-x)] = -(x_0 - \bar{x}_0)2^{-1} + \sum_{k=1}^{W_d-1} (x_k - \bar{x}_k)2^{-k-1} - 2^{-W_d}$$
for the real and imaginary parts of X, which gives
$$\begin{aligned}
KX = {} & \left\{ -c(a_0 - \bar{a}_0)2^{-1} + \sum_{i=1}^{W_d-1} c(a_i - \bar{a}_i)2^{-i-1} - c\,2^{-W_d} \right\} \\
& - \left\{ -d(b_0 - \bar{b}_0)2^{-1} + \sum_{i=1}^{W_d-1} d(b_i - \bar{b}_i)2^{-i-1} - d\,2^{-W_d} \right\} \\
& + j\left\{ -d(a_0 - \bar{a}_0)2^{-1} + \sum_{i=1}^{W_d-1} d(a_i - \bar{a}_i)2^{-i-1} - d\,2^{-W_d} \right\} \\
& + j\left\{ -c(b_0 - \bar{b}_0)2^{-1} + \sum_{i=1}^{W_d-1} c(b_i - \bar{b}_i)2^{-i-1} - c\,2^{-W_d} \right\} \\
= {} & \left\{ -F_1(a_0, b_0)2^{-1} + \sum_{i=1}^{W_d-1} F_1(a_i, b_i)2^{-i-1} + F_1(0, 0)2^{-W_d} \right\} \\
& + j\left\{ -F_2(a_0, b_0)2^{-1} + \sum_{i=1}^{W_d-1} F_2(a_i, b_i)2^{-i-1} + F_2(0, 0)2^{-W_d} \right\}
\end{aligned}$$
Hence, the real and imaginary parts of the product can be computed using just two distributed arithmetic units. The binary functions F1 and F2 can be stored in a ROM, addressed by the bits ai and bi. The ROM content is

ai  bi   F1          F2
0   0    -(c - d)    -(c + d)
0   1    -(c + d)    (c - d)
1   0    (c + d)     -(c - d)
1   1    (c - d)     (c + d)

Antisymmetry: complementing both address bits changes only the sign of F1 and F2.
It is obvious from the table that only two coefficients are needed, (c + d) and (c - d). The appropriate coefficients can be directed to the accumulators via a 2:2 multiplexer. If ai ⊕ bi = 1, the F values are applied directly to the accumulators, and if ai ⊕ bi = 0, the F values are interchanged. The F values are either added to, or subtracted from, the accumulator registers depending on the data bits ai and bi.
[Figure: complex multiplier. The coefficients (C + D) and (C - D) are routed through a MUX controlled by ai ⊕ bi. F1 feeds the shift-accumulator for the real part, with Add/Sub controlled by ai; F2 feeds the shift-accumulator for the imaginary part, with Add/Sub controlled by bi.]
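The multiplexer and add/subtract rules can be verified with a short behavioral model. The sketch makes several assumptions for illustration: a and b are Wd-bit two's-complement integers representing a / 2^(Wd-1) and b / 2^(Wd-1), the accumulators are preloaded with F1(0, 0) and F2(0, 0), and exact rational arithmetic replaces the fixed-point shift-accumulators. The function name is hypothetical.

```python
from fractions import Fraction

def bit(x, k, wd):
    """Bit k of the wd-bit two's-complement integer x (k = 0 is the sign bit)."""
    return (x >> (wd - 1 - k)) & 1

def da_complex_multiply(c, d, a, b, wd):
    """Return (real, imag) of (c + jd)(a + jb) using only (c + d) and (c - d)."""
    cpd, cmd = Fraction(c) + d, Fraction(c) - d   # the two stored coefficients
    re = -cmd                                     # F1(0, 0) = -(c - d)
    im = -cpd                                     # F2(0, 0) = -(c + d)
    for k in range(wd - 1, -1, -1):               # LSB first, sign bit last
        ai, bi = bit(a, k, wd), bit(b, k, wd)
        # 2:2 multiplexer: direct when ai XOR bi = 1, interchanged otherwise.
        f1, f2 = (cpd, cmd) if ai ^ bi else (cmd, cpd)
        sign_cycle = (k == 0)
        add1 = bool(ai) != sign_cycle             # Add/Sub for the real part follows ai
        add2 = bool(bi) != sign_cycle             # Add/Sub for the imaginary part follows bi
        re = (re + f1 if add1 else re - f1) / 2
        im = (im + f2 if add2 else im - f2) / 2
    return re, im
```

In the sign-bit cycle the add/subtract decision is reversed, because F1(a0, b0) and F2(a0, b0) enter the sums with a negative sign.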
IMPROVED SHIFT-ACCUMULATOR
The last term in the real part (and the same for the imaginary part) of
$$KX = \left\{ -F_1(a_0, b_0)2^{-1} + \sum_{i=1}^{W_d-1} F_1(a_i, b_i)2^{-i-1} + F_1(0, 0)2^{-W_d} \right\} + j\{\text{imaginary part}\}$$
shall be added to the first term in the sum, $F_{W_d-1}$, at the same level of significance. This can be accomplished by initially setting the carry D flip-flops to F(0, 0, ..., 0), as illustrated below, where only the upper part of the shift-accumulator is shown.
[Figure: upper part of the shift-accumulator. ROM outputs f0 to f3 pass through XOR gates (=1) controlled by Add/Sub into a chain of full adders (FA) with D flip-flops; the carry flip-flops are preloaded with F0 to F3; LSP output.]
[Figure: complex multiplier producing AC - BD and AD + BC.]
The decimation-in-frequency radix-2 bit-serial butterfly PE has been implemented in a 0.8-µm standard CMOS process. The PE has a built-in coefficient generator that can generate all twiddle factors in the range 0 to 128, which is sufficient for a 1024-point FFT.
[Figure: block diagram of the butterfly PE with control, Add/Sub, coefficient generator, shimming delays, rounding, complex multiplier, SR flip-flop, Start signal, D flip-flops, and output outIm1.]
The layout using the AMS 0.8-µm double-metal CMOS process is shown below. It is clear that the coefficient generator and the complex multiplier occupy most of the area. The area is 1.47 mm².
[Figure: layout of the butterfly PE.]
The maximal clock frequency at 3 V supply voltage is 133 MHz, with a power consumption of 30 mW (excluding the power consumed by the clock). 80% of the power is consumed in the complex multiplier and 5% in the coefficient generator. The remaining 15% is evenly distributed over the rest of the butterfly. The D flip-flops and the gates at the bottom of the block diagram form the control.
Twiddle Factor PE
Twiddle factors can be generated in several ways: by using a CORDIC PE, via trigonometric formulas, or read from a precomputed table. Here we will use the latter method, that is, a PE that essentially consists of a ROM. We have previously shown that it is possible to use only one $W^p$ PE. However, here it is better to use one for each butterfly PE, because the required chip area for a ROM is relatively small. If only one ROM were used, it would have been necessary to use long bit-parallel buses, which are costly in terms of area, to distribute the twiddle factors to the butterfly PEs. The values of the twiddle factors, W, are spaced uniformly around the unit circle. Generally, there are N twiddle factors, but it is possible to reduce the number of unique values by exploiting symmetries in the trigonometric functions. In fact, it can be shown that only N/8 + 1 coefficient values need be stored in a table.
Instead of storing $W^p$, we store the values
$$\frac{C(\alpha) + S(\alpha)}{2} = \frac{1}{2}(\cos(\alpha) - \sin(\alpha)) = \frac{1}{\sqrt{2}} \sin\left(\frac{\pi}{4} - \alpha\right)$$
$$\frac{C(\alpha) - S(\alpha)}{2} = \frac{1}{2}(\cos(\alpha) + \sin(\alpha)) = \frac{1}{\sqrt{2}} \sin\left(\frac{\pi}{4} + \alpha\right)$$
where $\alpha = 2\pi p/N$. The twiddle factors in the eight octants can be expressed in terms of the twiddle factors in the range 0 to π/4.

Octant   α                    β            (C + S)/2
0        0 ≤ α ≤ π/4          α            (1/√2) sin(π/4 - β)
1        π/4 ≤ α ≤ 2π/4       α - π/4      -(1/√2) sin(β)
2        2π/4 ≤ α ≤ 3π/4      α - 2π/4     -(1/√2) cos(π/4 - β)
3        3π/4 ≤ α ≤ 4π/4      α - 3π/4     -(1/√2) cos(β)
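The octant table can be checked numerically. The following sketch assumes, for illustration only, that the table holds sin and cos for the N/8 + 1 angles 2πq/N with q = 0, ..., N/8, and it evaluates (C + S)/2 for the first four octants from those entries alone. The names `make_tables` and `half_sum_from_table` are hypothetical, and floating-point arithmetic stands in for the ROM words.

```python
import math

def make_tables(n):
    """sin and cos for the n/8 + 1 angles in the range 0 to pi/4."""
    eighth = n // 8
    sin_tab = [math.sin(2 * math.pi * q / n) for q in range(eighth + 1)]
    cos_tab = [math.cos(2 * math.pi * q / n) for q in range(eighth + 1)]
    return sin_tab, cos_tab

def half_sum_from_table(p, n, sin_tab, cos_tab):
    """(C + S)/2 = (cos(alpha) - sin(alpha))/2 for alpha = 2*pi*p/n, 0 <= p < n/2."""
    eighth = n // 8
    octant, q = divmod(p, eighth)          # beta = 2*pi*q/n lies in [0, pi/4)
    r = 1 / math.sqrt(2)
    if octant == 0:
        return r * sin_tab[eighth - q]     # (1/sqrt 2) sin(pi/4 - beta)
    if octant == 1:
        return -r * sin_tab[q]             # -(1/sqrt 2) sin(beta)
    if octant == 2:
        return -r * cos_tab[eighth - q]    # -(1/sqrt 2) cos(pi/4 - beta)
    return -r * cos_tab[q]                 # -(1/sqrt 2) cos(beta)
```

For N = 1024 each table has 129 entries, which corresponds to the range 0 to 128 mentioned for the coefficient generator.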
[Figure: inputs x(0), x(1), ..., x(N/2-1), x(N/2), ..., x(N-1) feeding a set of PEs; output X(0).]
A 2-D DCT for 16 × 16 pixels can be built using only one 1-D DCT PE, which itself consists of 16 distributed arithmetic units with N = 8. The TSPC-based shift-accumulator can be used to implement a distributed arithmetic unit. The length of the shift-accumulator depends on the word length, WROM, which depends on the coefficients in the vector products. In this case we assume that WROM = Wc + 1 = 12 bits.