Fig. 2. (Continued.) Signal flow graph for the 8 x 8 DCT. (c) Second, third, and fourth stages of the post-addition stages (for n odd).

III. COMPLEXITY ANALYSIS OF THE POST-ADDITION STAGES

For the post-addition stages, let A(N) and B(N), respectively, denote the total number of required additions and the number of additions required in the final stage, and let C(N) denote the number of nodes that do not require butterfly computations in the first log2 N stages. From (13) and (16), we have C(4) = 2 and C(N) = C(N/2) + N/2 for N >= 8, and B(N) = N^2 - 2N. Therefore, A(N) = N^2 log2 N - C(N) + B(N) = N^2 (1 + log2 N) - 3N + 2.

IV. CONCLUSIONS

An index-permutation-based 2-D DCT algorithm has been presented in this correspondence. The succinct derivation of the proposed algorithm makes it easier to describe how one 2-D DCT is mapped into a number of 1-D DCTs. Following the idea of [8], a matrix-form-based systematic expression for the post-addition stages in the proposed algorithm, which may improve the regularity of the structure, is currently under investigation.

REFERENCES

[1] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, and Applications. New York: Academic, 1990.
[2] M. J. Narasimha and A. M. Peterson, "On the computation of the discrete cosine transform," IEEE Trans. Commun., vol. COM-26, pp. 934-936, June 1978.
[3] H. S. Hou, "A fast recursive algorithm for computing the discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 1455-1461, Oct. 1987.
[4] P. Lee and F.-Y. Huang, "Restructured recursive DCT and DST algorithms," IEEE Trans. Signal Processing, vol. 42, pp. 1600-1609, July 1994.
[5] P. Duhamel and C. Guillemot, "Polynomial transform computation of the 2-D DCT," in Proc. ICASSP, Apr. 1990, pp. 1515-1518.
[6] E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform," IEEE Trans. Signal Processing, vol. 40, pp. 2174-2193, Sept. 1992.
[7] N. I. Cho and S. U. Lee, "Fast algorithm and implementation of 2-D DCT," IEEE Trans. Circuits Syst., vol. 38, pp. 297-305, Mar. 1991.
[8] N. I. Cho, I. D. Yun, and S. U. Lee, "On the regular structure for the fast 2-D DCT algorithm," IEEE Trans. Circuits Syst., vol. 40, pp. 259-266, Apr. 1993.

Abstract—The memory organization of FFT processors is considered. The new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations. The advantage of this memory addressing scheme lies in the fact that it reduces the delay of address generation by nearly half compared with existing schemes.

Manuscript received May 23, 1997; revised July 22, 1998. This work was supported by the National Key Project of Fundamental Research, P. R. China. The associate editor coordinating the review of this paper and approving it for publication was Dr. Elias S. Manolakos. The author was with the Center for High Performance Computing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China. He is now with the Department of Electrical Engineering, Linköping University, Linköping, Sweden. Publisher Item Identifier S 1053-587X(99)00736-9.

I. INTRODUCTION

Many high-speed FFT processors have been obtained by implementing the fast Fourier transform in pipelined digital hardware, with a butterfly calculation unit, two-port data memory, a ROM for storing twiddle factors, and a memory addressing controller integrated on a chip. It is possible to use an in-place strategy that stores the butterfly outputs in the memory locations used by the inputs to the same butterfly. The in-place strategy requires only the minimum amount of memory. For this reason, only the in-place radix-2 decimation-in-time version of the fast Fourier transform is considered here. If the butterfly unit has parallel inputs and outputs, then two butterfly inputs must be read from memory and two butterfly outputs must be written back to the same memory in each cycle. To avoid this memory bottleneck, the two-port memory module
is always divided into two separate banks so that two data values can be read and two data values written on each memory cycle. Pease [4] observed that the addresses of the two butterfly inputs differ in their parity, so the memory can be divided into two banks: one holding the addresses with even parity and the other those with odd parity. Based on this observation, Cohen [1] proposed simplified control logic for radix-2 FFT processors. Johnson [2] arranged the memory modules in a radix-r FFT processor in a similar way. The drawback of these strategies is that the address parity must be calculated before the memory is accessed, and the delay of the parity calculation is large. Sinha et al. [5] presented an interesting approach to memory assignment that uses the well-known triple-loop control structure; as a result, the address generation in their approach is complicated, and its delay is large. The processing speed, or throughput, of an FFT processor is dominated by its pipeline cycle, i.e., its clock rate. The disadvantage of the existing methods [1], [5] is that their address generation delays are close to that of a carry-lookahead adder, which limits the clock rate of the butterfly unit for large transforms. The objective of our work is to shorten the address generation delay, which helps to increase the clock rate at which FFT processors can operate and therefore makes it possible to improve the performance of FFT processors.
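To illustrate the parity-based banking just described, the sketch below (Python; the example addresses are illustrative, not taken from the paper) selects a bank from the parity of a data address. Because the two inputs of a radix-2 butterfly differ in exactly one address bit (see Section II), their parities always differ, so the two reads never fall in the same bank; the cost is that the parity must be computed before every access.

    # Parity-based bank selection as in Pease [4] and Cohen [1]:
    # even-parity addresses go to bank 0, odd-parity addresses to bank 1.
    def parity(addr):
        p = 0
        while addr:
            p ^= addr & 1
            addr >>= 1
        return p

    # Example butterfly input addresses; they differ in exactly one bit,
    # so their parities (and hence their banks) always differ.
    s, t = 0b0110, 0b0111
    print(parity(s), parity(t))   # 0 1 -> different banks, conflict-free read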
Fig. 1. Signal flow graph of the 16-point FFT algorithm.
II. RADIX-2 FFT ALGORITHM AND ADDRESS GENERATION FOR BUTTERFLY OPERATIONS

The N-point discrete Fourier transform is defined by

    X_k = sum_{m=0}^{N-1} x_m W_N^{mk}                                    (1)

where W_N = e^{-j(2 pi / N)} and k = 0, 1, ..., N - 1. The radix-2 FFT is an efficient way to compute an N-point DFT and is shown in Fig. 1 for N = 16. We assume that the inputs are arranged in bit-reversed order and that the outputs are produced in normal order. Define X_k(-1) = x_m with k = rev(m) (the bit-reversed number corresponding to m). The butterfly calculations at pass p, which are shown in Fig. 1, are expressed by

    X_s(p) = X_s(p-1) + X_t(p-1) W_N^c                                    (2)
    X_t(p) = X_s(p-1) - X_t(p-1) W_N^c                                    (3)

where

    s = l + g 2^{p+1}                                                     (4)
    t = s + 2^p                                                           (5)
    c = l 2^{n-1-p}

with l = 0, 1, ..., 2^p - 1; g = 0, 1, ..., N/2^{p+1} - 1; p = 0, 1, ..., n - 1; and N = 2^n.
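To make (1)-(5) concrete, the following minimal software model (Python; the loop structure and variable names are ours and are not meant to reflect the paper's hardware) performs the in-place radix-2 DIT FFT on bit-reversed input and checks the result against a library FFT.

    import numpy as np

    def bit_reverse(m, nbits):
        r = 0
        for _ in range(nbits):
            r = (r << 1) | (m & 1)
            m >>= 1
        return r

    def radix2_dit_fft(x):
        """In-place radix-2 DIT FFT per (1)-(5): bit-reversed input, normal-order output."""
        N = len(x)
        n = N.bit_length() - 1                      # N = 2^n
        W = np.exp(-2j * np.pi / N)                 # W_N
        # X_k(-1) = x_m with k = rev(m), i.e., X[k] = x[rev(k)]
        X = np.array([x[bit_reverse(k, n)] for k in range(N)], dtype=complex)
        for p in range(n):                          # passes p = 0, ..., n-1
            for g in range(N >> (p + 1)):           # g = 0, ..., N/2^(p+1) - 1
                for l in range(1 << p):             # l = 0, ..., 2^p - 1
                    s = l + g * (1 << (p + 1))      # eq. (4)
                    t = s + (1 << p)                # eq. (5)
                    assert s ^ t == 1 << p          # s and t differ only in bit p
                    c = l << (n - 1 - p)            # twiddle exponent c = l * 2^(n-1-p)
                    u, v = X[s], X[t] * W ** c
                    X[s], X[t] = u + v, u - v       # eqs. (2) and (3)
        return X

    x = np.random.randn(16) + 1j * np.random.randn(16)
    print(np.allclose(radix2_dit_fft(x), np.fft.fft(x)))   # True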
Cohen [1] proposed a simple approach to generating s and t, the two input addresses of the butterfly operations. Denote by b and p the butterfly and pass indices, respectively; their bitwidths are (n - 1) and ceil(log2 n). Denote by RL(x, i) the rotation of x left over i bits. With these notations, the two input addresses of the butterfly operations are given by

    s = RL(2b, p)                                                         (11)
    t = RL(2b + 1, p)                                                     (12)

Note that the twiddle factor address can be generated by an operation similar to a shift-left operation. Since this address generation logic is very simple and its delay is only half that of the data address generation, the twiddle factor address generation is not discussed further.

III. AN EFFECTIVE CONFLICT-FREE MEMORY ADDRESSING ASSIGNMENT

A. Address Generation for Parallel Data Access

Equations (4) and (5) indicate that the addresses s and t differ only in the pth bit at pass p. Let the binary notation of s at pass p be s = s_{n-1} s_{n-2} ··· s_{p+1} 0 s_{p-1} ··· s_1 s_0. Based on (4) and (5), we then have t = s_{n-1} s_{n-2} ··· s_{p+1} 1 s_{p-1} ··· s_1 s_0. Denote s_r = t_r = s_{n-1} s_{n-2} ··· s_{p+1} s_{p-1} ··· s_1 s_0. We place X_s(p-1) and X_t(p-1) at addresses s_r and t_r of memory banks M0 and M1, respectively. This assignment allows the pair of butterfly inputs X_s(p-1) and X_t(p-1) to be accessed in a conflict-free manner. Furthermore, the memory banks that store X_s(p) and X_t(p) are determined by the value of s_{p+1} (s_{p+1} = b_0). This means that the pair of butterfly outputs at pass p, X_s(p) and X_t(p), are located in the same memory bank. With this assignment, the two inputs of the butterfly operations at pass (p + 1) can be accessed simultaneously. Based on the above insight, we propose a memory addressing assignment for the inputs and outputs of the butterfly operations, which is shown in Table I. In Table I, the butterfly counter and the pass counter are denoted by b and p, respectively, and the binary notation of b is b = b_{n-2} b_{n-3} ··· b_1 b_0. Define A_s = b_{n-2} b_{n-3} ··· b_1 0 and A_t = b_{n-2} b_{n-3} ··· b_1 1. The address generation logic for read and write operations is depicted in Figs. 2 and 3, respectively.
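The sketch below (Python) spells out these definitions. How s_r, t_r, A_s, and A_t are routed onto the two banks and the read/write ports is specified by Table I and Figs. 2 and 3, which are not reproduced here, so only the relations stated in the text are modeled, and the example values are illustrative.

    def delete_bit(x, p):
        """s_r = t_r: the address s with its pth bit deleted."""
        low = x & ((1 << p) - 1)
        high = x >> (p + 1)
        return (high << p) | low

    def input_placement(s, t, p):
        """Section III-A: X_s(p-1) goes to bank M0 and X_t(p-1) to bank M1, both at address s_r."""
        assert s ^ t == 1 << p               # s and t differ only in bit p
        s_r = delete_bit(s, p)
        return ("M0", s_r), ("M1", s_r)

    def counter_addresses(b):
        """A_s and A_t: the butterfly counter b with its LSB forced to 0 and 1, respectively."""
        return b & ~1, b | 1

    # Example: pass p = 1 with l = 1, g = 2 gives s = 9, t = 11 by (4) and (5).
    print(input_placement(9, 11, 1))         # (('M0', 5), ('M1', 5))
    print(counter_addresses(0b0110))         # (6, 7)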
B. The Properties of Our Memory Addressing Assignment

We now analyze the properties of our strategy to ensure its correctness.

1) Conflict-Free Addressing for Reads: The insight mentioned above shows that X_s(p-1) and X_t(p-1) can be accessed in parallel.
TABLE II  ADDRESS ASSIGNMENT FOR THE SAMPLED DATA X_k(-1), k = k_{n-1} k_{n-2} ··· k_1 k_0
Fig. 2. Address generation for reads.

2) Conflict-Free Addressing for Writes: We know that the outputs of the butterfly operations at pass p, X_s(p) and X_t(p), belong to the same memory bank. At the end of a butterfly calculation, X_s(p) is written to its memory bank, whereas X_t(p) is stored in a register and is written to the same memory bank in the next memory cycle. Table I shows that if the outputs of one butterfly calculation belong to memory bank 0, then the outputs of the next butterfly calculation belong to memory bank 1. Therefore, the outputs of two adjacent butterfly operations can be written simultaneously.

3) Conflict-Free Addressing for Reads and Writes Simultaneously: Let the input and output addresses of one butterfly operation be s_r, t_r, s_w, and t_w, respectively, and let the input and output addresses of the next butterfly operation be s'_r, t'_r, s'_w, and t'_w, respectively. Assume that the value of the butterfly counter for one butterfly operation is given by b = b_{n-2} b_{n-3} ··· b_1 b_0 with b_0 = 0 and that the counter for the next butterfly operation is b' = b_{n-2} b_{n-3} ··· b_1 1. We see that X_s(p) and X_t(p) are written back to the pair of locations in which X_s(p-1) and X_t(p-1) reside; in particular, X_t(p) overwrites the location of X_s(p-1), and X_s(p) overwrites the location of X_t(p-1). Fortunately, since memory reads, butterfly calculations, and memory writes form a pipeline when computing an FFT, X_s(p-1) has already been loaded into the butterfly unit before X_t(p) is written to its location. Therefore, reading the inputs of one butterfly and writing the outputs of the previous butterfly can be performed without conflict.

Based on our strategy presented in the previous subsection, we obtain the memory addressing assignment given below.

    Memory Banks: M0 M1 M0 M0 / M0 M1 M1 M1 (table fragment)

IV. MEMORY ADDRESSING FOR THE INPUT/OUTPUT OF FFT PROCESSORS

A. Memory Addressing for Input

When the sampled data X_k(-1) [see Section II; X_k(-1) = x_m with k = rev(m)] are loaded into an FFT processor, the data should be placed in such a way that the inputs of the butterfly operations at the zeroth pass (p = 0) can be accessed concurrently. We choose the memory bank in which to place each sampled datum according to the least significant bit of its index, as shown in Table II.

B. Memory Addressing for Output

Based on our strategy, the memory banks storing the final results are determined according to the least significant bit of their indices, in the same way as the assignment for the input described in the previous subsection. However, the address within each memory bank must be modified. The locations of the FFT outputs are shown in Table III.
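As an illustration of the input placement of Section IV-A, the sketch below (Python) loads the bit-reversed samples X_k(-1) into the two banks according to the least significant bit of the index k and checks that the two inputs of every butterfly at pass 0 (t = s + 1 by (4) and (5)) fall in different banks. The within-bank address k >> 1 follows the bit-deletion rule of Section III-A and is our inference; the exact assignment is given by Table II, which is not reproduced here.

    def load_input(samples_bitrev):             # samples_bitrev[k] = X_k(-1)
        M0, M1 = {}, {}
        for k, val in enumerate(samples_bitrev):
            bank = M1 if (k & 1) else M0        # least significant bit selects the bank
            bank[k >> 1] = val                  # delete bit 0 to form the in-bank address (inferred)
        return M0, M1

    N = 16
    M0, M1 = load_input(list(range(N)))         # dummy sample values
    for s in range(0, N, 2):                    # pass-0 butterflies: (s, t) = (s, s + 1)
        t = s + 1
        assert (s & 1) != (t & 1)               # the two inputs lie in different banks
        a, b = M0[s >> 1], M1[t >> 1]           # one read from each bank per cycle
    print("pass-0 butterfly reads are conflict-free")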
TABLE IV  TIME DELAY COMPARISONS OF COHEN'S AND SINHA'S DESIGNS [1], [5] AND OURS

TABLE V  HARDWARE COMPLEXITY COMPARISONS OF COHEN'S AND SINHA'S DESIGNS [1], [5], AND OURS

V. TIME DELAY AND HARDWARE COMPLEXITY COMPARISONS

A. Delay Analysis and Comparisons of Our Design and the Designs of [1] and [5]

It is reasonable to assume that the delays of some basic circuits are as follows:

    Types of Circuits                           Delay
    Exclusive OR (T_xor)                        2 T_and
    Multiplexer (T_mux)                         2 T_and
    n-bit Barrel Shifter (T_rl)                 T_and ceil(log2 n) + T_and
    n-bit Carry Lookahead Adder (T_adder)       2 T_and floor(log2(n - 1)) + 4 T_and

In Cohen's scheme [1], a cyclic shift of the butterfly counter must be performed, and the address parity must be calculated, before a read or write can take place. Taking the two operations of address and data multiplexing (interchanges) into account, the total delays of address generation for reads and writes are given by

    T_read = T_write = max{T_rl, T_parity} + 2 T_mux = 2 T_and ceil(log2(n - 1)) + 4 T_and.

Sinha et al. [5] used a control method with three separate iteration loops to compute the fast Fourier transform. An addition must be performed in the address generation (see [5, step 18, p. 22]). Taking into consideration the branch operation implemented with a multiplexer (see [5, steps 14 and 15, p. 22]), the delay of address generation is

    T_read = T_adder + T_mux = 2 T_and floor(log2(n - 1)) + 6 T_and.

For our scheme, the delay of address generation for reads is dominated by the delay of the cyclic shift operation, which is given by

    T_read = T_rl = T_and ceil(log2 n) + T_and.

Since storing X_t(p) in a register adds no additional delay to the address generation, and the register access time of a write operation is less than that of the memory, the delay of address generation for writes is given by

    T_write = T_and ceil(log2 n) + 3 T_and.

The time delay comparisons of Cohen's and Sinha's designs [1], [5] and ours are summarized in Table IV. Compared with [1] and [5], our scheme reduces the delay of address generation by nearly half.
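A quick numerical check of these delay expressions is given below (Python; the values are in units of T_and, n = 10 corresponds to the 1024-point comparison used later, and the ceiling/floor placement follows the reconstruction above).

    from math import ceil, floor, log2

    def address_generation_delays(n):
        cohen = 2 * ceil(log2(n - 1)) + 4       # [1]: max(T_rl, T_parity) + 2 T_mux
        sinha = 2 * floor(log2(n - 1)) + 6      # [5]: T_adder + T_mux
        ours_read = ceil(log2(n)) + 1           # T_rl
        ours_write = ceil(log2(n)) + 3          # T_rl plus one multiplexing level
        return cohen, sinha, ours_read, ours_write

    print(address_generation_delays(10))        # (12, 12, 5, 7): roughly half the delay of [1], [5]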
B. Hardware Complexity Comparisons

The size of each address generation circuit is approximated to first order by the number of gates and transistors. The actual area required for a given circuit depends on the types of gates, the number of gates, and the amount of wiring area, but the relative sizes are consistent with the gate counts. In addition, the wiring complexity of our design is comparable with that of Cohen's design [1]. To simplify the analysis and comparisons, we use the gate count and transistor count as the hardware complexity measure. The complexity comparisons of the scheme in this correspondence and the other two schemes are shown in Table V. The comparisons are based on 1024-point FFT processors with a data representation of 20 bits (corresponding to a 40-bit representation for a complex datum). In addition, the address generation circuits for reads and writes are provided separately in our and Cohen's schemes. We assume that in CMOS technology the sizes of some basic gates and circuits are as follows:

    Types of Gates and Circuits         Number of Transistors
    2-input XOR                         10
    2-input Multiplexer                 6
    1-bit Register/Latch                10
    9-bit Counter                       182
    13-bit Counter                      270
    9-bit Barrel Shifter                152
    10-bit Barrel Shifter               168
    9-bit CLA                           354
    10-bit Comparator E                 76
    10-bit Comparator F                 72
We see from Table V that the hardware complexity of our scheme is comparable with that of Cohen's design [1]. This remains true as the bitwidth of the data representation increases (e.g., to 32 bits). Although the hardware complexity of our scheme is higher than that of Sinha's design [5], their FFT processor is not pipelined; if their processor were pipelined, the complexity of their address generation circuit would be higher than ours.

VI. SUMMARY

We have proposed an effective approach to the memory addressing of FFT processors that is simple and suitable for pipelined FFT processor implementations. The analysis and comparisons show that the delay associated with address generation is reduced by nearly half, with hardware complexity equivalent to that of Cohen's design [1]. Two effective memory addressing schemes for the input and output of FFT processors are also given. With our strategy, a powerful FFT processor can be implemented efficiently.

ACKNOWLEDGMENT

The author is grateful to the reviewers for their suggestions and comments, which greatly improved the quality of this paper.
REFERENCES
[1] D. Cohen, "Simplified control of FFT hardware," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 577-579, Dec. 1976.
[2] L. G. Johnson, "Conflict free memory addressing for dedicated FFT hardware," IEEE Trans. Circuits Syst. II, vol. 39, pp. 312-316, May 1992.
[3] T. F. Ngai, M. J. Irwin, and S. Rawat, "Regular, area-time efficient carry lookahead adders," J. Parallel Distrib. Comput., vol. 3, no. 1, pp. 92-105, Mar. 1986.
[4] M. C. Pease, "Organization of large scale Fourier processors," J. Assoc. Comput. Mach., vol. 16, pp. 474-482, July 1969.
[5] B. P. Sinha, J. Dattagupta, and A. Sen, "A cost effective FFT processor using memory segmentation," in Proc. IEEE ISCAS, 1983, vol. 1, pp. 20-23.
Abstract—A low-power architecture for a phase-splitting passband equalizer (PSPE) is proposed in this correspondence. The Hilbert relationship between the in-phase and quadrature-phase equalizers in the PSPE is exploited to develop the proposed architecture. It is shown via analysis and simulations that in a 51.84-Mb/s ATM-LAN environment, the proposed receiver results in 1) a net saving in power if the length of the Hilbert filter is less than 130 and 2) a saving of up to 20% with a degradation in signal-to-noise ratio of less than 0.5 dB.

Index Terms—ATM-LAN, CAP, Hilbert transform, low-power.

Manuscript received April 14, 1997; revised February 19, 1998. This work was supported by the National Science Foundation under CAREER Award MIP-9623737. The associate editor coordinating the review of this paper and approving it for publication was Dr. Elias Manolakos. The authors are with the Coordinated Science Laboratory/Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: [email protected]; [email protected]). Publisher Item Identifier S 1053-587X(99)01359-8.

I. INTRODUCTION

In recent years, the development of low-power devices for applications in communications and DSP has become an active area of research due to the proliferation of mobile communication systems. For this reason, numerous power-reduction techniques have been proposed at the algorithmic level [1]-[4], the architectural level [5], the logic level [6], and the circuit level [1]. These techniques are currently being applied to develop low-power, high-speed transceivers for applications such as the asymmetric digital subscriber loop (ADSL) [7], the high-speed digital subscriber loop [8], and ATM-LAN [9] to achieve high-bit-rate digital communication over bandlimited channels. The transceivers in most of these applications employ some form of adaptive equalization at the receiving end to combat corruption of the transmitted signal by sources of distortion such as intersymbol interference (ISI), crosstalk, and additive noise. In many of these applications, transmission schemes such as quadrature amplitude modulation (QAM) are employed, where the receiver consists of a phase splitter or a Hilbert transformer followed by a complex equalizer operating in either baseband or passband. In the case of passband equalization, a simpler structure called the phase-splitting passband equalizer (PSPE) [10] can be employed, in which the functions of the phase splitter and the equalizer are combined. The carrierless AM/PM (CAP) transmission scheme has been chosen by the Technical Committee of the ATM Forum as the ATM-LAN physical-layer interface standard at 51.84 Mb/s over category-3 unshielded twisted-pair (UTP) wiring [9]. In CAP transceivers, a PSPE is employed at the receiving end; this receiver consists of a parallel arrangement of two adaptive equalizers. In this correspondence, we propose a low-power architecture for the PSPE employed in a CAP transceiver by exploiting the Hilbert relationship between the optimum solutions of the receive filters.

The rest of this correspondence is organized as follows. In the next section, we describe the generic CAP transceiver scheme. In Section III, we present the proposed receiver architecture and analyze its properties. In Section IV, we show, via analysis and simulation results, that the proposed architecture results in considerable power savings in an ATM-LAN environment with marginal degradation in performance.

II. THE CAP TRANSMISSION SCHEME
The block diagram of the generic CAP transmitter is shown in Fig. 1(a). The bit stream to be transmitted is passed through a scrambler in order to randomize the data and is then fed into an encoder. The encoder maps a block of m bits into one of k = 2^m unique complex symbols S_n = r_n + j q_n in a k-CAP scheme. In the 16-CAP scheme described here, we have m = 4 and k = 16. The impulse responses of the shaping filters p(kT') and p̃(kT') are given by p(kT') = g(kT') cos(2 pi f_c kT') and p̃(kT') = g(kT') sin(2 pi f_c kT'), where g(kT') = g(t)|_{t = kT'} is the baseband pulse and f_c is a frequency greater than the largest frequency component in g(t). Note that the shaping filter impulse responses form a Hilbert pair [11], i.e., p̃(n) is the convolution of p(n) with h(n), where

    h(n) = 2 sin^2(pi n / 2) / (pi n)    for n != 0
         = 0                             for n = 0                        (1)

Due to the Hilbert relationship, the magnitude response of p̃(n) is the same as that of p(n), but the phase response of p̃(n) is shifted by +90 and -90 degrees in the positive- and negative-frequency regions, respectively. The CAP receiver, shown in Fig. 1(b), consists of an analog-to-digital (A/D) converter operating at sampling frequency 1/T', followed by two adaptive digital filters in parallel [10], which also operate at the sampling frequency 1/T' = K/T, where K is the oversampling factor and T is the symbol period. In the 16-CAP scheme, we have K = 4. These filters form the in-phase and quadrature-phase equalizers. The filter (F) block in each equalizer consists of an FIR filter whose coefficients are computed recursively in the weight-update (WUD) block using the popular least-mean-squares (LMS) algorithm [12]. This algorithm minimizes the mean squared error (MSE)

    MSE = <|e(n)|^2>                                                      (2)

where <·> denotes expectation and e(n) is the error at the equalizer output. The performance measure for the receiver in this equalization scheme is the signal-to-noise ratio at the equalizer output, given by [13]

    SNR_o = 10 log10(sigma_d^2 / MSE)                                     (3)

where sigma_d^2 is the variance of the transmitted symbols.
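A short numerical check of the Hilbert-pair property of the shaping filters is given below (Python). The baseband pulse g (a Hamming-windowed sinc), its bandwidth, and the value of f_c are illustrative assumptions; the text only requires f_c to exceed the highest frequency component of g(t).

    import numpy as np

    K = 4                              # oversampling factor (K = 4 in 16-CAP)
    Tp = 1.0 / K                       # T' = T/K with symbol period T = 1 (assumed units)
    fc = 0.6                           # carrier above the band of g (assumed value)
    k = np.arange(-64, 65)
    g = np.sinc(0.2 * k) * np.hamming(len(k))     # assumed bandlimited baseband pulse
    p = g * np.cos(2 * np.pi * fc * k * Tp)       # in-phase shaping filter p(kT')
    pt = g * np.sin(2 * np.pi * fc * k * Tp)      # quadrature shaping filter p~(kT')

    P, Pt = np.fft.rfft(p, 4096), np.fft.rfft(pt, 4096)
    # Equal magnitude responses (phases differ by +/-90 degrees): the ratio below is small.
    print(np.max(np.abs(np.abs(P) - np.abs(Pt))) / np.max(np.abs(P)))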