Area-Efficient Architecture For Fast Fourier Transform


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 46, NO. 2, FEBRUARY 1999, p. 187

Area-Efficient Architecture for Fast Fourier Transform


José Antonio Hidalgo, Juan López, Francisco Argüello, and Emilio L. Zapata

Abstract—We present an area-efficient parallel architecture that implements the constant-geometry, in-place Fast Fourier Transform. It consists of a specific-purpose processor array interconnected by means of a perfect unshuffle network. For a radix r transform of N = r^n data of size D and a column of P = r^p processors, each processor has only one local memory of N/(rP) words of size rD, with only one read port and one write port that, nevertheless, make it possible to read the r inputs of a butterfly and write r intermediate results in each memory cycle. The address-generating circuit that permits the in-place implementation is simple and the same for all the local memories. The data flow has been designed to efficiently exploit the pipelining of the processing section with no cycle loss. This architecture reduces the area by almost 50% with respect to other designs of similar performance.

I. INTRODUCTION

The efficiency of specific VLSI architectures is mainly based on the selection of the algorithm to be implemented. For FFT architectures, there exists a variety of algorithms which differ in the way the intermediate results are stored and the computations are carried out. These factors determine the organization of each processor, the memory requirements, and the interconnection network among processors. Therefore, in order to obtain a design that occupies a small area, the selected algorithm should minimize, mainly, the size of the memory, the number of communication lines, and the area of the control unit. Prime-factor and split-radix FFT algorithms must be discarded because their computation patterns vary from stage to stage, which would result in a complex control section occupying a great amount of silicon area. Furthermore, for split-radix algorithms, inefficient use of the processing section and load imbalance among processors may occur because different types of butterflies must be implemented in each stage. In-place algorithms minimize memory requirements by writing the partial results over the data used for their calculation. In-order algorithms avoid bit-reversal permutation of the data sequence, which is automatically carried out during transform execution. Finally, constant-geometry algorithms present the same communication pattern in every stage, so they require a simpler interconnection network than in-place and in-order algorithms. All three of these classes compute simple and regular butterflies. In classical, sequential FFT implementations, the number of read/write accesses needed to perform the FFT creates a bottleneck: an N-point radix-r FFT computation requires (N/r) log_r N read/write RAM accesses. Most solutions propose either increasing the radix to reduce the overall number of accesses at the cost

Manuscript received March 7, 1996; revised November 24, 1997. This work was supported in part by the Ministry of Education and Science (CYCIT) of Spain under Contract TIC96-1125-CO3, Xunta de Galicia XUGA 20605B96, and EC project BRPR-CT96-01070. This paper was recommended by Associate Editor S. Goto. J. A. Hidalgo is with the Departamento de Electrónica, Universidad de Málaga, 29071 Málaga, Spain. J. López and E. L. Zapata are with the Departamento de Arquitectura de Computadores, Universidad de Málaga, 29080 Málaga, Spain. F. Argüello is with the Departamento de Electrónica y Computación, Universidad de Santiago de Compostela, 15706 Santiago de Compostela, Spain. Publisher Item Identifier S 1057-7130(99)01757-7.

of increased arithmetic complexity [1], [2], or partitioning the memory into r banks that are accessed simultaneously, at the cost of complex addressing and a larger area [3]. Another approach is restructuring the memory in order to reduce the number of accesses without increasing complexity [4]. An optimal design would follow from an in-place algorithm (to minimize memory size and provide regular butterflies); the design would also require a simple interconnection network and a memory structured so that the FFT can be computed with a reduced number of accesses. In this work we propose the design of a parallel architecture based on the constant-geometry FFT algorithm (CGFFT), which implements the in-place transform, minimizing memory requirements and optimizing the efficiency and simplicity of communications. The design is obtained by using the methodology proposed in [5], expressing each stage of the CGFFT algorithm as a string of operators that are easy to translate into hardware. This string of operators determines the internal structure of the processors and the interconnection network. The parallel architecture consists of an array of specific-purpose processors interconnected with local memories by means of a perfect unshuffle network. Each processor has only one local memory of N/(rP) words of size rD (for a radix r transform of N data of size D and an array of P processors), with only one read port and one write port, and very little additional storage. The control unit permits the in-place implementation in a simple way, and it is the same for all the local memories. The data flow has been designed to efficiently exploit the pipelining of the processing section with no cycle loss. Furthermore, it permits partitioning the computations among an arbitrary number of processors in such a way that data are recirculated, thus optimizing both communications and the use of processors.
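To make the access bottleneck concrete, the counts above can be tabulated directly. The helper below is ours, not the paper's; the access total assumes each butterfly performs r operand reads and r result writes:

```python
from math import log

def fft_ram_accesses(N, r):
    """An N-point radix-r FFT has (N/r)*log_r(N) butterflies; with r reads
    and r writes per butterfly, a conventional single-port memory sees
    2*N*log_r(N) individual data accesses."""
    butterflies = (N // r) * round(log(N, r))
    return butterflies, 2 * r * butterflies

# e.g. a 1024-point radix-2 transform
b, a = fft_ram_accesses(1024, 2)
```

For N = 1024 and r = 2 this gives 5120 butterflies and 20480 accesses, which is why the paper's single-read-port/single-write-port word-of-rD memory matters.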
In recent literature, several parallel designs that implement the CGFFT algorithm have been proposed, but they do not exploit all the natural characteristics of this algorithm. In [5] the proposed parallel architecture implements the intermediate shuffles by means of two FIFO queues of size N/P in each processor. In [6] a processor array is proposed that uses independent shuffle connections along the dimensional axes of the array, without requiring all the processors to have shuffle connectivity, although this increases the complexity of the design. In [7] an efficient design is presented, but the memory is partitioned and multiport elements are used, which limits the usable radix to 4. The methodology we use is simpler and clearer than those based on Kronecker products [6], [7]. Furthermore, it can be applied to regularize the data flow of groups of algorithms and thus treat them in a unified way. In [8] it has been applied to the design of specific architectures for Fast Fourier and Hartley transforms, and in [9] to the design of a unified architecture for tridiagonal algorithms. We have organized the rest of this work as follows. In Section II, we define several operators that we will use to describe each stage of the constant-geometry FFT algorithm. In Section III, we obtain the design of a specific-purpose processor for the sequential computation of the FFT. In Section IV, we present some relationships and decompositions into operators, obtain the appropriate parallel architecture for the computation of the FFT, and evaluate the efficiency of our implementation. Finally, in Section V, we establish our conclusions and suggest further work.

1057-7130/99$10.00 © 1999 IEEE

II. FAST FOURIER TRANSFORM WITH CONSTANT GEOMETRY

In this section we define the basic operators to be used in the rest of the sections, review basic ideas about the FFT with constant geometry, and describe each stage of the algorithm as an operator string.

A. Basic Operators
We will consider sequences of data y(i), 0 ≤ i < N = r^n, r = 2^m, and assume that i_n, ..., i_1 are the digits of the base-r representation of i. In the following, we will denote by S the ordered sequence of data.

Definition 1: The butterfly operator B transforms a sequence S into another S' of the same length. It performs the FFT butterfly operations over each r-data subsequence of S whose indices differ in the least significant digit. Each butterfly accepts r input data and produces r outputs.

Definition 2: The perfect unshuffle (shuffle) operator Γ (σ) transforms a sequence S into another sequence S' of the same length. It performs a cyclic rotation of order 1 to the right (left) on the r-ary representation of each element's index:

Γ(i) = [i_1, i_n, ..., i_2]   (1)
σ(i) = [i_{n-1}, ..., i_1, i_n].   (2)

In some cases, it will be convenient to divide the base-r representation of the indices into several ordered fields of digits, i = (α, β, γ), with α = [α_a, ..., α_1], β = [β_b, ..., β_1], γ = [γ_c, ..., γ_1], and a + b + c equal to n. In this case, we can define the restriction of operator Γ (σ) to some of the fields, which rotates only the digits of those fields and leaves the remaining fields unchanged, e.g.,

Γ_α(α, β, γ) = [α_1, α_a, ..., α_2][β][γ]   (3)
σ_γ(α, β, γ) = [α][β][γ_{c-1}, ..., γ_1, γ_c].   (4)

Definition 3: The exchange operator ε_{α(i),γ(j)} exchanges the ith digit of field α with the jth digit of field γ. As an example,

ε_{α(1),γ(1)}(α, β, γ) = [α_a, ..., α_2, γ_1][β][γ_c, ..., γ_2, α_1].   (5)

Hereafter, the operator ε_{α(1),γ(1)} will be denoted by ε for simplicity. We consider that operator strings are applied from left to right. The notation ε^t indicates that operator ε is applied t times.

Lemma 1: The perfect unshuffle operator factorizes as a product of exchange operators

Γ = ∏_{i=1}^{n-1} ε_{(i+1),(i)}   (6)

where n > 1. The proof is obtained by applying the definitions of both operators.

B. Constant-Geometry FFT Algorithms

The central role in constant-geometry algorithms is played by the perfect unshuffle and perfect shuffle permutations. The basic idea, due to Pease [10], is that a subsequence of r elements of S whose indices differ only in their tth digit (i.e., which lie at a distance r^{t-1}, t > 0) is moved by the perfect unshuffle operator to a distance of r^{t-2}. Consequently, if we carry out a Γ permutation of the output sequence in each stage of the ascending FFT, a subsequence of r elements initially at a distance of r^{t-1} will occupy consecutive positions at the beginning of the tth stage. In this way, the inputs to the butterflies are at distance 1 at each stage. So, each stage consists in applying the operator string BΓ to its input sequence. We can conclude that this algorithm (CGFFT hereafter) consists in applying the operator string (BΓ)^n to the initial input sequence.

Let us consider the sequence {y^t(i) | 0 ≤ i ≤ N − 1} of initial data at the tth stage. Because of the unshuffling carried out in each step, y^t(i) will be placed at position Γ^{t-1}(i); so, given that Γ and σ are inverse operators, the input sequence at the tth stage will be S_t = {y^t(σ^{t-1}(i)) | 0 ≤ i ≤ N − 1}. In order to better view the data evolution, we will assume that the input sequence at stage t is distributed as a matrix F_t with two rows and 2^{n-1} columns (see Fig. 1 for N = 16, t = 2, r = 2). The F_t rows are numbered 0 and 1 from top to bottom, and the columns 0 to 2^{n-1} − 1 from right to left. From a computational point of view, the location (column, row) of a data item in matrix F_t can be interpreted as (cycle, bus), cycle being the execution cycle and bus the path through which this data item accesses the processor.

[Fig. 1. The effect of operator string B_{cycle,bus} ε_{cycle,bus} Γ_cycle on the input matrix at stage 2 of a 16-data transform.]
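As a concrete sketch of Definition 2 (our illustration, not from the paper; the helper names are ours), Γ and σ are simply cyclic rotations of the base-r digits of each index:

```python
def unshuffle_index(i, n, r=2):
    """Perfect unshuffle Γ: right cyclic rotation of the n base-r digits of i.
    The least significant digit becomes the most significant one."""
    return i // r + (i % r) * r ** (n - 1)

def shuffle_index(i, n, r=2):
    """Perfect shuffle σ: left cyclic rotation of the digits, the inverse of Γ."""
    return (i % r ** (n - 1)) * r + i // r ** (n - 1)

# Γ is a permutation of the index set, and σ undoes it (Γσ = I)
perm = [unshuffle_index(i, 4) for i in range(16)]
```

For n = 4 and r = 2, index 1 maps to 8 under Γ: the low bit moves to the top, which is exactly how the even/odd interleave of a radix-2 stage is undone.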

III. ARCHITECTURES FOR THE IN-PLACE CONSTANT GEOMETRY FFT

The operator string BΓ that defines each stage of the CGFFT algorithm may be decomposed as a new string of operators that are easy to translate into hardware. There are several possible decompositions to choose from. The one chosen is the key point in any design, since it determines the performance of a processor, a column, or an array of processors. In this section, we address the design of a uniprocessor system to compute the in-place CGFFT from a new decomposition of Γ which also permits storing the data in an efficient way. In order to keep the presentation and figures simple, we will develop designs for a radix 2 transform; they are easily extended to any radix.

A. Design of an FFT-Specific Uniprocessor

The computation of the tth stage of the CGFFT algorithm occurs in the following basic steps: read data from memory (read operator R_t), execute some operations over the data (operator B), and write the results to memory (write operator W_t). We will consider that operator R_t gives the read cycle from the read address and, conversely, operator W_t gives the write address from the write cycle. Then, since operator B does not modify the position of the data, the tth stage can be formulated as the operator string R_t B W_t and the CGFFT algorithm as the product

CGFFT = ∏_{t=1}^{n} R_t B W_t = ∏_{t=1}^{n} B W_t R_{t+1}   (7)

if we suppose that the initial input data are read from an external device (i.e., R_1 is not considered) and R_{n+1} = I. Since each stage of the CGFFT algorithm is also defined by the operator string BΓ, we obtain that Γ = W_t R_{t+1}. The simplest definition for the CGFFT read and write operators is R_t = I, W_t = Γ. These are the read and write functions proposed in [5]–[8]. However, these functions do not permit the algorithm to be computed in place, since the results of a stage are written to locations different from those from which the data were read. That happens because the chosen read and write operators are not inverses of each other. This problem is solved by using new read and write functions to carry out the unshuffling permutation. Notice that the read and write operators R_t = Γ^{t-1}, W_t = σ^{t-1} also verify Γ = W_t R_{t+1} and, furthermore, produce an in-place algorithm because R_t W_t = I. Then, the CGFFT is computed in place by performing the string

CGFFT = ∏_{t=1}^{n} B σ^{t-1} Γ^t.   (8)

[Fig. 2. (a) Data flow generated by operator string (B_{cycle,bus} ε_{cycle,bus} Γ_cycle)^4 for a 16-data transform, assuming the PS is not pipelined. (b) Evolution of the processor's memory for a 16-data transform.]
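The two properties behind (8), namely R_t W_t = I (in-place access) and W_t R_{t+1} = Γ (the inter-stage unshuffle), can be checked exhaustively for small n. The following sketch is ours, not the paper's, and works on radix-2 indices:

```python
def gamma(i, n):
    """Perfect unshuffle Γ: right-rotate the n bits of i."""
    return (i >> 1) | ((i & 1) << (n - 1))

def sigma(i, n):
    """Perfect shuffle σ = Γ⁻¹: left-rotate the n bits of i."""
    return ((i << 1) | (i >> (n - 1))) & ((1 << n) - 1)

def power(f, t, i, n):
    """Apply the index map f to i t times (the paper's f^t notation)."""
    for _ in range(t):
        i = f(i, n)
    return i

n = 5
ok = True
for t in range(1, n + 1):
    for i in range(1 << n):
        # R_t W_t = I: Γ^{t-1} followed by σ^{t-1} is the identity
        ok &= power(sigma, t - 1, power(gamma, t - 1, i, n), n) == i
        # W_t R_{t+1} = Γ: σ^{t-1} followed by Γ^t equals a single Γ
        ok &= power(gamma, t, power(sigma, t - 1, i, n), n) == gamma(i, n)
```

Both identities hold for every stage t and every index, which is exactly why the same address can serve the read at cycle j and the write at cycle j.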

A time-efficient processor should provide simultaneous access to read/write r data, so r read ports and r write ports of width D would be necessary. This means a memory with a higher access time, larger area, and higher power consumption, as well as a limitation on the radix [7]. To avoid these overheads, we propose a memory with only one read port and only one write port of size rD, as follows from the next theorem.

Theorem 1: The CGFFT algorithm may be mapped onto a processor that implements the following operator string:

CGFFT = ∏_{t=1}^{n} B_{cycle,bus} ε_{cycle,bus} (σ_cycle)^{t-1} (Γ_cycle)^t.   (9)

Proof: It is sufficient to prove that Γ = ε_{cycle,bus} Γ_cycle. Let us consider the two fields (cycle, bus) = [i_n, ..., i_2][i_1] of an n-digit base-r representation. Thus, ε_{cycle,bus} Γ_cycle [i_n, ..., i_2][i_1] = Γ_cycle [i_n, ..., i_3, i_1][i_2] = [i_1, i_n, ..., i_3][i_2] = Γ [i_n, ..., i_2][i_1].

In every stage t of the transform, we apply the string B_{cycle,bus} ε_{cycle,bus} (σ_cycle)^{t-1} (Γ_cycle)^t to the input matrix F_t. The interpretation of these operators is the following.

The permutation ε_{cycle,bus} exchanges the least significant bits of the cycle and bus data fields. We will denote by G_{t+1} = B_{cycle,bus}(F_t) the matrix of butterfly results at stage t, and by H_{t+1} the matrix ε_{cycle,bus}(G_{t+1}). Observe in Fig. 1 the effect of operator ε_{cycle,bus} on G_3: it rearranges the results of two even–odd butterflies [we mean the two butterflies computed at the 2hth and (2h + 1)th cycles of the stage, 0 ≤ h < N/4] and provides the inputs for two butterflies to be computed at the next stage. Thus, storing each column of H_{t+1} at a memory address whose words hold two data guarantees that the inputs for each butterfly of the next stage can be accessed in one read cycle. It also determines the order in which the writing of the butterfly results must be carried out. Consequently, the (cycle, bus) fields of the indices in matrix H_{t+1} may be interpreted as the data writing cycles and the word segment where they are written, respectively. Operator (σ_cycle)^{t-1} is the write function. It acts on the cycle fields of the indices in matrix H_{t+1} (writing cycles) and provides the writing address in memory. Operator (Γ_cycle)^t is the read function R_{t+1}. Its input is a memory address, and it provides the cycle in which the processor will read that address at the next stage. The action of these two last operators, equivalent to the action of operator Γ_cycle, provides the input matrix F_{t+1} of the (t + 1)th stage by unshuffling the columns of matrix H_{t+1}, as can be observed in Fig. 1. The data flow generated by Theorem 1 is presented in Fig. 2(a), and the memory evolution at the different stages of the algorithm in Fig. 2(b), for the case of a 16-data transform.
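To make the word-of-two-data mechanism concrete, the following sketch (our own illustration, with identity "butterflies" so that only data movement is checked) simulates the single-port memory across all n stages of a radix-2 transform: at stage t, read cycle c and write cycle c both access the address obtained by left-rotating the (n − 1)-bit cycle field by t − 1 positions, and the ε_{cycle,bus} rearrangement pairs the bus-0 (respectively bus-1) outputs of each even–odd butterfly pair into one word:

```python
def rotl(x, t, w):
    """Left-rotate the w-bit value x by t positions (σ on the cycle field)."""
    t %= w
    return ((x << t) | (x >> (w - t))) & ((1 << w) - 1)

n = 3
N = 1 << n
# N/2 words of two segments each; word c initially holds (y(2c), y(2c+1))
mem = [[2 * c, 2 * c + 1] for c in range(N // 2)]

def logical_sequence(mem, stage):
    """S_t: the sequence the butterflies see at `stage`, reading address
    sigma^{stage-1}(c) at read cycle c."""
    seq = []
    for c in range(len(mem)):
        seq.extend(mem[rotl(c, stage - 1, n - 1)])
    return seq

history = [logical_sequence(mem, 1)]
for t in range(1, n + 1):
    out = [word[:] for word in mem]     # snapshot; a real PS interleaves in place
    for h in range(N // 4):
        c0, c1 = 2 * h, 2 * h + 1
        a0 = rotl(c0, t - 1, n - 1)     # read address = write address (in place)
        a1 = rotl(c1, t - 1, n - 1)
        x0, x1 = mem[a0], mem[a1]       # identity "butterflies"
        out[a0] = [x0[0], x1[0]]        # eps_{cycle,bus}: pair the bus-0 outputs
        out[a1] = [x0[1], x1[1]]        # eps_{cycle,bus}: pair the bus-1 outputs
    mem = out
    history.append(logical_sequence(mem, t + 1))
```

After stage 1 the logical sequence is the perfect unshuffle of the input (evens then odds), and after all n stages Γ^n = I brings the labels back to natural order, confirming that each stage realizes exactly one Γ with single-port, in-place accesses.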

B. Hardware Implementation

In this subsection, the hardware implementation of the CGFFT processor defined by Theorem 1 is obtained as follows. Operator B determines the internal structure of the processing section (PS), the hardware implementation of the FFT butterflies. In the radix 2 case, this section has two input buses and two output buses whose width is determined by the input data type (real or complex).

[Fig. 3. Design of an area-efficient processor to compute a radix 2 FFT. PS means the processing section, D the size of the data, and r0, r1, r2 are registers of size D.]

The hardware implementation of operator ε_{cycle,bus} is shown in Fig. 3. It consists of a set of three registers and a mux. Register r0 is located at the output of bus 0 of the PS. Registers r1 and r2 are located at the output of bus 1 and act as a serial-input parallel-output (SIPO) queue of size 2 (see Fig. 3). This circuit works as follows. Registers r0 and r1 store the results of the 2hth butterfly during an execution cycle. In the next cycle, output 0 of the PS section and r0 are stored in memory, and in the following cycle r1 and r2 are stored. The mux, controlled by the least significant bit of counter CC (see Fig. 3), selects the two data to be stored in memory on alternate cycles. We will call the hardware that implements ε_{cycle,bus} Routing Section A (RSA). The read and write functions control the memory operation. Since they produce an in-place algorithm, the address accessed at read cycle j will also be accessed at write cycle j, with a delay determined by the PS (two cycles if the PS section is not pipelined). So, only the read-address generating mechanism needs to be implemented. On the other hand, the control should provide the read address from the read cycle, i.e., the inverse of the read function, (σ_cycle)^{t-1}. The hardware implementation of this operator is made up of two counters and a cyclic bit-rotation circuit, as shown in Fig. 3. A modulo N/2 counter (CC) defines the read cycle of the butterflies. A modulo n counter (SC) defines the stage of the transform and controls the number of positions the CC bits must be rotated in the cyclic rotation circuit (RC). The RC circuit output is the memory address that must be read in the current cycle (indicated by CC). To summarize, the processor designed in this section permits computing an N-data radix 2 transform in (N/2)n + s + 2 cycles, s being the depth of the PS section's pipeline, without any additional storage or idle cycles.

[Fig. 4. Implementation of operator ε_{cycle,bus} for a radix 4 transform.]
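The CC/SC/RC read-address generator described above can be sketched in a few lines; this is our illustration (with a hypothetical helper name), and the rotation direction follows the σ convention used here:

```python
def read_address(cc, sc, n):
    """RC circuit: left-rotate the (n-1)-bit cycle count CC by SC positions,
    where SC = t - 1 at stage t. The result is the word address read (and,
    after the PS delay, written) in this cycle."""
    w = n - 1
    s = sc % w
    return ((cc << s) | (cc >> (w - s))) & ((1 << w) - 1)

# stage 1 (SC = 0): addresses are read in natural order; N = 32, memory of 16 words
addrs_stage1 = [read_address(c, 0, 5) for c in range(16)]
```

The same circuit serves every stage: only the rotation amount, supplied by the stage counter SC, changes, which is why the address generator stays so small.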
The two basic ideas of the design are: 1) rearranging the results of two even–odd butterflies so that the inputs for two butterflies to be computed at the next stage are written to memory together, and 2) the appropriate choice of read and write functions to provide both the in-place and the unshuffling characteristics.
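Idea 1) is just the exchange ε_{cycle,bus} acting on (cycle, bus) indices; as a small sketch of ours (radix 2, with the index laid out as 2·cycle + bus):

```python
def eps_cycle_bus(i):
    """Swap bit 0 (bus) with bit 1 (least significant cycle bit) of index i."""
    b0, b1 = i & 1, (i >> 1) & 1
    return (i & ~3) | (b0 << 1) | b1

# within an even-odd butterfly pair, the bus-1 output of butterfly 2h and the
# bus-0 output of butterfly 2h+1 trade places; bus-0 of 2h and bus-1 of 2h+1 stay
pairs = [(i, eps_cycle_bus(i)) for i in range(4)]
```

Since ε is its own inverse, applying it twice restores every index, matching the fact that it is a pure rewiring (the RSA), not a computation.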

In the general case of a radix r transform, the design is obtained in a similar manner. The PS section has r input buses and r output buses (b_0, ..., b_{r-1}) for the computation of radix r butterflies. The implementation of operator ε_{cycle,bus} consists of a set of 3r(r − 1)/2 registers organized into r SIPO queues connected to the output buses of the PS, as shown in Fig. 4 for r = 4. An SIPO queue of size r + i is connected to bus b_i. The outputs of the registers are directed to a multiplexer that in each cycle selects the suitable r outputs under the control of the last m bits of CC (as each base-r digit has m bits). The memory is organized into N/r words of size rD. The implementation of the read and write functions is similar, but the RC circuit carries out a cyclic rotation of m(t − 1) bits in the tth stage.

IV. PARALLEL ARCHITECTURES

From the analysis of the CGFFT data flow, we deduce that the most appropriate parallel architecture for exploiting its inherent temporal and spatial parallelism is an array of processors (PEs). For simplicity, in Section IV-B we will design a column of PEs and sketch the design of an array in Section IV-C. The methodology used to design both parallel architectures for the radix r CGFFT is based on the decomposition of the permutation operator Γ into a string of elementary operators, which we introduce in Section IV-A.

A. Decomposition into Elementary Operators


We will borrow the notation from Section II-A: i = (α, β, γ), with α = [α_a, ..., α_1], β = [β_b, ..., β_1], γ = [γ_c, ..., γ_1], and a + b + c equal to n.

Definition 6: Partial unshuffle operators

Γ_{α,γ}(α, β, γ) = [γ_1, α_a, ..., α_2][β_b, ..., β_1][α_1, γ_c, ..., γ_2]   (10)
Γ_{β,γ}(α, β, γ) = [α][γ_1, β_b, ..., β_2][β_1, γ_c, ..., γ_2].   (11)

In the following lemma, we will consider that field γ consists of a single digit γ_1.

Lemma 2: The perfect unshuffle operator can be decomposed into two partial unshuffles

Γ = Γ_{α,γ} Γ_{β,γ}.   (12)
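Lemma 2 can be checked exhaustively before proving it; the following sketch (ours, not the paper's) represents indices as digit lists and rotates only the digits of the selected fields:

```python
def digits_of(i, n):
    """Base-2 digits of i, most significant first."""
    return [(i >> k) & 1 for k in range(n - 1, -1, -1)]

def rot_right(ds):
    """Right cyclic rotation of a digit list (Γ on those digits)."""
    return [ds[-1]] + ds[:-1]

def partial_unshuffle(ds, fields):
    """Γ restricted to `fields`: right-rotate only the digits at the selected
    positions, leaving the remaining digits fixed."""
    sel = rot_right([ds[k] for k in fields])
    out = ds[:]
    for k, v in zip(fields, sel):
        out[k] = v
    return out

# i = (alpha, beta, gamma) with a = 2, b = 2, c = 1 digits (n = 5)
n, a, b = 5, 2, 2
alpha, beta, gamma = list(range(a)), list(range(a, a + b)), [n - 1]
ok = True
for i in range(1 << n):
    ds = digits_of(i, n)
    # Lemma 2: Γ = Γ_{alpha,gamma} Γ_{beta,gamma}, operators applied left to right
    lhs = partial_unshuffle(partial_unshuffle(ds, alpha + gamma), beta + gamma)
    ok &= lhs == rot_right(ds)
```

The single-digit γ field is what lets the α_1 digit deposited by the first partial unshuffle be picked up by the second one, reproducing the full rotation.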


[Fig. 5. Data flow generated by operator string B_{cycle,bus} Γ_cycle Γ_{PE,bus} ··· for a 32-data transform processed by a column of 4 processors.]

Proof: Γ_{α,γ} Γ_{β,γ}(α, β, γ) = Γ_{β,γ} [γ_1, α_a, ..., α_2][β][α_1] = [γ_1, α_a, ..., α_2][α_1, β_b, ..., β_2][β_1] = Γ(α, β, γ).

B. Design of an In-Place CGFFT Column of Processors

For a column of P processors (P = r^p, p < n), the matrix F_t of input data in stage t, 1 ≤ t ≤ n, should be distributed among the PEs' local memories so that each PE evaluates r^{n-p-1} butterflies (columns of F_t). We assume that a processor's local memory is the one where it writes the results of its butterflies. From a computational point of view, we are decomposing the indices of the data into three fields: processor, cycle, and bus, with sizes p, n − p − 1, and 1 digits, respectively. Depending on how the columns of matrix F_t are distributed among the PEs, these fields can be ordered in two different ways: (cycle, PE, bus) and (PE, cycle, bus). The first, called cyclic distribution, assigns consecutive butterflies to different PEs, whereas the second, called consecutive distribution, assigns a block of consecutive butterflies to each PE. In the following we will use the cyclic distribution. With this distribution, in the tth stage of the radix r CGFFT, the processor with index μ, PE(μ), operates on a matrix F_{t,μ} of r rows and r^{n-p-1} columns. Fig. 5 depicts the matrices F_{t,μ} and the data flow in each one of the five stages of a radix 2 CGFFT with N = 2^5 inputs, processed by a column of P = 2^2 PEs.

Theorem 2: The data flow of the radix r CGFFT algorithm can be mapped onto a column of P = r^p processors that implement the operator string

CGFFT = ∏_{t=1}^{n} B_{cycle,bus} ε_{cycle,bus} (σ_cycle)^{t-1} (Γ_cycle)^t Γ_{PE,bus}.   (13)

Proof: If we denote the fields α = cycle, β = PE, γ = bus, from Lemma 2 it follows that BΓ = B Γ_{cycle,bus} Γ_{PE,bus}, and using Theorem 1, Γ_{cycle,bus} = ε_{cycle,bus} (σ_cycle)^{t-1} (Γ_cycle)^t.

The architecture of the PE column is defined by string (13) given by Theorem 2. The string ε_{cycle,bus} (σ_cycle)^{t-1} (Γ_cycle)^t carries out a partial perfect unshuffle internally in each PE's local memory (it does not modify the PE field) and determines the processor's routing section. This string is the same as that given by Theorem 1; therefore, each processor is analogous to the one designed in Section III, but now the memories consist of N/(2P) words of size 2D (radix 2), and the read and write functions are the same for all the PEs, i.e., at each cycle all the processors read and write at the same address of their local memories. Consequently, we only need one address generator for all of them, which is an interesting characteristic for VLSI implementation regarding area and design time.

The partial unshuffle operator Γ_{PE,bus} defines the interconnection network of the PE column. Observe that operator Γ_{PE,bus} transforms the data item [α][π_p, ..., π_1][β_1] into [α][β_1, π_p, ..., π_2][π_1]. As we noted in Section III, the field [β_1] can be interpreted as the memory word segment where the data are saved. So, all the data saved in segment [β_1] of the local memory of processor [π_p, ..., π_1] will access the PE of index [β_1, π_p, ..., π_2] through input bus [π_1] in the next stage. The operation Γ_{PE,bus} carries out over the indices is equivalent to moving the data from segment [β_1] of all locations in the local memory of processor [π_p, ..., π_1] to the processor with index [β_1, π_p, ..., π_2], without modifying their relative arrangement, i.e., in the order determined by the read function used in Section III. Therefore, we must connect segment [β_1] of the local memory of PE [π_p, ..., π_1] to input bus [π_1] of PE [β_1, π_p, ..., π_2]. This is implemented by means of the perfect unshuffle interconnection network. The third column in Fig. 6 shows its design for a column of 4 PEs computing a radix 2 CGFFT.

For an array of processors, the design is obtained straightforwardly from (13) by decomposing the field PE into two fields, row = [ρ_u, ..., ρ_1] and column = [χ_v, ..., χ_1], with ρ_i = π_{i+v} and χ_i = π_i, which indicate the row and column occupied in the array. Then, (13) becomes

CGFFT = ∏_{t=1}^{n} B_{cycle,bus} ε_{cycle,bus} (σ_cycle)^{t-1} (Γ_cycle)^t Γ_{row,column,bus}.   (14)

C. Array Communications Regularization

From (14), operator Γ_{row,column,bus} determines the communications among the processors of the array. Unfortunately, this operator generates an irregular network, with no defined global interconnection pattern. However, local patterns exist that can be extracted by factoring this operator into simpler ones. In fact, from Lemma 2 we get Γ_{row,column,bus} = Γ_{row,bus} Γ_{column,bus}. The partial unshuffle operator Γ_{row,bus} is restricted to the fields row, bus; thus it determines an identical communication pattern in all the columns. On the other hand, operator Γ_{column,bus} is restricted to the fields column, bus and produces the same pattern in each row. Fig. 6 depicts the communications regularization carried out by the partial unshuffles for a 4 × 4 array of processors. Solid lines show the unshuffling of rows; dashed lines show some of the interconnections determined by operator Γ_{column,bus}. From the hardware implementation point of view, we introduce modularity in the communications among PEs, which is very interesting regarding reliability and design time. When one of the array dimensions is large, it is possible to decrease the interconnection complexity further. In fact, using Lemma 1 we can decompose these operators as a product of exchanges. For


Fig. 6. Network for a 4 × 4 array of PEs. Solid lines carry out the unshuffling operator Γ_{row,bus}; dashed lines, operator Γ_{column,bus}. The output lines at the last column of PEs show the modular interconnection for dimension row obtained from the factorization of operator Γ_{row,bus} as a product of exchanges.

example, in the case of operator Γ_{row,bus}, the unshuffling is carried out in u exchange steps in such a way that at the jth step the following exchange occurs:

[ρ_u, ..., ρ_{j+1}, ρ_j, γ_1, ρ_{j-1}, ..., ρ_2][ρ_1] → [ρ_u, ..., ρ_{j+1}, γ_1, ρ_j, ρ_{j-1}, ..., ρ_2][ρ_1].   (15)

As an example, the output lines in the last column of PEs in Fig. 6 show the exchange steps obtained from factoring operator Γ_{row,bus} for the case of a radix 2, four-row array. Observe the regularity and modularity of this solution. To summarize, in this section we have designed a parallel architecture for the radix r CGFFT algorithm. It consists of a specific-purpose processor column whose interconnection network is a perfect unshuffle of the rP outputs of the local memory segments with the rP input buses of the PEs. Each processor is similar to the one obtained for the uniprocessor implementation in Section III, has one local memory of size ND/P (for an N-data transform and a column of P processors), and very little additional storage (the SIPO registers). A single address-generation circuit serves each and every local memory. This processor column can be extended to an array of processors in a straightforward manner. Further factorizations of the unshuffle operator lead to more regular and modular designs.

D. Evaluation of Design Efficiency

The routing network required by the architecture of Section IV-B belongs to the category known as area-efficient [11], because it has fixed interconnections with bus cross number O(P²r²) and evenly routes Pr buses to the local memories of the P processors in each time unit. However, the area-efficient concept would be more precise if the processor area, which is mainly determined by the memory size, were also considered. In this sense, an architecture is area-efficient if each processor uses a memory of size ND/P, the minimum needed to store the data to be computed. We can conclude that our design is area-efficient in both senses.
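The perfect-unshuffle wiring of Section IV-B (segment [β_1] of PE [π_p, ..., π_1] to input bus [π_1] of PE [β_1, π_p, ..., π_2]) can be tabulated with a small sketch of ours; `pe_network` is a hypothetical helper name:

```python
def pe_network(pe, seg, p, r=2):
    """Γ_{PE,bus} as a wiring rule: segment `seg` of the local memory of PE
    [pi_p..pi_1] feeds input bus (pe mod r) of the PE whose index is the
    right rotation [seg, pi_p, ..., pi_2]."""
    dest_pe = pe // r + seg * r ** (p - 1)
    dest_bus = pe % r
    return dest_pe, dest_bus

# for a column of 4 radix-2 PEs, every (PE, bus) input is driven by
# exactly one (PE, segment) output, so the network is a fixed permutation
wiring = {pe_network(pe, s, 2) for pe in range(4) for s in range(2)}
```

Because the map is a bijection on (PE, bus) pairs, the rP memory-segment outputs route evenly onto the rP input buses, which is the fixed-interconnection property the area-efficiency argument relies on.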

However, we must point out that the area occupied by the memory is proportional not only to the number of data stored but also to the number of ports. Therefore, the most efficient memory (for a minimum number of computation cycles) is the one used in our case, with a single read port and a single write port. This feature also improves the memory access time, which may be critical in establishing the operation cycle of a pipelined processor. In addition, a memory organized in words of size 2D helps to reduce the access time. Table I shows the differences in area and access time between two memories with two read/write ports of sizes 16 × 256 and 16 × 1024, and two memories of sizes 32 × 128 and 32 × 512 with only one read port and only one write port. In the following, we compare the area/time parameters of our design with two others [7], [8]. Let A_PS be the area of the processing and internal routing sections, and λN the memory area, λ being a parameter depending on technology and memory organization. In our design, the total amount of area would be A_PS + λN, and A_PS + 2λN in the other two (we do not consider the differences that may occur in the values of A_PS and λ). If we consider that the area occupied by the processors is basically the same as the area occupied by the local memories [12], we conclude that our design reduces the area by almost 50% with respect to these two designs. On the other hand, the number of execution cycles needed to calculate a radix r FFT is (N/(rP)) log_r N + s in all the cases we consider, s being the depth of the pipeline of the PS section (notice that r is constrained to 4 in [7]). As pointed out before, our data storage organization permits a higher operation frequency, which means a further reduction of the area × time parameter.

V. CONCLUSIONS

We have designed a specific parallel architecture for the computation of the radix r Fast Fourier Transform which reduces by almost 50% the area of other designs with similar performance. This reduction is mainly due to three factors. First, the algorithm implemented


is a constant-geometry algorithm that uses a fixed interconnection network. The second factor is that the memory requirements are minimal due to the in-place implementation, since we use only the memory needed to store the data sequence to be transformed. The design uses a single local memory with a single read port and a single write port in each processor; the mechanism for address generation is very simple, and a single one serves each and every processor. Finally, the memory organization in words of length rD reduces the access time and thus the processor cycle. On the other hand, the data flow is regular, efficiently exploits the pipelining of the processing section with no cycle loss, and provides an optimal load balance. These characteristics make the design especially suitable for VLSI integration. In fact, the uniprocessor system presented in Section III-A has been implemented in 0.7-μm CMOS technology in a DSP for a real-time audio application.

REFERENCES
REFERENCES

[1] B. Holland and J. Mather, "Monolithic frequency domain processing with 450 MFLOPS throughput," Electron. Eng., pp. 29-36, Aug. 1989.
[2] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip implementation of 8192 complex point FFT," IEEE J. Solid-State Circuits, vol. 30, Mar. 1995.
[3] H. Miyanaga and H. Yamauchi, "A 400 MFLOPS processor VLSI architecture," IEICE Trans., vol. E-74, no. 11, pp. 3845-3851, Nov. 1991.
[4] J. A. Hidalgo-López, J. C. Tejero, J. Fernández, E. Herruzo, and A. Gago, "New architecture for RFFT calculation," Electron. Lett., vol. 30, no. 22, Oct. 1994.
[5] E. L. Zapata and F. Argüello, "Application-specific architecture for fast transforms based on the successive doubling method," IEEE Trans. Signal Processing, vol. 41, pp. 1476-1481, Oct. 1993.
[6] G. Miel, "Constant geometry fast Fourier transforms on array processors," IEEE Trans. Computers, vol. 42, pp. 371-375, Mar. 1993.
[7] S. F. Gorman and J. M. Wills, "Partial column FFT pipelines," IEEE Trans. Circuits Syst., vol. 42, June 1995.
[8] F. Argüello, J. D. Bruguera, R. Doallo, and E. L. Zapata, "Parallel architecture for fast transforms with trigonometric kernel," IEEE Trans. Parallel Distributed Syst., vol. 5, pp. 1091-1099, Mar. 1994.
[9] J. López and E. L. Zapata, "Unified architecture for divide and conquer based tridiagonal systems solvers," IEEE Trans. Computers, vol. 43, pp. 1413-1421, Dec. 1994.
[10] M. C. Pease, "An adaptation of the fast Fourier transform for parallel processing," J. ACM, vol. 15, pp. 252-264, 1968.
[11] C. B. Shung, H. D. Lin, R. Cypher, P. H. Siegel, and H. K. Thapar, "Area-efficient architectures for the Viterbi algorithm, Part I: Theory," IEEE Trans. Commun., vol. 41, pp. 636-643, Apr. 1993.
[12] C. D. Thompson, "Fourier transforms in VLSI," IEEE Trans. Computers, vol. C-32, pp. 1047-1057, Nov. 1983.

Novel Vector Quantization Based Algorithms for Low-Power Image Coding and Decoding
K. Masselos, P. Merakos, T. Stouraitis, and C. E. Goutis

Abstract: In this paper, a novel scheme for low-power image coding and decoding based on vector quantization is presented. The proposed scheme uses small codebooks, and block transformations are applied to the codewords during coding. Because of the small codebooks, the proposed scheme has reduced memory requirements in comparison to classical vector quantization. The transformations applied to the codewords computationally extend the small codebooks, compensating for the quality degradation introduced by the small codebook size. Thus, the coding task becomes computation-based rather than memory-based, leading to significant power savings, since memory-related power consumption forms the major part of the total power consumption of a system. Since the parameters of the transformations depend on the image block under coding, the small codebooks are dynamically adapted to the specific block under coding, leading to acceptable image quality. The proposed scheme leads to power savings of at least a factor of 10 in coding and a factor of 3 in decoding in comparison to classical full-search vector quantization. The main factor affecting both image quality and power consumption is the size of the codebook that is used.
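The idea of computationally extending a small codebook can be illustrated with an affine (gain-plus-offset) adaptation of each codeword to the block being coded. This is a classical gain/mean-adaptive search, used here only as a plausible sketch; the abstract does not specify the paper's actual transformations, and the codebook values are made up:

```python
def encode_block(block, codebook):
    """For each codeword y, fit gain g and offset o minimizing
    ||x - (g*y + o)||^2 by least squares, then pick the codeword
    with the lowest residual.  Returns (index, gain, offset)."""
    k = len(block)
    mx = sum(block) / k
    best = None
    for idx, y in enumerate(codebook):
        my = sum(y) / k
        var_y = sum((v - my) ** 2 for v in y)
        cov = sum((a - mx) * (b - my) for a, b in zip(block, y))
        g = cov / var_y if var_y else 0.0   # flat codeword: offset-only fit
        o = mx - g * my
        err = sum((a - (g * b + o)) ** 2 for a, b in zip(block, y))
        if best is None or err < best[0]:
            best = (err, idx, g, o)
    return best[1], best[2], best[3]

codebook = [(1, 2, 3, 4), (1, 5, 2, 7)]
idx, g, o = encode_block((5, 7, 9, 11), codebook)   # block = 2*codeword0 + 3
print(idx, round(g, 3), round(o, 3))                # → 0 2.0 3.0
```

The decoder reconstructs the block as g*y + o, so each coded block costs one small index plus two scalars instead of an entry in a much larger codebook.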

I. INTRODUCTION

Image and video coding form an integral part of information exchange. The number of computer systems incorporating multimedia capabilities for displaying and manipulating video data is continuously increasing. As the essential design consideration for portability is the reduction of power consumption [1], this interest in multimedia, combined with the great popularity of portable computers and phones, makes the development of low-power image and video coding/decoding schemes very important. A hardware implementation of a very low-power decoder based on vector quantization, for real-time video decompression on a portable terminal, is presented in [2]. Another low-power video compression/decompression system based on pyramid vector quantization of subband coefficients is described in [3]. Vector quantization [4] is an efficient image coding technique, achieving low bit rates, i.e., lower than 1 bit per pixel. Vector quantization is described as
Q: R^k → C,   C = {y_1, y_2, ..., y_N},   y_i ∈ R^k,   i = 1, 2, ..., N

y_i = Q(x)   if   d(x, y_i) ≤ d(x, y_j)   for all j = 1, 2, ..., N        (1)
where x is a k-dimensional input vector belonging to the k-dimensional space R^k, C is the codebook of N k-dimensional codewords y_i, and d is the distortion criterion used. In vector quantization, a vector, which is a block of pixels, is approximated by the representative vector (codeword) of the codebook that minimizes the distortion among all the codevectors in the codebook. Compression is achieved by transmitting or storing the codeword address (index) instead of the codeword itself.
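A minimal full-search encoder implementing the nearest-codeword rule of (1), with squared error as the distortion d, can be sketched as follows (the toy codebook values are made up for illustration):

```python
def vq_encode(block, codebook):
    """Return the index of the codeword minimizing the squared-error
    distortion d(x, y_i); only the index is transmitted or stored."""
    def d(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return min(range(len(codebook)), key=lambda i: d(block, codebook[i]))

# 2x2 pixel blocks flattened to 4-vectors; a toy 3-word codebook.
codebook = [(0, 0, 0, 0), (8, 8, 8, 8), (16, 16, 16, 16)]
print(vq_encode((7, 9, 8, 6), codebook))   # → 1 (closest to (8, 8, 8, 8))
```

Compression comes from sending the index (log2 N bits) instead of the k pixel values; decoding is a single table lookup, which is why classical VQ decoding is memory-bound.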
Manuscript received September 3, 1996; revised November 25, 1997. This paper was recommended by Associate Editor C.-Y. Wu.
The authors are with the Department of Electrical and Computer Engineering, University of Patras, Rio 26500, Greece (e-mail: [email protected]).
Publisher Item Identifier S 1057-7130(99)01758-9.

1057-7130/99$10.00 © 1999 IEEE
