4 - CORDIC Based FFT
4 - CORDIC Based FFT
Abstract:
A pure CORDIC based architecture is used to calculate the FFT in signal processing. Though many hardware efficient
algorithms exist, but these are not generally known due to the dominance of software systems. Among these algorithm is
a set of shift and add operation collectively known as CORDIC used for computing a wide range of function including
certain hyperbolic, trigonometric, logarithmic and linear functions. The CORDIC is used in many digital signals
processing application in which FFT is designed in this. The performance of this approach can compete with the
ordinary MAC based FFT. With the presented pure CORDIC based FFT it is now possible to replace the main
processing blocks of the WLAN and UMTS baseband by this programmable architecture.
I INTRODUCTION
A modern signal processing application requires a high computational power for technical demands that can be fulfilled only by
ASIC. Though ASIC can meet all the technical demands it has the disadvantage such as it is inflexible, costly and economical
only for mass products. So the system designers are striving to replace the hardware based solution with a software based
solution.
Due to these reasons that most commonly used programmable device i.e, DSP often lack the required power needed for
the design. In order to provide the solution between these two extrema such as programmable signal processing and dedicated
hardware a reconfigurable computing is developed.Within the framework of MORTEX project, this paper provides a solution
by replacing the MAC based FFT[1][2] by a purely CORDIC based FFT.
The Coordinate Rotation Digital Computer (CORDIC) was introduced by Volder [3] that provides an effective method
for computing iteratively the rotation of the two dimensional vector using only shift and add operation. The CORDIC algorithm
is used in many applications such as digital filtering, modulation, Fast Fourier Transform (FFT), and Signal Value
Decomposition(SVD). The CORDIC based architecture are a very appealing alternative to multiply and add hardware.
In [4] it was shown that the Rake receiver was replaced by CORDIC based algorithm that uses the reconfigurable
hardware accelerator[5] [6], which even results in a better performance. An 8 point FFT used in WLAN is implemented on the
same architecture of Rake receiver is explained in this paper.Therefore we have derived a reconfigurable architecture for a
mobile multi standard terminal and the main processing blocks of the WLAN and UMTS baseband can be replaced by this
programmable architecture.
This paper is organized as follow: Section I gives the Introduction of the CORDIC algorithm and FFT of Digital Signal
Processing. Section II describes about the MAC based algorithm. Section III explains about the CORDIC based algorithm, the
solution approach and the proposed implementation. Section IV show some simulation results in a WLAN environment, the
performance of proposed technique and the last section V concludes the paper and followed by the references.
The number of complex multiplication and addition operations required by the simple forms both the Discrete Fourier
Transform (DFT) and Inverse Discrete Fourier Transform (IDFT) is of order N2 as there are N data points to calculate, each of
which requires N complex arithmetic operations.
For an input vector X and length n, the DFT is a vector X length n, with n elements:
(1)
In computer science jargon, we may say they have algorithmic complexity O(N2) and hence is not a very effective
method. If the proposed method is not better than this then the DFT will not be very useful for the majority of practical DSP
applications. But ,there are large number of different 'Fast Fourier Transform' (FFT) algorithms that enable the calculation the
Fourier transform of a signal much faster than a DFT.
As the name suggests, FFTs are algorithms for quick calculation of discrete Fourier transform of a input vector. The
FFT that includes DFT algorithm reduces the number of computations needed for N points from O(N 2) to O(N log N) where
log is the base-2 logarithm. The function that is to be transformed is not harmonically related to the sampling frequency, but it
looks like a ‘sinc’ function (sin x).
The 'Radix 2' algorithms are useful if N is a regular power of 2 (N=2p). Let us assume that algorithmic complexity
provides a direct measure of execution time and that the relevant logarithm base is 2 then as shown in Fig 1, ratio of execution
times for the (DFT) versus (Radix 2 FFT) (denoted as ‘Speed Improvement Factor’) increases tremendously with increase in N.
The FFT are the most advanced one because there are several numbers of FFT algorithms used. There are two Radix 2
algorithms, called as 'Decimation in Frequency' (DIF) and 'Decimation in Time' (DIT) algorithms. Both the algorithms depend
on the recursive decomposition of an N point transform into 2 (N/2) point transforms. The decomposed process can be applied
to any composite (non prime) N. The method is simple if N is divisible by 2 and if it is a power of 2, then the decomposition
can be done recursively until the trivial '1 point' transform is reached.
The Fig. 1 shows the first stage of the 8-point DIF algorithm.The radix-2 decimation-in-frequency FFT is an important
algorithm obtained by the divide and- conquers approach.
Fig.1First stage of 8 point Decimation in Frequency algorithm. Fig.2 2 point Decimation in Frequency algorithm
The entire process involves v = log2 N stages of decimation, where each stage involves N/2 butterflies of the type
shown in the Fig. 2. The decimation, however, causes shuffling in data.
Consequently, the computation of N-point DFT via this algorithm requires (N/2) log2 N complex multiplications. For
design purposes, the eight-point decimation-in frequency algorithm is shown in the Fig.3. We observe, that the output sequence
has been occurred in a bit reversed order for the applied input. Furthermore, if suppose that the computation is abandoned in a
particular place then both the input and output can be achieved in the normal order.
Verilog HDL code implements an 8 point decimation-in-frequency algorithm using the butterfly structure. The number of stages
v in the structure shall be v = log2 N. In our case, N = 8 and hence, the number of stages is equal to 3. There are numerous
number of ways to implement these 3 stages. Some of them are,
A Iterative Architecture
In iterative architecture only one stage is repetitively used for three times in decimation that provides a hardware efficient
circuit for all the 12 set adders and subtractors in one set. Out of these 3 stages, the first stage will requires only 2 CORDICs
and each will take 8 clock pulses for each computation. The remaining two stages does not require any CORDIC even though
they have to be rotated the data by 0o or -90o using the CORDIC, that will take 16 clock pulses. The rotation of data for the
whole process can be achieved by using the 2’s complement and the Bus exchange that would require only less hardware.
In this when one of the data is being computed, the process has to wait until it has been completed for 36 clock pulses.
Thus,
Time Taken for computation = 24 clock cycles
No. of 12 bit adders and sub tractors = 16
B Pipeline Architecture
In pipeline architecture instead of using one stage, three stages are used separately for everydecimation and the other case is that
12 bit adders require 3 sets of sixteen adders. By using this architecture the complexity and the delay has been reduced
drastically at each stage separated from other register bank. When the previous data is set then another set of data is sent serially
into the register input. At the last the 3 stages works simultaneously as a total net effect.
Though the pipeline architecture has many advantages compared to iterative one it does not consider as the valid
option because of large amount of hardware used. Besides, this provides an improvement in architecture merely as 1 clock cycle.
Thus,
Time Taken for computation = 8 clock cycles
No. of 12 bit adders and subtractions = 40
C Proposed Method
To balance between the pipeline and iterative architecture the proposed method will use the two stages for calculating the three
decimations. The first stage will be implemented as in the normal fashion but the remaining two stages are combined together to
form one stage since it does not require any CORDIC. MUX is used to control the data that has been given as the input which in
turn controlled by COUNTER MUX. In the first stage the adders and subtractors require only the real data while the next stage
requires both real and imaginary data.
Thus,
Time Taken for computation = 10 clock cycles
No. of 12 bit adders and subtractors = 24
Fig.4 Block diagram of input architecture Fig.5 Block diagram of butterfly scheme
The input data is sent serially and depends of the counter output the data will pass into the 12 bit register for parallel
input. In this only 8 clock pulses are used in this process as shown in the Fig.4. Then later the data acts as an input for
asynchronous adders and subtractors.
The asynchronous output is given as an input to the CORDIC block. Out of many outputs the output from 0 to 5 and 8
are available for the next stage but only after the 8 clock pulse is achieved i.e., the output is ready for the second stage only after
16 clock pulses. After the desired clock pulse is achieved the output is passed to next stage.The stage 2 in this circuit jointly
implements both the second and third decimations in the architecture simply because there is no CORDIC required in these
stages and rotation required is -90o or 0o.Thus, a+bj on rotation by -90o becomes b-aj, i.e. simply 2’s complement of ‘a‘.
The Fig.5 displays the block diagram of the butterfly scheme in which the input data is varied to combine both the
stage into single stage. If suppose the inputs of one or two is flipped then we get another stage. The second and the third stages
are asynchronous in nature so they require only a single clock pulse for each computation.
Fig.6 Adjustment done to Implement 2 & 3rd Stage together Fig.7 Block diagram of output architecture
After we get the output at the end of the 3rd stage, it is loaded into the VECTORING CORDIC. This results in a
complex number of a magnitude as Real + Imag * j that takes 8 clock cycles to compute.
The 8 outputs are then sent to the output port serially for next 8 clock cycles. The output architecture Fig.7 shows how
the output is channeled into 12 bit port.
1) Rotation CORDIC : To rotate a vector x + jy by inp_angle it uses the standard CORDIC algorithm and it takes 8 iteration to
complete. When iteration starts, the x, y and the angle registers are initiated to the original values of x, y and inp_angle
respectively. The x and y registers either add or subtract the shifted values of y and x depending on the sign of the angle
accumulated in angle register respectively at every iteration.
2) Vectoring CORDIC: This module makes a vectoring CORDIC that computes the magnitude (mag) of a vector x + j*y. Note
that the CORDIC gain factor compensation is not done in this vectoring CORDIC.As such, the magnitude values are actually
multiplied by the CORDIC gain Factor of 1.647.The vectoring CORDIC uses the same concept as that explained in the rotation
CORDIC.
When started, the x and y registers are initiated with the original values of x & y respectively. Then, try to iteratively
rotate the vector so that it comes onto the x - axis and the magnitude then is equal to this x component. To achieve this, the x
and y registers iteratively add or subtract the shifted values of y and x respectively so that the y register dies down to zero.
At the start of the iterations, we need to bring the vector in the region +90 to -90 degrees. The original vector x + j*y, if
in the 1st or the 2nd quadrant, is rotated by -90 degrees to get it in the region +90 degrees to -90 degrees. If the vector x + j*y is
in the 3rd or the 4th quadrant, it is rotated by +90 degrees to get it in region +90 degrees to -90 degrees.
IV SIMULATION RESULTS
The CORDIC FFT architecture uses certain blocks as Twelve Bit Adder, Counters and the CORDIC blocks themselves require
Shifters and registers. The output of CORDIC FFT is compared with the MAC based FFT and the results are analysed.
The FFT consists of the butterfly stage, shifter unit and control unit. The input to the main block is given as x_k which
is of 16 bit and the output is represented as out_real and out_imag.
The LUT is required by the angle accumulator register in the rotation cordic module to compute the angle values that
need to be added or subtracted from the accumulated angle. The angle values stored are stored in the following format:
12 bit representation is used.
The MSB, bit11, has a binary weight of -180 degrees.
The next bit, bit10 has a binary weight of +90 degrees.
Then, the next successive bits, bit(9:0) have binary weights of +90/(2^n) where n varies from1 for bit9 to 10 for bit 0.
Thus the least count or the least angle that can be represented in this system is 90/(2^10) = 0.087890625 degrees.
Fig.11 Simulation result of LUT Fig.12 Simulation result of 12 bit full adder
Fig.12 shows the full adder module is used to make a 'cascaded' 12 bit adder block. As explained for the full adder, this
adder works as carry bypass. This 12 bit adder is used as an adder/subtractor for two 12 bit numbers: a & b.The addition
subtraction depends on the sign bit. sign = 0 means addition, sign = 1 means subtraction. For subtraction, using an EXOR
inverter array, the 1's complement of b is passed to the cascade of 3 full adder blocks along with making the input sign = 1.
The main module consists of an input array which is of 12 bits.Counter is used in this in which according to the
counter input each stage of the butterfly output will be produced.For example when counter input is 000001 and 000100 then
the corresponding output outr0 will be produced.
V CONCLUSION
The key contribution of this project is the realization of FFT using the CORDIC algorithm. The CORDIC algorithm that has
been used is simply a shift and add operation which is of an efficient one and improve the performance in the design.The
CORDIC based FFT is compared with the MAC based FFT in which real and imaginary part multiplication is omitted instead
of that rotation is performed using the CORDIC.Thus the designed CORDIC based FFT provides a better performance than the
MAC based FFT.
ACKNOWLEDGEMENT
The author thanks the Department of Electronics and Communication Engineering, Bannari Amman Institute of Technology for
the support rendered for carrying out this project.
REFERENCES
[1] A. V. Oppenheim and R. W. Schafer, Discrete Time Signal Processing, 3rd ed. Upper Saddle River, New Jersey: Prentice-Hall, 1999.
[2] C. V. Loan, Computational Frameworks for the Fast Fourier Transform, 1st ed. Philadelphia, Pennsylvania: SIAM, 1992.
[3] J. Volder, “The CORDIC trigonometric computing technique,” IRE Transactions on Electronic Computers,vol. EC-8, no. 3, pp. 330–334, 1959.
[4] B. Heyne, M. Otte, and J. Gotze, “A Performance Adjustable and Reconfigurable CDMA Receiver Concept for UMTS-FDD,” in 14th IEEE International
Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC2003), Beijing, China, September 2003.
[5] B. Oelkrug, M. Bucker, D. Uffmann, A. Dr¨oge, J. Brakensiek, and M. Darianian, “Programmable hardware accelerator for universal telecommunication
applications,” in 2nd Workshop on Software Radios, Karlsruhe, Germany,2002.
[6] M. Otte, J. Gotze, and M. Bucker, “Matrix Based Signal Processing on a Reconfigurable Hardware Accelerator,” in 10th Digital Signal Processing
Workshop, Pine Moutain, Georgia, USA, October 2002.
[7] B. Heyne and J. Goetze, “A pure cordic based fft for reconfigurable digital signal processing,” 12 th European Signal Processing Conference(Eusipco2004),
vol. 7, September 2004.
[8] Z. Liu, K. Dickson, and J. V. McCanny, “A floating-point cordic based svd processor,” in ASAP, 2003.
[9] R. Andraka,“A survey of cordic algorithms for fpgas,” FPGA ’98.Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field
programmable gate arrays, vol. 34, no. 2, pp. 191–200, Feb. 22-24 1998.
BIOGRAPHY
S.Karthick received his B.E. (Electrical and Electronics Engineering) degree from Anna University, Chennai in April
2006 and M.E. (Applied Electronics) degree from Anna University, Chennai in April 2008.He is currently pursuing his
Ph.D. degree at Anna University, Chennai in the area of Low power VLSI. He is presently working as Assistant
Professor (Sr.G) in the department of Electronics and Communication Engineering, Bannari Amman Institute of
Technology, Sathyamangalam. He is having a total of 5 years of teaching experience in engineering college. His
research interest includes Low power VLSI and Reconfigurable FPGA. He is the life member in Indian Society for
Technical Education and Member in Institution of Engineers. He has published 8 papers in International and National
Journals, 5 papers in International conferences and National Conferences .
P.PRIYA received her BE degree in Electronics and Communication Engineering from Chettinad College of Engineering
and Technology, Puliyur, in 2011 and pursuing ME(VLSI Design) in Bannari Amman Institute of Technology,
Sathyamangalam
Dr. S. Valarmathy received her B.E. (Electronics and Communication Engineering)degree and M.E. (Applied Electronics)
degree from Bharathiar University, Coimbatorein April 1989 and January 2000 respectively. She received her Ph.D.
degree at AnnaUniversity, Chennai in the area of Biometrics in 2009. She is presently working as Professor& Head in the
department of Electronics and Communication Engineering, Bannari Amman Institute of Technology, Sathyamangalam.
She is having a total of 20 years of teaching experience in various engineering colleges. Her research interest includes
Biometrics, Image Processing, Soft Computing, Pattern Recognition and Neural Networks. She is the life member in
Indian Society for Technical Education and Member in Institution of Engineers. She has published 14 papers in
International and National Journals, 48 papers in International conferences and National Conferences.