Implementation of Fast Fourier Transform (FFT) On FPGA Using Verilog HDL
An Advanced-VLSI-Design-Lab (AVDL) Term-Project, VLSI Engineering Course, Autumn 2004-05, Department of Electronics & Electrical Communication, Indian Institute of Technology Kharagpur
Submitted by Abhishek Kesh (02EC1014), Chintan S. Thakkar (02EC3010), Rachit Gupta (02EC3012), Siddharth S. Seth (02EC1032), T. Anish (02EC3014)
ACKNOWLEDGEMENTS
It is with great reverence that we wish to express our deep gratitude towards our VLSI Engineering Professor and Faculty Advisor, Prof. Swapna Banerjee, Department of Electronics & Electrical Communication, Indian Institute of Technology Kharagpur, under whose supervision we completed our work. Her astute guidance, invaluable suggestions, enlightening comments and constructive criticism always kept our spirits up during our work.
We would be accused of ingratitude if we failed to mention the consistent encouragement and help extended by Mr. Kailash Chandra Ray, Graduate Research Assistant, during our Term-Project work. The brainstorming sessions at AVDL spent discussing various possible architectures for the FFT were very educative for us novice VLSI students.
Our experience in working together has been wonderful. We hope that the knowledge, practical and theoretical, that we have gained through this term project will help us in our future endeavours in the field of VLSI.
In computer science jargon, we may say that direct computation of the DFT has algorithmic complexity O(N^2) and hence is not a very efficient method. If we cannot do any better than this, then the DFT will not be very useful for the majority of practical DSP applications. However, there are a number of different 'Fast Fourier Transform' (FFT) algorithms that enable the calculation of the Fourier transform of a signal much faster than a direct DFT. As the name suggests, FFTs are algorithms for quick calculation of the discrete Fourier transform of a data vector. The FFT is a DFT algorithm which reduces the number of computations needed for N points from O(N^2) to O(N log N), where log is the base-2 logarithm. If the function to be transformed is not harmonically related to the sampling frequency, the response of an FFT looks like a sinc function, (sin x)/x.

The 'Radix 2' algorithms are useful if N is a regular power of 2 (N = 2^p). If we assume that algorithmic complexity provides a direct measure of execution time and that the relevant logarithm base is 2, then, as shown in Fig. 1.1, the ratio of execution times for the DFT vs. the Radix 2 FFT (denoted as the Speed Improvement Factor) increases tremendously with increase in N.

The term 'FFT' is actually slightly ambiguous, because there are several commonly used 'FFT' algorithms. There are two different Radix 2 algorithms, the so-called 'Decimation in Time' (DIT) and 'Decimation in Frequency' (DIF) algorithms. Both of these rely on the recursive decomposition of an N-point transform into two (N/2)-point transforms. This decomposition process can be applied to any composite (non-prime) N. The method is particularly simple if N is divisible by 2, and if N is a regular power of 2 the decomposition can be applied repeatedly until the trivial '1 point' transform is reached.
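As a rough illustration of the Speed Improvement Factor, the following minimal Python sketch evaluates the ratio N^2 / (N log2 N) for a few powers of 2. It assumes, as the text does, that operation counts translate directly into execution time; the function name is hypothetical and not part of the project code.

```python
import math

def speed_improvement_factor(n):
    """Ratio of O(N^2) DFT cost to O(N log2 N) radix-2 FFT cost.

    Assumes n is a power of 2 greater than 1 and that the complexity
    expressions measure execution time directly.
    """
    return (n * n) / (n * math.log2(n))

if __name__ == "__main__":
    for p in (3, 6, 10):
        n = 2 ** p
        print(f"N = {n:5d}  speed improvement factor = {speed_improvement_factor(n):.2f}")
```

The ratio simplifies to N / log2 N, which is why it grows so quickly with N.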
Fig. 1.1: Comparison of Execution Times, DFT & Radix 2 FFT

The radix-2 decimation-in-frequency FFT is an important algorithm obtained by the divide-and-conquer approach. Fig. 1.2 below shows the first stage of the 8-point DIF algorithm.
Fig. 1.2: First Stage of 8 point Decimation in Frequency Algorithm.

The decimation, however, causes shuffling in data. The entire process involves v = log2 N stages of decimation, where each stage involves N/2 butterflies of the type shown in Fig. 1.3.
Fig. 1.3: Butterfly Scheme.

Here W_N = e^(-j2π/N) is the twiddle factor. Consequently, the computation of an N-point DFT via this algorithm requires (N/2) log2 N complex multiplications. For illustrative purposes, the eight-point decimation-in-frequency algorithm is shown in the figure below. We observe, as previously stated, that the output sequence occurs in bit-reversed order with respect to the input. Furthermore, if we abandon the requirement that the computations occur in place, it is also possible to have both the input and output in normal order.
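The butterfly structure and the bit-reversed output order can be sketched in software as follows. This is a floating-point Python model, not the Verilog implementation; it assumes the twiddle factor convention W_N = e^(-j2π/N), and all function names are hypothetical. A direct O(N^2) DFT is included only as a cross-check.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def fft_dif_bit_reversed(x):
    """In-place radix-2 DIF FFT; the result is left in bit-reversed order."""
    a = [complex(v) for v in x]
    n = len(a)                       # assumed to be a power of 2
    span = n // 2
    while span >= 1:                 # one pass per decimation stage
        for start in range(0, n, 2 * span):
            for k in range(span):    # N/2 butterflies per stage in total
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                top, bot = a[start + k], a[start + k + span]
                a[start + k] = top + bot
                a[start + k + span] = (top - bot) * w
        span //= 2
    return a

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_dif(x):
    """DIF FFT with the output reordered into normal order."""
    n = len(x)
    bits = n.bit_length() - 1
    a = fft_dif_bit_reversed(x)
    return [a[bit_reverse(k, bits)] for k in range(n)]

if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    for fast, slow in zip(fft_dif(data), dft(data)):
        print(f"{abs(fast):8.4f}  {abs(slow):8.4f}")
```

For N = 8 the outer loop runs three times, matching the v = log2 N stages described above.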
2. ARCHITECTURE
2.1 Comparative Study
Our Verilog HDL code implements an 8-point decimation-in-frequency algorithm using the butterfly structure. The number of stages v in the structure is v = log2 N. In our case N = 8, and hence the number of stages is 3. There are various ways to implement these three stages. Some of them are:

A) Iterative Architecture - Using only one stage iteratively three times, once for every decimation

This is a hardware-efficient circuit, as there is only one set of 12-bit adders and subtractors. The first stage requires only 2 CORDICs. The computation of each CORDIC takes 8 clock pulses. The second and third stages do not require any CORDIC, although in this structure they would have to rotate data by 0° or -90° using the CORDIC, which would take 16 clock pulses (8 for the second and 8 for the third stage). The rotation by 0° or -90° can instead be achieved easily by a 2's complement and a bus exchange, which requires much less hardware. Besides, while one set of data is being computed, we have no option but to wait 36 clock cycles for it to be completely processed before inputting the next set of data. Thus,

Time taken for computation = 24 clock cycles
No. of 12-bit adders and subtractors = 16

B) Pipeline Architecture - Using three separate stages, one each for every decimation

This is the other extreme, which would require 3 sets of sixteen 12-bit adders. The complexity of implementation would definitely be reduced and the delay would be cut down drastically, as each stage would be separated from the next by a bank of registers, and one set of data could be serially streamed into the input registers 8 clock pulses after the previous set. The net effect is that we can have 3 stages working simultaneously. However, this architecture is not considered a valid option simply because of the immense hardware required. Besides, in terms of the total time taken, it would give an improvement of merely 1 clock cycle over the architecture discussed below, which is the one we have used. Thus,

Time taken for computation = 8 clock cycles

C) Proposed Method - Using 2 stages to calculate the 3 decimations

Our architecture attempts to strike a balance between the iterative and pipeline architectures. We use two stages for the 3 decimations. The first stage is implemented in standard fashion. The second and third stages are merged together to form one stage, as they do not require any CORDIC. The selection of data for computation is controlled by a MUX, which is in turn controlled by the COUNTER MUX. The first stage requires adders and subtractors only for the REAL data, while the next stage requires adders and subtractors for both REAL and IMAGINARY data. Thus,

Time taken for computation = 10 clock cycles
No. of 12-bit adders and subtractors = 24

The above data clearly highlights the fact that the implemented architecture is a trade-off between the two extreme architectures.
2.2 Working
The data is serially entered into the circuit. Depending upon the output of the counter, the data goes into the respective 12 bit register for parallel input. The first 8 clock pulses are used in this input process as shown in the Fig. 2.2.1. This data later automatically acts as input to the asynchronous adders and subtractors.
The outputs are now ready to be input to the CORDIC block. Outputs 0 to 5 and 8 are ready for the next stage, but the outputs of the CORDIC are available only after 8 more clock pulses. Hence, the output to the second stage is available only after 8 + 8 = 16 clock pulses. This output is loaded into the input register, whose output is in turn fed to stage 2 of the circuit. Stage 2 of this circuit jointly implements both the second and third decimations of the architecture, simply because no CORDIC is required in these stages and the rotation required is either -90° or 0°. Thus, a+bj on rotation by -90° becomes b-aj, i.e. simply a 2's complement of a and a bus exchange.
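The -90° rotation can be modelled in a few lines. The Python sketch below treats the real and imaginary parts as 12-bit two's-complement bit patterns, following the 12-bit data width stated in the text; the helper names are hypothetical and the model ignores how the hardware handles the single overflow case noted in the comments.

```python
WIDTH = 12                      # data width used throughout the design
MASK = (1 << WIDTH) - 1

def twos_complement(v):
    """Negate a 12-bit pattern: invert all bits and add one."""
    return (~v + 1) & MASK

def to_signed(v):
    """Interpret a 12-bit pattern as a signed integer."""
    return v - (1 << WIDTH) if v & (1 << (WIDTH - 1)) else v

def to_pattern(v):
    """Convert a signed integer to its 12-bit two's-complement pattern."""
    return v & MASK

def rotate_minus_90(re, im):
    """(a + bj) * (-j) = b - aj: swap the buses and negate the old real part.

    Note: the most negative value (-2048) has no 12-bit negation, so this
    model wraps in that one case.
    """
    return im, twos_complement(re)

if __name__ == "__main__":
    re, im = to_pattern(5), to_pattern(3)          # 5 + 3j
    new_re, new_im = rotate_minus_90(re, im)
    print(to_signed(new_re), to_signed(new_im))    # components of (5 + 3j) * (-j)
```

No multiplier or CORDIC iteration is involved, which is why the second and third stages can be merged cheaply.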
The Fig. 2.2.3 displays how by varying the input of data, both the stages can be implemented using only one stage and used iteratively. If the second and third inputs are flipped, we get the structure for the third stage. As both second and third stages are asynchronous, they require only one clock pulse each for computation.
Fig. 2.2.3: Adjustments done to implement 2nd & 3rd Stage together

After we get the output at the end of the 3rd stage, it is loaded into the VECTORING CORDIC. The VECTORING CORDIC gives the magnitude of the complex number entered as Real + Imag * j as the output, taking 8 clock cycles to compute.
We then send these 8 outputs serially in the output port in the next 8 clock cycles. The above architecture illustrates how the output is channeled into a 12 bit port by the use of counter value and the bank of multiplexers.
Thus, the entire operation of taking in the input vector, performing FFT and giving the result in the output port takes a total of 34 clock cycles. The distribution is summarized as follows.
Operation                                                       Clock cycles
Taking the 8 real values into reg_x[0:7]                        8
Performing Rotation CORDIC                                      8
2nd and 3rd Stage of Butterfly Scheme                           2
Performing the Vectoring CORDIC to get the magnitude            8
Giving the 8 magnitude values into 'out' one after the other    8
Total                                                           34
3. Building blocks
As we saw in the last section, the FFT architecture uses certain blocks such as the Rotation CORDIC, Vectoring CORDIC, Twelve Bit Adder and Counters. The CORDIC blocks themselves require shifters and registers. These blocks are now explained.
Because of this, the x and y components get multiplied by the CORDIC gain factor of 1.647. The exact gain depends on the number of iterations n and obeys the relation:

An = product over i = 0 to n-1 of sqrt(1 + 2^(-2i))
To remove this factor we need to compensate for it. We use two instantiations of the rotate_cordic module running side by side. One CORDIC rotates the input x by angle_a = (-45 + beta) degrees and the other rotates the input x by angle_b = (-45 - beta) degrees. The outputs of these two CORDICs at the end of 8 iterations are:

Cordic 1: xa = An*x*cos(angle_a), ya = An*x*sin(angle_a)
Cordic 2: xb = An*x*cos(angle_b), yb = An*x*sin(angle_b)

where An = 1.647 is the CORDIC gain factor (after 8 iterations). We then get the final x and y values by taking the mean of xa & xb and of ya & yb. Choosing beta such that cos(beta) = 1/An compensates the CORDIC gain factor as follows:

x1 = (xa + xb)/2
   = (An*x*cos(angle_a) + An*x*cos(angle_b))/2
   = An*x*(cos(-45 + beta) + cos(-45 - beta))/2
   = An*x*cos(-45)*cos(beta)
   = x*cos(-45)

y1 = (ya + yb)/2
   = (An*x*sin(angle_a) + An*x*sin(angle_b))/2
   = An*x*(sin(-45 + beta) + sin(-45 - beta))/2
   = An*x*sin(-45)*cos(beta)
   = x*sin(-45)
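The compensation scheme can be checked numerically with a floating-point model. The Python sketch below is not the rotate_cordic Verilog module; it is a hypothetical software stand-in that assumes 8 iterations with elementary angles atan(2^-i) and ignores the 12-bit quantization of the hardware.

```python
import math

def cordic_gain(iterations=8):
    """Gain An accumulated by `iterations` CORDIC micro-rotations."""
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain

def cordic_rotate(x, y, angle_deg, iterations=8):
    """Rotation-mode CORDIC: rotates (x, y) by angle_deg, scaled by An."""
    z = angle_deg                                  # remaining angle to rotate
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0                # direction of this micro-rotation
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.degrees(math.atan(2.0 ** -i))
    return x, y

def compensated_rotate(x, theta_deg, iterations=8):
    """Average two CORDICs at theta +/- beta, with cos(beta) = 1/An."""
    beta = math.degrees(math.acos(1.0 / cordic_gain(iterations)))
    xa, ya = cordic_rotate(x, 0.0, theta_deg + beta, iterations)
    xb, yb = cordic_rotate(x, 0.0, theta_deg - beta, iterations)
    return (xa + xb) / 2.0, (ya + yb) / 2.0

if __name__ == "__main__":
    print(f"An after 8 iterations = {cordic_gain(8):.4f}")
    x1, y1 = compensated_rotate(100.0, -45.0)
    print(f"x1 = {x1:.2f}, y1 = {y1:.2f}")
    print(f"ideal: {100 * math.cos(math.radians(-45)):.2f}, "
          f"{100 * math.sin(math.radians(-45)):.2f}")
```

With only 8 iterations each CORDIC stops within about atan(2^-7), roughly 0.45 degrees, of its target angle, so the averaged result matches x*cos(-45) and x*sin(-45) only approximately.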
At the start of the iterations, we need to bring the vector in the region +90 to -90 degrees. The original vector x + j*y, if in the 1st or the 2nd quadrant, is rotated by -90 degrees to get it in the region +90 degrees to -90 degrees. If the vector x + j*y is in the 3rd or the 4th quadrant, it is rotated by +90 degrees to get it in region +90 degrees to -90 degrees.
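A floating-point sketch of the vectoring operation, including the initial ±90 degree pre-rotation described above, is given below. It is a hypothetical Python model rather than the project's Verilog; it assumes 8 iterations and treats a vector lying on the real axis as belonging to the upper half-plane.

```python
import math

def cordic_gain(iterations=8):
    """Gain An accumulated by `iterations` CORDIC micro-rotations."""
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain

def vectoring_magnitude(x, y, iterations=8):
    """Vectoring-mode CORDIC: returns An * |x + jy|."""
    # Pre-rotation to bring the vector into the +90 to -90 degree region.
    if y >= 0:
        x, y = y, -x          # 1st/2nd quadrant: rotate by -90 degrees
    else:
        x, y = -y, x          # 3rd/4th quadrant: rotate by +90 degrees
    # Each micro-rotation drives y towards zero; x accumulates the magnitude.
    for i in range(iterations):
        d = -1.0 if y > 0 else 1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x

if __name__ == "__main__":
    raw = vectoring_magnitude(3.0, 4.0)
    print(f"raw output = {raw:.4f}, after dividing by An = {raw / cordic_gain():.4f}")
```

The raw output still carries the CORDIC gain, which is why the Verilog results are divided by 1.647 before being compared with MATLAB.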
-> The next bit, bit[10], has a binary weight of +90 degrees.
-> The next successive bits, bit[9:0], have binary weights of +90/(2^n), where n varies from 1 for bit[9] to 10 for bit[0].

Thus the least count, or the least angle that can be represented in this system, is 90/(2^10) = 0.087890625 degrees. In our code, the LUT is made using a simple case statement as shown in the following code segment (Fig. 3.4.1).
Fig. 3.4.2: LUT Circuit

Now, the angles stored in the LUT describe the angle to be used at each of the 8 iterations. These angles are calculated as follows: angle at iteration i (i = 0, ..., 7) = atan(K), where K = 2^(-i).
Thus, at the first iteration, i.e. when count_out = 3'b000, angle = 45 degrees. The representations we have used and the actual angles that are to be used are shown alongside in the code at the relevant place. Please note that we have used the best representation possible to minimize error. The represented angles differ from the actual ones by at most the least count of 0.087890625 degrees.
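The LUT contents can be approximated with the following Python sketch. It only models bits [10:0] of the representation (bit[10] = 90 degrees down to bit[0] = 90/2^10 degrees) and assumes the nearest-code rounding rule; the actual case-statement values in the Verilog code are not reproduced here and may differ by one least count.

```python
import math

LSB_DEG = 90.0 / 2 ** 10          # least count: 0.087890625 degrees

def encode_angle(angle_deg):
    """Nearest code for a non-negative angle below 180 degrees (bits [10:0])."""
    return round(angle_deg / LSB_DEG)

def decode_angle(code):
    """Angle in degrees represented by a code in bits [10:0]."""
    return code * LSB_DEG

def build_lut(iterations=8):
    """Codes for atan(2^-i), i = 0 .. iterations-1."""
    return [encode_angle(math.degrees(math.atan(2.0 ** -i)))
            for i in range(iterations)]

if __name__ == "__main__":
    for i, code in enumerate(build_lut()):
        actual = math.degrees(math.atan(2.0 ** -i))
        print(f"i={i}  code={code:011b}  represented={decode_angle(code):9.5f}  "
              f"actual={actual:9.5f}")
```

At i = 0 the code has only bit[9] set, which corresponds to the 45 degree entry mentioned above.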
Sample # 1

Input Vector   Verilog O/P   Verilog O/P / 1.647   MATLAB O/P
128            183
45             57
21             64            38.86
148            47            28.54
61             15            9.11
0              47            28.54
255            64            38.86
200            57            34.6
Sample # 2

Input Vector   Verilog O/P   Verilog O/P / 1.647   MATLAB O/P
5              92            55.86                 53.5
100            27            16.39
0              47            28.54
62             71            43.11
12             17            10.32
9              71            43.11
240            47            28.54
0              27            16.39
Sample # 3 Input Vector Verilog O/P 1.647 MATLAB O/P 255 0 0 255 4039 255 255 255 255 255 255 255
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0 Sample # 5
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
Input Vector   Verilog O/P   Verilog O/P / 1.647   MATLAB O/P
100            48            29.15                 29
30             19            11.54
13             21            12.75
11             15            9.11
19             8             4.86                  5
28             15            9.11                  8.855
4              21            12.75                 12.993
27             19            11.54                 11.717
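The reference values for the input vector 100, 30, 13, 11, 19, 28, 4, 27 can be reproduced with a short floating-point calculation. The Python sketch below computes the DFT magnitudes directly; dividing by N = 8 is an assumption inferred from the tabulated values (for example, the sum of the inputs is 232 and the first reference value is 29), not something the report states explicitly.

```python
import cmath

def dft_magnitudes(x, scale):
    """Magnitudes of the DFT of x, each divided by `scale`."""
    n = len(x)
    mags = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        mags.append(abs(acc) / scale)
    return mags

if __name__ == "__main__":
    sample = [100, 30, 13, 11, 19, 28, 4, 27]
    for k, mag in enumerate(dft_magnitudes(sample, scale=len(sample))):
        print(f"k={k}  |X[k]|/8 = {mag:.3f}")
```

Because the input is real, the magnitudes are symmetric about k = 4, which matches the symmetry visible in the Verilog output column.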
5. Future Work
5.1 Further Improvement in Architecture.
One way in which the present implementation can be improved is by changing the input-output process. At present the input-output block remains idle while processing is going on: we cannot enter a new set of data until the previously entered set has been completely computed. The proposed architectural modification ensures that the input and output blocks do not stay idle while the computation of one set is going on. This leads to a kind of pipelined input-output architecture for the whole block.
Fig. 5.1.1: Suggested Improvement in Input Output Architecture

8 bits are entered serially into the 8 shift registers. Only after the 8 clock pulses are the 8 sets of numbers entered into the block for actual processing. We know that the processing will require 12 more clock pulses. This time is utilized to enter new sets of data into the shift registers. Similarly, previously computed sets of data after the VECTORING CORDIC can be shifted out in the same way. This will require additional hardware, but there will be a considerable improvement in the time complexity.
References:
[1] J. G. Proakis and D.G. Manolakis, Digital Signal Processing, Principles, Algorithms and Applications. 3rd Edition, 1998, Prentice Hall India Publications.
[2] B. Das and S. Banerjee, Some Studies on VLSI Based Signal Processing for
[3] Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays.
Web References:
http://www.dspguru.com/info/faqs/cordic.htm