High Resolution Single-Chip Radix II FFT Processor for High-Tech Application
Rozita Teymourzadeh
Additional information is available at the end of the chapter
http://dx.doi.org/10.5772/66745
Abstract
Electrical motors are vital components of many industrial processes, and their failure can halt a production line. Motor functionality and behavior should therefore be monitored to avoid catastrophic production failures. A high-performance DSP processor for electrical harmonic analysis, realized as an embedded system, is a significant tool for this purpose. This chapter introduces the embedded design of a novel 1024‐point FFT processor architecture for high-performance harmonic measurement techniques. Pipelining and parallel implementation are incorporated into the FFT algorithm in order to enhance performance, and floating-point arithmetic is used to realize a higher-precision FFT. Since floating‐point architecture limits the maximum clock frequency and increases power consumption, the chapter focuses on improving the speed, area, resolution, power consumption, and latency of the FFT. It illustrates the very large‐scale integration (VLSI) implementation of the floating‐point parallel pipelined (FPP) 1024‐point Radix II FFT processor with a novel architecture that uses only a single butterfly in cooperation with an intelligent controller. The functionality of the conventional Radix II FFT was verified as embedded in FPGA prototyping. For area and power consumption, the proposed Radix II FPP‐FFT was optimized in ASIC under Silterra 0.18 µm and Mimos 0.35 µm technology libraries.
Keywords: FFT, butterfly, Radix, floating point, high speed, FPGA, embedded, VLSI
1. Introduction
The prevalent subject of Fourier analysis encompasses a vast spectrum of mathematics where
parts may appear quite different at first glance. In Fourier analysis, the term Fourier transform
often refers to the process that decomposes a given function into the harmonics domain. This
process results in another function that describes what frequencies are in the original function.
© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
68 Fourier Transforms - High-tech Application and Current Trends
Meanwhile, the transformation is often given a more specific name depending upon the domain
and other properties of the function being transformed.
At the heart of most DSP processors is the discrete Fourier transform (DFT) [1]. The DFT is a Fourier representation of a finite‐length sequence and is the most fundamental operation in digital signal processing and communication systems [2, 3]. However, the direct evaluation of an N‐point DFT involves a long computation time and large power consumption [4]. As a result of these problems, it is important to develop a fast algorithm. There are numerous viewpoints that can be taken toward the derivation and interpretation of the DFT representation of a finite‐duration sequence. Consider a sequence x̃(n) that is periodic with period N, so that x̃(n) = x̃(n + kN) for any integer value of k. It is possible to represent x̃(n) in terms of a Fourier series, that is, by a sum of sine and cosine sequences or, equivalently, complex exponential sequences with frequencies that are integer multiples of the fundamental frequency 2π/N associated with the periodic sequence. The same representation can be applied to a finite‐duration sequence. The resulting Fourier representation for finite‐duration sequences will be referred to as the DFT: a sequence of length N can be represented by a periodic sequence with period N, one period of which is identical to the finite‐duration sequence.
The frequency-domain representation of a sampled sequence is defined as

X(ω) = ∑_{n=−∞}^{∞} x(n)e^{−jωn}    (1)
X(ω) is a function of the continuous frequency variable ω, and the summation in Eq. (1) extends toward positive and negative infinity. Therefore, Eq. (1) is a theoretical Fourier transform of a digital signal; it cannot be implemented for real applications. The digital signal x(n) is the sample of the signal in the time domain at a particular time instant tn and can be expressed as:
x(n) = ∫_0^∞ x(t)δ(t − tn) dt    (2)
The frequency analysis of a finite‐length sequence is equal to the sample of continuous fre‐
quency variable ω at N equally spaced frequencies ωk = 2πk/N for k = 0, 1, 2, …, N ‐ 1 on the
unit circle. These frequency samples are expressed as:
X(k) = X(ωk),  ωk = 2πk/N

X(k) = ∑_{n=0}^{N−1} x(n)e^{−j2πkn/N} = ∑_{n=0}^{N−1} x(n)W_N^{kn},  k = 0, 1, …, N − 1    (3)
The DFT is based on the assumption that the signal x(n ) is periodic. Therefore, X(k) for k = 0, 1,
…, N ‐ 1 can uniquely represent a periodic sequence x(n) of period N. The inverse DFT is the
High Resolution Single-Chip Radix II FFT Processor for High-Tech Application 69
http://dx.doi.org/10.5772/66745
reversed process of the DFT. It converts the frequency spectrum X(k) back to the time domain
signal x(n) [5]:
x(n) = (1/N) ∑_{k=0}^{N−1} X(k)e^{j2πkn/N} = (1/N) ∑_{k=0}^{N−1} X(k)W_N^{−kn},  n = 0, 1, …, N − 1    (5)
Direct computation of an N‐point DFT according to Eq. (5) requires N(N − 1) complex additions and N² complex multiplications. The complexity of computing an N‐point DFT is therefore O(N²). The high computational complexity of the DFT algorithm and the need for an efficient Fourier processor led to the introduction of the fast Fourier transform (FFT).
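To make the O(N²) cost concrete, a direct evaluation of Eq. (3) can be sketched in a few lines of software (an illustrative model, not the processor implementation described later in this chapter):

```python
import cmath

def dft(x):
    """Direct evaluation of the N-point DFT: X(k) = sum_n x(n) W_N^{kn},
    with W_N = exp(-j*2*pi/N).  Costs O(N^2) complex operations."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# A single complex exponential at bin 3 concentrates all energy in X(3).
N = 8
x = [cmath.exp(2j * cmath.pi * 3 * n / N) for n in range(N)]
X = dft(x)   # X[3] ~ 8, all other bins ~ 0
```

The double loop over k and n is exactly where the N² complex multiplications arise.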
In 1965, Cooley and Tukey [6] developed the FFT in order to save time and avoid unnecessary complex calculations. The FFT algorithm computes an N‐point forward DFT or inverse DFT (IDFT), where N is a power of 2 (N = 2^M). It divides the N‐point data into two N/2‐point series and performs the DFT on each series individually, resulting in two DFTs of (N/2)² complexity each, as compared with the original N² operations of the N‐point DFT. The process of dividing can be continued until a 2‐point DFT is reached. The FFT is thus a family of algorithms that efficiently implements the DFT. Table 1 shows the comparison between direct DFT and FFT calculation for different values of N.
Number of points    Direct DFT                                      FFT
                    Complex additions   Complex multiplications     Complex additions   Complex multiplications
4                   12                  16                          8                   4
8                   56                  64                          24                  12
16                  240                 256                         64                  32
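The operation counts in Table 1 follow directly from the closed-form expressions: N(N − 1) additions and N² multiplications for the direct DFT, versus N·log2(N) additions and (N/2)·log2(N) multiplications for the radix-2 FFT. A small sketch reproduces the table:

```python
from math import log2

def dft_op_counts(N):
    """Direct DFT: N(N-1) complex additions and N^2 complex multiplications."""
    return N * (N - 1), N * N

def fft_op_counts(N):
    """Radix-2 FFT: N*log2(N) complex additions, (N/2)*log2(N) multiplications."""
    m = int(log2(N))
    return N * m, (N // 2) * m

# Each row: (N, DFT adds, DFT mults, FFT adds, FFT mults)
rows = [(N, *dft_op_counts(N), *fft_op_counts(N)) for N in (4, 8, 16)]
# rows[1] -> (8, 56, 64, 24, 12), matching the N = 8 line of Table 1
```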
To calculate the FFT, there are two well‐known methods, identified as decimation‐in‐time (DIT‐FFT) and decimation‐in‐frequency (DIF‐FFT) calculations [7–9]. In general, the FFT processor has many variants in terms of Fourier calculation. The different types of FFT algorithms include:
• Different Radixes, such as Radix II, Radix IV, etc., and mixed‐radix algorithms.
• DIT and DIF.
• Real and complex algorithm.
Based on the DFT definition and combination of the FFT concept, X(k) can be written as:
X(k) = ∑_{n=0}^{N−1} x(n)W_N^{kn} = ∑_{m=0}^{(N/2)−1} x(2m)W_N^{2mk} + ∑_{m=0}^{(N/2)−1} x(2m + 1)W_N^{(2m+1)k}    (7)
Since W_N^{2mk} = W_{N/2}^{mk}, the equation simplifies to:
X(k) = ∑_{m=0}^{(N/2)−1} x_e(m)W_{N/2}^{mk} + W_N^k ∑_{m=0}^{(N/2)−1} x_o(m)W_{N/2}^{mk},  k = 0, 1, …, N − 1    (8)
where W_N^k is the complex twiddle factor with unit amplitude and varying phase angle. The 8‐point FFT utilizes the twiddle factors W_N^0 to W_N^7, and the first twiddle factor is W_N^0 = 1. All twiddle factors are distributed around the unit circle. Figure 1 shows the twiddle factors for the 8‐point Fourier transform. The twiddle factor W_N^k repeats itself after every multiple of N: the twiddle factors are periodic, and for the 8‐point FFT the twiddle factors with indices 0 and 8 are equal.
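The unit amplitude and the periodicity of the twiddle factors can be checked numerically (a small sketch, using the definition W_N^k = e^{−j2πk/N}):

```python
import cmath

def twiddle(N, k):
    """W_N^k = exp(-j*2*pi*k/N), the k-th twiddle factor on the unit circle."""
    return cmath.exp(-2j * cmath.pi * k / N)

# For an 8-point FFT the factors W_8^0 .. W_8^7 tile the unit circle:
# W_8^0 = 1, every factor has magnitude 1, and the sequence is periodic,
# so W_8^8 equals W_8^0.
W = [twiddle(8, k) for k in range(8)]
```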
Butterfly calculation is the fundamental concept of the FFT algorithm and 8‐point butterfly
structure is shown in Figure 2.
The Radix II butterfly FFT is decomposed into M stages, where M = log2 N. In each stage, N/2 complex multiplications by the twiddle factors and N complex additions are required. Therefore, the total computational requirement is (N/2)log2 N complex multiplications and N log2 N complex additions. The expansion of the Radix II butterfly calculation for 8 data points is shown in Figure 3.
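As a sketch of how Eq. (8) becomes an algorithm, the even/odd split can be applied recursively until the trivial 1‐point DFT is reached (an illustrative software model, not the hardware architecture described later in this chapter):

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 DIT FFT implementing Eq. (8):
    X(k) = E(k) + W_N^k * O(k), where E and O are the N/2-point FFTs of
    the even- and odd-indexed samples.  len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    E = fft_dit(x[0::2])          # even-indexed samples x_e(m)
    O = fft_dit(x[1::2])          # odd-indexed samples x_o(m)
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * O[k]   # W_N^k * O(k)
        X[k] = E[k] + t
        X[k + N // 2] = E[k] - t   # symmetry W_N^{k+N/2} = -W_N^k
    return X
```

The pair of outputs computed from one twiddle product is precisely the butterfly of Figure 2.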
The DIF‐FFT calculation is similar to the DIT‐FFT algorithm. In this case, the time-domain sequence is divided into two subsequences of N/2 samples each, and the DFT of x(n) is expressed as:
X(k) = ∑_{n=0}^{N−1} x(n)W_N^{nk} = ∑_{n=0}^{(N/2)−1} x(n)W_N^{nk} + ∑_{n=N/2}^{N−1} x(n)W_N^{nk}
     = ∑_{n=0}^{(N/2)−1} x(n)W_N^{nk} + W_N^{(N/2)k} ∑_{n=0}^{(N/2)−1} x(n + N/2)W_N^{nk}    (10)
Later, Eq. (11) is expanded into two parts comprising the even samples X(2k) and the odd samples X(2k + 1). Using the twiddle factor identities W_N^{2kn} = W_{N/2}^{kn} and W_N^{(2k+1)n} = W_N^n W_{N/2}^{kn}, Eq. (11) is simplified to:

X(2k) = ∑_{n=0}^{(N/2)−1} [x(n) + x(n + N/2)]W_{N/2}^{nk}    (12)

X(2k + 1) = ∑_{n=0}^{(N/2)−1} [x(n) − x(n + N/2)]W_N^n W_{N/2}^{nk},  k = 0, 1, …, (N/2) − 1
Similarly, the 8‐point DIF‐FFT structure is shown in Figure 4 with the detailed complex calculations in three stages. The output sequence X(k) of the DIF‐FFT is bit‐reversed, while in the DIT‐FFT it is the input sequence x(n) that is bit‐reversed. In addition, there is a slight difference in the calculation
of butterfly architecture. As shown in Figure 4, the complex multiplication is performed
before the complex addition or subtraction in the DIT‐FFT processor. In contrast, the complex
subtraction is performed before the complex multiplication in the DIF‐FFT. The process of
decomposition is continued until the last stage is reduced to the 2‐point DFT. Since the fre‐
quency samples in the DIF‐FFT are bit‐reversed, it is required to apply bit‐reversal algorithm
to the frequency samples. Likewise, the DIF‐FFT algorithm also uses in‐place computation.
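The bit-reversal permutation mentioned above is simply each index rewritten with its binary digits reversed; a short sketch:

```python
def bit_reverse_indices(N):
    """Index permutation that reorders an N-point sequence (N = 2^m) into
    bit-reversed order, as required at the input of the DIT-FFT or at the
    output of the DIF-FFT."""
    m = N.bit_length() - 1
    # Reverse the m-bit binary representation of each index.
    return [int(format(i, f'0{m}b')[::-1], 2) for i in range(N)]

# For N = 8 the permutation is [0, 4, 2, 6, 1, 5, 3, 7]:
# e.g. index 1 = 001b reverses to 100b = 4.
perm = bit_reverse_indices(8)
```

Because the permutation is its own inverse, applying it twice restores the natural order, which is why in-place FFTs need only one reordering pass.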
Unlike the DIF structure, input data in DIT‐FFT is in bit‐reverse format while the output is
sorted. On the other hand, both the DIT and DIF can go from normal to shuffled data or vice
versa. In order to apply Radix II FFT structures, DIT and DIF algorithms require the same
number of operations and bit‐reversal to compute the FFT calculation. The overall performance
of the FFT processor is dependent on the application, hardware implementation, and conve‐
nience. If the design is focused on high speed structure, the processor has to take the most
efficient approach and algorithm to perform the FFT calculation accordingly. In this chapter
DIT‐FFT architecture is considered for floating‐point implementation.
Measured frequency by FFT will be subjected to quantization noise error with respect to the
real frequency. This is caused by the fact that the FFT only computes the spectrum at dis‐
crete frequencies. This error is said to affect the accuracy. In addition, spectral leakage effect
becomes very significant when small amplitude harmonics are close to large amplitude ones
since they become hidden by the energy distribution of the larger harmonics. Furthermore,
the fixed internal arithmetic calculation generates white noise in frequency domain. To reduce
the generated noise effect and enhance signal strength, floating‐point technique is designed
and implemented. The floating‐point technique allows numbers to be represented with a
large dynamic range. Therefore, floating‐point arithmetic enables the reduction of overflow
problems that occur in fixed‐point arithmetic. Although it is at the expense of throughput and
chip area size, the new architecture is designed and investigated to avoid undesired effects in
floating‐point FFT algorithm. Floating‐point arithmetic provides higher precision and a much
larger dynamic range under IEEE 754 standard [10]. Therefore, floating‐point operations sup‐
port more accurate DSP operations. Table 2 compares the efficiency between fixed‐point and
the floating‐point FFT processor.
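The round-off and dynamic-range advantage of floating point can be illustrated with a toy fixed-point quantizer (the bit widths here are illustrative, not those of the proposed processor):

```python
def quantize_fixed(x, frac_bits):
    """Round x to a fixed-point grid with `frac_bits` fractional bits,
    modelling the round-off error of fixed-point arithmetic."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

# A small-amplitude sample vanishes entirely on an 8-bit fractional grid,
# while double-precision floating point retains it: this is the
# dynamic-range advantage that hides small harmonics in fixed-point FFTs.
v = 0.001234
q = quantize_fixed(v, 8)        # -> 0.0, i.e. 100% relative error
```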
In floating‐point format, the data are translated based on power and mantissa in the decimal
system. This notation can be expanded into the binary system. Representing the data in power
and mantissa system gives the data the capability of storing a much greater range of numbers
than if the binary point were fixed. Floating point refers to the fact that the radix point (the decimal point or, in computers, the binary point) can "float", that is, it can be placed anywhere relative to the significant digits of the number. A floating‐point representation, with the position of the radix point indicated separately in the internal representation, is thus the computer realization of scientific notation. Although the benefit of floating‐point representation over fixed‐point (and integer) representation is a much wider range of values, the floating‐point format needs more storage. Hence, the implementation of
high performance system requires applying efficient and fast floating‐point processor, which is
competitive with the fixed‐point processor. Various types of floating‐point representation have
been used in computers in the past. However, in the last decade, the IEEE 754 standard [10] has
defined the representation. According to the IEEE 754 standard [10], single precision is chosen to represent the floating‐point data. The IEEE standard specifies a way in which numbers can be represented in a 32‐bit or a 64‐bit binary word, referred to as single and double precision, respectively [11, 12]. In this project, single precision is selected. For
the 32‐bit numbers, the first bit (MSB) specifies the sign, followed by 8 bits for the exponent, and
the remaining 23 bits are used for the mantissa. This arrangement is illustrated in Figure 5. The
sign bit is set to zero if the number is positive, and the bit is set to 1 if the number is negative. The
mantissa bits are set to the fractional part of the mantissa in the original number in bits 22 to 0.
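The 1/8/23 field layout of Figure 5 can be inspected directly in software (a sketch using Python's struct module to reinterpret the 32-bit pattern):

```python
import struct

def fp32_fields(x):
    """Split an IEEE 754 single-precision number into its sign (1 bit),
    biased exponent (8 bits) and mantissa fraction (23 bits)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF     # biased by 127
    mantissa = bits & 0x7FFFFF         # fraction; hidden leading 1 not stored
    return sign, exponent, mantissa

# 1.0  = (-1)^0 * 1.0  * 2^(127-127): sign 0, exponent 127, fraction 0.
# -2.5 = (-1)^1 * 1.25 * 2^(128-127): sign 1, exponent 128, fraction 01000...b.
```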
Floating‐point algorithm finds huge demand in industry. To conclude this section, Table 3
summarizes the FFT algorithm application in fixed‐point and floating‐point architectures.
In 2009, Xilinx Logic core [13] introduced the FFT processor using the Radix structure on a
chip. The introduced FFT processors were designed to offer a trade‐off between core sizes and
transform time. These architectures are classified below:
• FFT Processor with Radix II pipelined serial I/O architecture
The pipelined serial I/O architecture allows continuous data processing, whereas the burst parallel I/O architecture loads and processes data separately using an iterative approach; it is smaller in size than the parallel architecture but has a longer transform time. The Radix II algorithm uses the same iterative approach as Radix IV, differentiated by a smaller butterfly size, yet its transformation time is longer. Finally, the last category, based on the Radix II architecture, uses a time-multiplexed approach to the butterfly for an even smaller core, at the expense of a longer transformation time. Figure 6 shows the throughput versus resources among the four architectures.
In this design, n stages of Radix II butterflies are connected in a serial structure. Each Radix II butterfly unit has its own RAM memory to upload and download data. The input data
are stored in the RAM while the processor simultaneously performs transform calculations
on the current frame of data and loads input data for the next frame of data and unloads
the result of the previous frame of data. Input data are presented in sorted order. The
unloaded output data can either be in bit‐reversed order or in sorted order. When sorted
output data are selected, an additional memory resource is utilized. Figure 7 illustrates the
architecture of the pipeline serial I/O with individual memory bank, which connects in a
serial structure.
The Radix IV structure accepts 4 input data simultaneously, whereas Radix II takes only 2 input data at a time to perform the FFT calculation. In the Radix IV structure, input data cannot be uploaded into the FFT processor while a calculation is underway. When the FFT is started, the data are loaded. After a full frame has been loaded, the core computes the transformation. The result
can be downloaded after the full process is over. The data loading and unloading processes can
be overlapped if the data are unloaded in digit‐reversed order. Figure 8 shows the Radix IV
structure when 4 input data are loaded for FFT calculation.
due to the sequence calculations. Figure 10 shows the Radix II lite structure when 2 input data
are loaded for FFT calculations.
Figure 10. FFT processor with Radix II lite burst I/O architecture [13].
In Section 2, FFT fundamentals were discussed and elaborated, and different FFT architectures were presented with details of their I/O configurations. Here, an advanced FFT processor focused on a 1024‐point floating‐point parallel architecture for high-performance applications is presented.
As shown in Figure 11, there are six major subprocessor units in the high‐tech 1024 point
Radix II FPP‐FFT algorithm. These units are shared memory, bit reverse, butterfly arithmetic,
smart controller, ROM, and finally address generator unit. The floating‐point input data act
as a variable streaming configuration into the processor. The variable streaming configuration
allows continuous streaming of input data and produces continuous stream of output data.
Figure 12 shows the internal schematic of the pipeline butterfly algorithm with the parallel
architecture at a glance.
To enhance the speed of calculation in Radix II butterfly algorithm, the pipeline registers
are located after each addition, subtraction, and multiplication subprocessors. Hence, the
pipeline butterfly algorithm keeps the final result in the register to be transferred into the
RAM by the next clock cycle. Additionally, the parallel architecture splits the data in real
and imaginary format and increases the speed of FFT calculation by 50%. As a result of
the design algorithm, the Radix II FPP‐FFT processor calculates the 1024‐point floating‐point FFT after exactly (N/2)log2 N + 11 clock pulses, which demonstrates the performance improvement in comparison with similar Radix II FFT architectures. The existence of the 11-clock-pulse delay
is due to 11 pipeline registers in adder, subtraction, and multiplier in a serial butterfly
block. Additionally, parallel design of the FFT algorithm decreases the calculation time
significantly.
The Radix II butterfly unit is responsible for calculating the complex butterfly equations output1 = input1 + W^k × input2 and output2 = input1 − W^k × input2. To calculate the butterfly equation, it is necessary to initialize the RAM in bit‐reversed format, and the external processor loads the data into the RAM. Since the butterfly equation deals with complex data, each butterfly requires four multiplication units (two for the real and two for the imaginary part) and six addition units (three for the real and three for the imaginary part). Fixed‐point
implementation of such complex calculation does not satisfy high‐tech application of FFT
processor due to the generated noise of round‐off, overflow, and coefficient quantization
errors [14]. Consequently in order to reduce the error as well as to achieve high‐resolu‐
tion output, the floating‐point adders and subtractions are used to replace the fixed‐point
arithmetic units.
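A software sketch of the butterfly makes the operation count explicit (real/imaginary pairs stand in for the hardware's parallel data paths):

```python
def butterfly(in1, in2, w):
    """Radix-2 butterfly: out1 = in1 + W^k*in2, out2 = in1 - W^k*in2.
    The complex product costs 4 real multiplications and 2 real additions;
    the complex add and subtract cost 4 more real additions, giving the
    4 multipliers and 6 adders discussed above."""
    ar, ai = in1
    br, bi = in2
    wr, wi = w
    # complex multiplication W^k * in2 (4 multiplications, 2 additions)
    pr = wr * br - wi * bi
    pi = wr * bi + wi * br
    # complex addition and subtraction (4 additions)
    return (ar + pr, ai + pi), (ar - pr, ai - pi)
```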
Butterfly processor efficiency greatly depends on its arithmetic units, and the high‐speed floating‐point adder is the bottleneck of the butterfly calculation. Based on the IEEE‐754 standard [10] for floating‐point arithmetic, a 32‐bit data register is considered, allocating the mantissa, exponent, and sign bit in portions of 23, 8, and 1 bits, respectively. The advantage of the floating‐point adder is that the exponent bias is applied to complete the calculation, so that exponents can be handled as unsigned values.
Additionally, the floating‐point adder unit performs the addition and subtraction using sub‐
stantially the same hardware as used for the floating‐point operations. This functionality
minimizes the core area by minimizing the number of elements. Furthermore, each block
of the floating‐point adder/subtractor performs its arithmetic calculation within only one clock cycle, which results in high throughput and low latency for the entire FFT processor. Figure 13
shows the novel structure of the floating‐point adder when it is divided into four separate
blocks while detail algorithm is presented in Figure 14.
The purpose of having separate blocks is to divide the total critical path delay among three equal blocks, each of which calculates its arithmetic function within one clock cycle. Otherwise, the propagation delay associated with continuous assignment would increase the overall critical path delay and slow down the throughput. Based on combinational design,
the output of each stage depends on its input value at that time. The unique structure of the floating‐point adder enables feeding the output result into the pipeline registers after every clock cycle. Hence, a sequential structure is applied to the overall pipelined add/subtract algorithm to combine the stages. The processing flow of the floating‐point addition/subtraction operation consists of comparison, alignment, addition/subtraction, and normalization stages.
The comparison stage compares the two input exponents and provides the result for the next stage. The comparison is made by two subtraction units, and the result is revealed by the compare_sign bit.
According to the results of the comparison stage, the alignment stage shifts the mantissa
and transfers it to the adder/subtraction stage. The number of shifting will be selected by the
comparison stage output. Consequently, each stage of the floating‐point adder algorithm is
executed within one clock cycle. The floating‐point adder/subtraction unit achieves high speed and arithmetic efficiency at the cost of die area. The floating‐point arithmetic unit is designed to calculate all numbers regardless of sign. As shown in Figure 15, there are logic gates involved in the stages, which cause a higher propagation delay through the circuit.
Floating‐point numbers are generally stored in registers as normalized numbers. This means
that the most significant bit of the mantissa has a nonzero value. Employing this method
allows the most accurate value of a number to be stored in a register. For this purpose, the
normalization stage is required. This unit is located after the add/sub stage. The output of the add/sub block may contain leading zeros, that is, an unnormalized result of the calculation. The normalization block removes the leading zeros and shifts the mantissa so that its MSB holds a digital value of one.
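The compare / align / add / normalize flow can be sketched on simplified operands (positive numbers only, mantissas carrying the hidden leading 1, no rounding or special values — a model of the data flow, not the chapter's VHDL design):

```python
def fp_add(e1, m1, e2, m2, frac_bits=23):
    """Sketch of the compare/align/add/normalize pipeline for two positive
    floating-point operands given as (biased exponent, mantissa including
    the hidden leading 1).  Rounding, signs and special values omitted."""
    # 1. comparison: identify the operand with the larger exponent
    if e1 < e2:
        e1, m1, e2, m2 = e2, m2, e1, m1
    # 2. alignment: right-shift the smaller mantissa by the exponent gap
    m2 >>= (e1 - e2)
    # 3. addition
    m = m1 + m2
    e = e1
    # 4. normalization: restore the hidden-1 position after mantissa overflow
    while m >= (1 << (frac_bits + 1)):
        m >>= 1
        e += 1
    return e, m

# 1.0 + 1.0: both operands are (127, 1<<23); the sum overflows the mantissa
# and normalization yields (128, 1<<23), i.e. 2.0.
```

Each numbered step corresponds to one of the pipeline stages named above, which is how the one-result-per-clock-cycle behavior is obtained.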
In a floating‐point multiplier, the bias power format is applied to avoid having negative expo‐
nent in the data format. Additionally, the multiplier is designed as pipelined structure to
enhance the calculation speed: the initial result appears after the latency period, after which a result can be obtained every clock cycle. The multiplier offers low latency and high throughput and is IEEE 754 compliant. This design allows a trade‐off between the clock frequency and the overall latency by adding pipeline stages.
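A matching sketch of the floating-point multiplier shows the biased-exponent handling described above (again positive operands only, no rounding or special values):

```python
def fp_mul(e1, m1, e2, m2, bias=127, frac_bits=23):
    """Sketch of floating-point multiplication: add the biased exponents and
    subtract the bias once, multiply the mantissas (hidden 1 included), then
    renormalize.  Signs, rounding and special values omitted."""
    e = e1 + e2 - bias            # (e1-bias)+(e2-bias)+bias: remove doubled bias
    m = (m1 * m2) >> frac_bits    # product of 1.f mantissas, rescaled
    if m >= (1 << (frac_bits + 1)):   # product in [2, 4): shift right once
        m >>= 1
        e += 1
    return e, m

# 1.5 * 2.0 = 3.0: operands (127, 3<<22) and (128, 1<<23) give (128, 3<<22).
```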
Smart controller unit significantly affects the efficiency of the 1024 Radix II FPP‐FFT proces‐
sor. As such, small die area can be achieved by designing high performance controller for
the FFT processor. In this architecture, FFT controller is designed with the pipeline capabil‐
ity. The global controller unit provides the signal control to the different parts of the FFT
processor. Additionally, several paths are switched between the data input and data output
in the architecture design, and the data path is controlled. To calculate the 1024‐point Radix II FFT, it is necessary to have log2 N stages, which is 10 stages for 1024‐point data. Furthermore, each stage calculates N/2 butterflies, which is 512 butterfly calculations in this design. Hence, there are two counters in cooperation with the controller to count the stage number of the processor and the number of butterfly calculations. The smart controller, in collaboration with the address generator unit, calculates the 1024‐point floating‐point FFT using only one butterfly structure. This functionality contributes greatly to power saving as well as die area reduction. Figure 18 shows the smart controller state machine, which controls the flow of the 1024‐point floating‐point Radix II FFT processor.
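The stage counter and butterfly counter described above effectively enumerate an address schedule. A sketch of such a schedule for an in-place radix-2 DIT FFT follows (the exact address mapping of the chapter's generator is not given, so this is one standard choice):

```python
def butterfly_schedule(N):
    """Yield (stage, top, bottom, twiddle) tuples for an in-place radix-2
    DIT FFT on bit-reversed input: log2(N) stages of N/2 butterflies, the
    sequence a stage counter plus butterfly counter would step through."""
    m = N.bit_length() - 1
    for stage in range(m):
        half = 1 << stage            # butterfly span within this stage
        step = N >> (stage + 1)      # twiddle-index stride for this stage
        for j in range(N // 2):      # butterfly counter
            group = (j // half) * (2 * half)
            pos = j % half
            top = group + pos
            yield stage, top, top + half, pos * step

# For N = 1024: 10 stages x 512 butterflies = 5120 butterfly operations,
# all served by the single shared butterfly unit.
sched = list(butterfly_schedule(1024))
```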
There are several control signals in the smart controller to ensure the presence of correct output after finishing the current cycle of FFT calculation. The control signals transfer information through
the RAM, ROM, butterfly preprocessor, and address generator. The designed controller oper‐
ates according to the provided state machine (Figure 18) and makes the high performance FFT
calculation feasible for implementation. The controller unit is structured into subblocks, namely sequential and combinational units. The sequential unit is responsible for updating the state of the processor, while the combinational unit performs the states individually. The state machine waits for the processor core to complete the entire FFT calculation and then records the data points into the memory. The reset state is entered every time the reset input is asserted, which halts the entire calculation. The processor is activated again after the reset input signal is removed.
board memory system. Furthermore, each complex RAM has the capability of saving real
and imaginary input data separately. The module is programmed with a dual‐in‐line header to provide the appropriate location for storing the input and output results of each stage. It is composed of two delay memories and a multiplexer, which allows straight
through or crossed input‐output connection as required in the pipeline algorithm. The memory unit similarly contains the controller trigger. The controller, which is connected directly to the memory modules, takes responsibility for transferring data between the memory and arithmetic blocks, ensuring that no data conflict occurs within the complete process of the
FFT calculations. This is another advantage of high‐tech smart memory modules, by which
data can be read and written in the memory simultaneously without sending bubble data in
the FFT processor.
In order to verify the functionality of the 1024‐point FPP‐FFT processor, the VHDL code for the
overall processor is developed. Register transfer level (RTL) behavior description of the pro‐
cessor is generated for downloading into FPGA prototyping. The procedure is continued by
attaching the library cell and constraint file for ASIC implementation. High performance FFT
is transferred into gate-level synthesis to complete the post-simulation stage. The design moves forward to the back‐end implementation with the 0.18 µm Silterra and 0.35 µm Mimos technology libraries. The generated netlist with the constraint file is transferred to complete the floor planning and place-and-route stages. The implementation process is summarized in Figure 20.
The high‐tech 1024‐point FPP‐FFT specification generated by Xilinx ISE synthesis report is
provided in Table 4.
As stated in Table 4, the high‐tech FFT processor operates at a maximum clock frequency of 227.7 MHz with a total latency of 5131 clock cycles (Figure 21), which confirms the computational complexity (N/2)log2 N + 11 when N = 1024.
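The cycle count can be checked against the formula (the 11-cycle term being the pipeline-register fill discussed earlier):

```python
from math import log2

def fft_latency_cycles(N, pipeline_regs=11):
    """Clock cycles for the proposed Radix II FPP-FFT:
    (N/2)*log2(N) butterfly cycles plus the 11-register pipeline fill."""
    return (N // 2) * int(log2(N)) + pipeline_regs

# 1024 points -> 512 * 10 + 11 = 5131 cycles;
# at 227.7 MHz this corresponds to about 22.5 us of latency.
cycles = fft_latency_cycles(1024)
```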
Place and route (PAR) process was completed and the processor routed successfully on silicon
chip (Figure 22).
Later, the 1024‐point FPP‐FFT processor was optimized in Silterra 0.18 µm and Mimos 0.35 µm
technology for power consumption and die size measurement in maximum clock frequency.
Table 5 shows the optimization result of FFT processor implementation in Silterra 0.18 µm and
Mimos 0.35 µm technology library.
LUT slices: 4419 (23%)      Min. input arrival time (ns): 3.788
Logic slices: 2584 (13%)    Max. output required time (ns): 6.774
Multiplexers: 77
Tri‐states: 98
Table 5. Optimized power consumption and die area size in different technology library.
To conclude, after FPGA implementation and ASIC optimization, and considering the available software and hardware resources, the high‐tech 1024‐point Radix II FPP‐FFT processor was implemented and tested in FPGA prototyping under the Xilinx ISE software and Synopsys CAD tools. Figure 23 shows the relevant FPGA board, and Table 6 summarizes the design properties.
Latency (µs): 22
Accuracy: ≤0.01
In this chapter, a high‐tech 1024‐point Radix II FFT processor was implemented. The design was launched by introducing a 32‐bit single-precision floating‐point parallel pipelined architecture, followed by implementing subcomponents such as the Radix II butterfly and the smart controller. The implementation results of the high‐tech 1024‐point Radix II FPP‐FFT processor were provided accordingly. Designing high-speed floating‐point arithmetic units such as the adder/subtractor (278 MHz) and multiplier (322 MHz), implementing a smart controller to save area and increase system efficiency, designing the processor as a single chip with complex dual memory, and providing a pipelined and parallel architecture together yield a high‐tech 1024‐point Radix II FPP‐FFT processor. In addition, the processor was synthesized using the Xilinx ISE platform. From the synthesis report, the FPP‐FFT processor reaches a maximum clock frequency of 227 MHz, and the latency for calculating the 1024‐point FFT is 22 µs. After FPGA implementation, the proposed processor was optimized in ASIC under the Silterra 0.18 µm and Mimos 0.35 µm technology libraries. The estimated power consumption was 640 mW in the Silterra and 1.198 W in the Mimos technology library, at a sample rate of 25 MS/s. The procedure was followed by defining the constraints and the netlist (gate level) to produce the ASIC layout. The design compiler result shows a die size of 2.32 × 2.32 mm² in Silterra 0.18 µm technology and 4.256 × 4.256 mm² in Mimos 0.35 µm technology. From the given specification, the high‐tech 1024‐point Radix II FPP‐FFT processor is suitable for high-performance DSP applications.
Author details
Rozita Teymourzadeh
References
[1] Bergland, G. D. A guided tour of the fast Fourier transform. IEEE Spectrum Conference.
1969; pp. 41–52.
[2] Gold, B., Rader, C. M. Digital Processing of Signals. New York: McGraw‐Hill; 1969.
[3] Smith, J. O. Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications. 2nd ed. Stanford, California: W3K Publishing; 2007. ISBN 978‐0‐9745607‐4‐8.
[4] Alegre P. Low Power QDI Asynchronous FFT. 2016 22nd IEEE International Symposium
on Asynchronous Circuits and Systems (ASYNC). 2016;978‐1‐4673‐9008‐8; pp. 87–88.
DOI: http://doi.ieeecomputersociety.org/10.1109/ASYNC.2016.17
[5] Kuo, S. M., Gan, W.‐S. Digital Signal Processors: Architectures, Implementations and Applications. Singapore: Prentice Hall, Pearson Education International; 2005. ISBN 0131277669.
[6] Cooley, J. W., Tukey, J. W. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation. 1965;19:297–301.
[10] IEEE Standard for Binary Floating‐Point Arithmetic. IEEE Std 754‐1985; 1985. pp. 1–17.
[11] Madeira, P. A Low Power 64‐point Bit‐Serial FFT Engine for Implantable Biomedical
Applications. IEEE 2015 Euromicro Conference on Digital System Design (DSD) (2015).
2015; 978‐1‐4673‐8034‐8; pp. 383–389. DOI: http://doi.ieeecomputersociety.org/10.1109/
DSD.2015.30
[12] Ahmedabad. Implementation of Input Data Buffering and Scheduling Methodology
for 8 Parallel MDC FFT. 2015 19th International Symposium on VLSI Design and Test
(VDAT) (2015). 2015;978‐1‐4799‐1742‐6; pp. 1–6. DOI: http://doi.ieeecomputersociety.
org/10.1109/ISVDAT.2015.7208107
[13] Xilinx Logic core. Fast Fourier transform. Xilinx; 2009; Version 7.0.DS260; pp. 1–64.
[14] Ifeachor, E. C., Jervis, B. W. Digital Signal Processing: A Practical Approach. 2nd ed. Prentice Hall; 2002.