
Fast Fourier Transforms

An FFT is an efficient algorithm to compute the discrete Fourier transform (DFT) in O(N log N) time rather than the O(N^2) time of direct computation. The most common FFT algorithm is the Cooley-Tukey algorithm, which recursively breaks down a DFT into smaller DFTs. While Cooley-Tukey is generally used for power-of-two sizes, it can be generalized to other factorizations. Many other FFT algorithms have been developed, but most focus on further reducing the number of operations needed for computation. The minimum complexity of FFTs remains an open problem, though lower bounds have been proven for certain cases.

Uploaded by

ekichi_onizuka
Copyright
© Attribution Non-Commercial (BY-NC)

A fast Fourier transform (FFT) is an efficient algorithm to compute the discrete Fourier
transform (DFT) and its inverse. FFTs are of great importance to a wide variety of
applications, from digital signal processing and solving partial differential equations to
algorithms for quick multiplication of large integers. This article describes the algorithms,
of which there are many; see discrete Fourier transform for properties and applications of
the transform.

Let $x_0, \ldots, x_{N-1}$ be complex numbers. The DFT is defined by the formula

$$X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i k n / N}, \qquad k = 0, \ldots, N-1.$$

Evaluating these sums directly would take O(N^2) arithmetical operations. An FFT is an
algorithm to compute the same result in only O(N log N) operations. In general, such
algorithms depend upon the factorization of N, but (contrary to popular misconception)
there are FFTs with O(N log N) complexity for all N, even for prime N.
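
As a point of reference, the defining sum can be evaluated directly. The sketch below assumes the usual forward-transform convention (negative sign in the exponent, no normalization factor); `dft_direct` is an illustrative name, not one taken from the article.

```python
import cmath

def dft_direct(x):
    """Evaluate X_k = sum_n x_n * exp(-2*pi*i*k*n/N) term by term.

    Each of the N outputs needs N multiply-adds, hence O(N^2) operations.
    """
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

X = dft_direct([1, 2, 3, 4])
print(X)
```

Every FFT discussed below should agree with this routine up to floating-point rounding, which makes it a convenient baseline for small N.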

Many FFT algorithms only depend on the fact that $e^{-2\pi i/N}$ is a primitive N-th root of unity, and thus
can be applied to analogous transforms over any finite field, such as number-theoretic
transforms.

Since the inverse DFT is the same as the DFT, but with the opposite sign in the exponent
and a 1/N factor, any FFT algorithm can easily be adapted for it as well.
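
One common way to make this adaptation without touching the forward routine is to conjugate the input and the output, which flips the sign in the exponent. The sketch below is a minimal illustration; `dft` is a direct-summation stand-in for any forward FFT, and both function names are assumptions.

```python
import cmath

def dft(x):
    """Forward DFT by direct summation (stand-in for any FFT routine)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X, forward=dft):
    """Inverse DFT built from a forward transform.

    Conjugating the input and the output flips the sign in the exponent;
    dividing by N supplies the 1/N factor.
    """
    N = len(X)
    conj_in = [complex(c).conjugate() for c in X]
    return [v.conjugate() / N for v in forward(conj_in)]

x = [1, 2j, -3, 4.5]
roundtrip = idft(dft(x))
```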

The Cooley-Tukey algorithm


By far the most common FFT is the Cooley-Tukey algorithm. This is a divide and
conquer algorithm that recursively breaks down a DFT of any composite size N = N1N2
into many smaller DFTs of sizes N1 and N2, along with O(N) multiplications by complex
roots of unity traditionally called twiddle factors (after Gentleman and Sande, 1966).

This method (and the general idea of an FFT) was popularized by a publication of J. W.
Cooley and J. W. Tukey in 1965, but it was later discovered that those two authors had
independently re-invented an algorithm known to Carl Friedrich Gauss around 1805 (and
subsequently rediscovered several times in limited forms).

The most well-known use of the Cooley-Tukey algorithm is to divide the transform into
two pieces of size N / 2 at each step; this form is therefore limited to power-of-two sizes, but
any factorization can be used in general (as was known to both Gauss and
Cooley/Tukey). These are called the radix-2 and mixed-radix cases, respectively (and
other variants such as the split-radix FFT have their own names as well). Although the
basic idea is recursive, most traditional implementations rearrange the algorithm to avoid
explicit recursion. Also, because the Cooley-Tukey algorithm breaks the DFT into
smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT, such
as those described below.
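
The radix-2 case can be sketched as a short recursive routine. This is a minimal illustration that assumes a power-of-two length and favors clarity over speed; `fft_radix2` is an illustrative name.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT for power-of-two lengths."""
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        # Twiddle factor exp(-2*pi*i*k/N) applied to the odd half.
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out

spectrum = fft_radix2([1, 2, 3, 4, 5, 6, 7, 8])
```

Each level of recursion does O(N) work and there are log2(N) levels, which is where the O(N log N) total comes from.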

Other FFT algorithms


There are other FFT algorithms distinct from Cooley-Tukey. For N = N1N2 with coprime
N1 and N2, one can use the Prime-Factor (Good-Thomas) algorithm (PFA), based on the
Chinese Remainder Theorem, to factorize the DFT similarly to Cooley-Tukey but
without the twiddle factors. The Rader-Brenner algorithm (1976) is a Cooley-Tukey-like
factorization but with purely imaginary twiddle factors, reducing multiplications at the
cost of increased additions and reduced numerical stability. Algorithms that recursively
factorize the DFT into smaller operations other than DFTs include the Bruun and QFT
algorithms. (The Rader-Brenner and QFT algorithms were proposed for power-of-two
sizes, but it is possible that they could be adapted to general composite N. Bruun's
algorithm applies to arbitrary even composite sizes.) Bruun's algorithm, in particular, is
based on interpreting the FFT as a recursive factorization of the polynomial z^N − 1, here
into real-coefficient polynomials of the form z^M − 1 and z^(2M) + a z^M + 1.

Another polynomial viewpoint is exploited by the Winograd algorithm, which factorizes
z^N − 1 into cyclotomic polynomials; these often have coefficients of 1, 0, or −1, and
therefore require few (if any) multiplications, so Winograd can be used to obtain
minimal-multiplication FFTs and is often used to find efficient algorithms for small
factors. Indeed, Winograd showed that the DFT can be computed with only O(N)
irrational multiplications, leading to a proven achievable lower bound on the number of
multiplications for power-of-two sizes; unfortunately, this comes at the cost of many
more additions, a tradeoff no longer favorable on modern processors with hardware
multipliers. In particular, Winograd also makes use of the PFA as well as an algorithm by
Rader for FFTs of prime sizes.

Rader's algorithm, exploiting the existence of a generator for the multiplicative group
modulo prime N, expresses a DFT of prime size N as a cyclic convolution of (composite)
size N − 1, which can then be computed by a pair of ordinary FFTs via the convolution
theorem (although Winograd uses other convolution methods). Another prime-size FFT
is due to L. I. Bluestein, and is sometimes called the chirp-z algorithm; it also
re-expresses a DFT as a convolution, but this time of the same size (which can be
zero-padded to a power of two and evaluated by radix-2 Cooley-Tukey FFTs, for example),
via the identity nk = −(k − n)^2 / 2 + n^2 / 2 + k^2 / 2.
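
Bluestein's re-expression can be sketched as follows. This is a minimal illustration under stated assumptions: the convolution is zero-padded to the next power of two at least 2N − 1 and evaluated with a plain recursive radix-2 FFT, and the names `_fft_pow2` and `bluestein_dft` are hypothetical.

```python
import cmath

def _fft_pow2(x, sign):
    """Unnormalized radix-2 transform; sign=-1 is forward, sign=+1 is inverse."""
    N = len(x)
    if N == 1:
        return list(x)
    even = _fft_pow2(x[0::2], sign)
    odd = _fft_pow2(x[1::2], sign)
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out

def bluestein_dft(x):
    """DFT of arbitrary length N (including prime N) via a chirp convolution."""
    N = len(x)
    if N == 0:
        return []
    # chirp[m] = exp(+pi*i*m^2/N); m*m is reduced mod 2N to keep the phase accurate.
    chirp = [cmath.exp(1j * cmath.pi * ((m * m) % (2 * N)) / N) for m in range(N)]
    M = 1
    while M < 2 * N - 1:
        M *= 2
    # a_n = x_n * exp(-pi*i*n^2/N), zero-padded to length M.
    a = [x[n] * chirp[n].conjugate() for n in range(N)] + [0j] * (M - N)
    # b holds the chirp at offsets -(N-1)..(N-1), wrapped cyclically.
    b = [0j] * M
    b[0] = chirp[0]
    for m in range(1, N):
        b[m] = b[M - m] = chirp[m]
    A = _fft_pow2(a, -1)
    B = _fft_pow2(b, -1)
    conv = _fft_pow2([p * q for p, q in zip(A, B)], +1)  # M times the cyclic convolution
    return [chirp[k].conjugate() * conv[k] / M for k in range(N)]

X7 = bluestein_dft([1, 2, 3, 4, 5, 6, 7])
```

The three power-of-two transforms dominate the cost, so the whole procedure stays O(N log N) even though 7 is prime.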

FFT algorithms specialized for real and/or symmetric data

In many applications, the input data for the DFT are purely real, in which case the outputs
satisfy the symmetry

$$X_{N-k} = X_k^{*},$$

and efficient FFT algorithms have been designed for this situation (see e.g. Sorensen,
1987). One approach consists of taking an ordinary algorithm (e.g. Cooley-Tukey) and
removing the redundant parts of the computation, saving roughly a factor of two in time
and memory. Alternatively, it is possible to express an even-length real-input DFT as a
complex DFT of half the length (whose real and imaginary parts are the even/odd
elements of the original real data), followed by O(N) post-processing operations.
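
The half-length approach can be sketched as follows. This is an illustration only: it packs even samples into the real part and odd samples into the imaginary part, uses a direct-summation `dft` as a stand-in for any complex FFT, and the function names are assumptions.

```python
import cmath

def dft(x):
    """Complex DFT by direct summation (stand-in for any complex FFT)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_dft_via_half_length(x, complex_dft=dft):
    """DFT of an even-length real sequence using one complex DFT of length N/2."""
    N = len(x)
    if N == 0 or N % 2:
        raise ValueError("length must be even and nonzero")
    M = N // 2
    # Pack: real part = even-indexed samples, imaginary part = odd-indexed samples.
    z = [complex(x[2 * n], x[2 * n + 1]) for n in range(M)]
    Z = complex_dft(z)
    X = [0j] * N
    for k in range(M):
        Zk = Z[k]
        Zc = Z[(M - k) % M].conjugate()
        E = (Zk + Zc) / 2     # DFT of the even-indexed samples
        O = (Zk - Zc) / 2j    # DFT of the odd-indexed samples
        t = cmath.exp(-2j * cmath.pi * k / N) * O
        X[k] = E + t          # O(N) post-processing
        X[k + M] = E - t
    return X

samples = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]
X_real = real_dft_via_half_length(samples)
```

Only about half of the outputs are independent because of the conjugate symmetry, so a practical routine would typically return just the first N/2 + 1 values.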

It was once believed that real-input DFTs could be more efficiently computed by means
of the discrete Hartley transform (DHT), but it was subsequently argued that a specialized
real-input DFT algorithm (FFT) can typically be found that requires fewer operations
than the corresponding DHT algorithm (FHT) for the same number of inputs. Bruun's
algorithm (above) is another method that was initially proposed to take advantage of real
inputs, but it has not proved popular.

There are further FFT specializations for the cases of real data that have even/odd
symmetry, in which case one can gain another factor of (roughly) two in time and
memory and the DFT becomes the discrete cosine/sine transform(s) (DCT/DST). Instead
of directly modifying an FFT algorithm for these cases, DCTs/DSTs can also be
computed via FFTs of real data combined with O(N) pre/post processing.

Bounds on complexity and operation counts


A fundamental question of longstanding theoretical interest is to prove lower bounds on
the complexity and exact operation counts of fast Fourier transforms, and many open
problems remain. It is not even rigorously proved whether DFTs truly require Ω(N log N)
(i.e., order N log N or greater) operations, even for the simple case of power-of-two sizes,
although no algorithms with lower complexity are known. In particular, the count of
arithmetic operations is usually the focus of such questions, although actual performance
on modern-day computers is determined by many other factors such as cache or CPU
pipeline optimization.

Following pioneering work by Winograd (1978), a tight Θ(N) lower bound is known for
the number of real multiplications required by an FFT. It can be shown that only
$4N - 2\log_2^2 N - 2\log_2 N - 4$ irrational real multiplications are required to compute
a DFT of power-of-two length N = 2^m. Moreover, explicit algorithms that achieve this count
are known (Heideman & Burrus, 1986; Duhamel, 1990). Unfortunately, these algorithms
require too many additions to be practical, at least on modern computers with hardware
multipliers.

A tight lower bound is not known on the number of required additions, although lower
bounds have been proved under some restrictive assumptions on the algorithms. In 1973,
Morgenstern proved an Ω(N log N) lower bound on the addition count for algorithms
where the multiplicative constants have bounded magnitudes (which is true for most but
not all FFT algorithms). Pan (1986) proved an Ω(N log N) lower bound assuming a bound
on a measure of the FFT algorithm's "asynchronicity", but the generality of this
assumption is unclear. For the case of power-of-two N, Papadimitriou (1979) argued that
the number N log_2 N of complex-number additions achieved by Cooley-Tukey algorithms
is optimal under certain assumptions on the graph of the algorithm (his assumptions
imply, among other things, that no additive identities in the roots of unity are exploited).
(This argument would imply that at least 2N log_2 N real additions are required, although
this is not a tight bound because extra additions are required as part of complex-number
multiplications.) Thus far, no published FFT algorithm has achieved fewer than N log_2 N
complex-number additions (or their equivalent) for power-of-two N.

A third problem is to minimize the total number of real multiplications and additions,
sometimes called the "arithmetic complexity" (although in this context it is the exact
count and not the asymptotic complexity that is being considered). Again, no tight lower
bound has been proven. Since 1968, however, the lowest published count for
power-of-two N was long achieved by the split-radix FFT algorithm, which requires
4N log_2 N − 6N + 8 real multiplications and additions for N > 1. This was recently
reduced to approximately (34/9) N log_2 N (Johnson and Frigo, 2007; Lundy and Van Buskirk, 2007).

Most of the attempts to lower or prove the complexity of FFT algorithms have focused on
the ordinary complex-data case, because it is the simplest. However, complex-data FFTs
are so closely related to algorithms for related problems such as real-data FFTs, discrete
cosine transforms, discrete Hartley transforms, and so on, that any improvement in one of
these would immediately lead to improvements in the others (Duhamel & Vetterli, 1990).

Accuracy and approximations


All of the FFT algorithms discussed so far compute the DFT exactly (in exact arithmetic,
i.e. neglecting floating-point errors). A few "FFT" algorithms have been proposed,
however, that compute the DFT approximately, with an error that can be made arbitrarily
small at the expense of increased computations. Such algorithms trade the approximation
error for increased speed or other properties. For example, an approximate FFT algorithm
by Edelman et al. (1999) achieves lower communication requirements for parallel
computing with the help of a fast-multipole method. A wavelet-based approximate FFT
by Guo and Burrus (1996) takes sparse inputs/outputs (time/frequency localization) into
account more efficiently than is possible with an exact FFT. Another algorithm for
approximate computation of a subset of the DFT outputs is due to Shentov et al. (1995).
Only the Edelman algorithm works equally well for sparse and non-sparse data, however,
since it is based on the compressibility (rank deficiency) of the Fourier matrix itself rather
than the compressibility (sparsity) of the data.

Even the "exact" FFT algorithms have errors when finite-precision floating-point
arithmetic is used, but these errors are typically quite small; most FFT algorithms, e.g.
Cooley-Tukey, have excellent numerical properties. The upper bound on the relative error
for the Cooley-Tukey algorithm is O(ε log N), compared to O(ε N^(3/2)) for the naïve DFT
formula (Gentleman and Sande, 1966), where ε is the machine floating-point relative
precision. In fact, the root mean square (rms) errors are much better than these upper
bounds, being only O(ε √log N) for Cooley-Tukey and O(ε √N) for the naïve DFT
(Schatzman, 1996). These results, however, are very sensitive to the accuracy of the
twiddle factors used in the FFT (i.e. the trigonometric function values), and it is not
unusual for incautious FFT implementations to have much worse accuracy, e.g. if they
use inaccurate trigonometric recurrence formulas. Some FFTs other than Cooley-Tukey,
such as the Rader-Brenner algorithm, are intrinsically less stable.

In fixed-point arithmetic, the finite-precision errors accumulated by FFT algorithms are
worse, with rms errors growing as O(√N) for the Cooley-Tukey algorithm (Welch, 1969).
Moreover, even achieving this accuracy requires careful attention to scaling in order to
minimize the loss of precision, and fixed-point FFT algorithms involve rescaling at each
intermediate stage of decompositions like Cooley-Tukey.

To verify the correctness of an FFT implementation, rigorous guarantees can be obtained
in O(N log N) time by a simple procedure checking the linearity, impulse-response, and
time-shift properties of the transform on random inputs (Ergün, 1995).
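
A check in this spirit can be sketched as below. It is loosely modeled on the three properties named above and is not Ergün's exact procedure: it compares floating-point results against a tolerance rather than giving a rigorous guarantee. `check_fft`, its parameters, and the reference `dft_direct` are illustrative assumptions.

```python
import cmath
import random

def dft_direct(x):
    """Reference DFT by direct summation, used here only as something to test."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def check_fft(fft, N, trials=3, tol=1e-9, seed=0):
    """Return True if `fft` passes impulse, linearity and time-shift checks."""
    rng = random.Random(seed)

    def rand_c():
        return complex(rng.uniform(-1, 1), rng.uniform(-1, 1))

    def close(u, v):
        return all(abs(p - q) <= tol * N for p, q in zip(u, v))

    # Impulse response: a unit impulse at n = 0 transforms to all ones.
    if not close(fft([1 + 0j] + [0j] * (N - 1)), [1 + 0j] * N):
        return False
    for _ in range(trials):
        x = [rand_c() for _ in range(N)]
        y = [rand_c() for _ in range(N)]
        a, b = rand_c(), rand_c()
        X, Y = fft(x), fft(y)
        # Linearity: F(a*x + b*y) = a*F(x) + b*F(y).
        mixed = fft([a * p + b * q for p, q in zip(x, y)])
        if not close(mixed, [a * P + b * Q for P, Q in zip(X, Y)]):
            return False
        # Time shift: delaying x cyclically by one sample multiplies X_k by exp(-2*pi*i*k/N).
        shifted = fft(x[-1:] + x[:-1])
        if not close(shifted, [X[k] * cmath.exp(-2j * cmath.pi * k / N) for k in range(N)]):
            return False
    return True

ok = check_fft(dft_direct, 8)
```

Each trial costs a constant number of transforms plus O(N) comparisons, so the check has the same asymptotic cost as the FFT being tested.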

Multidimensional FFT algorithms


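The multidimensional DFT

$$X_{\mathbf{k}} = \sum_{\mathbf{n}=0}^{\mathbf{N}-1} e^{-2\pi i\, \mathbf{k} \cdot (\mathbf{n}/\mathbf{N})} x_{\mathbf{n}}$$

transforms an array $x_{\mathbf{n}}$ with a d-dimensional vector of indices
$\mathbf{n} = (n_1, \ldots, n_d)$ by a set of d nested summations (over
$n_j = 0, \ldots, N_j - 1$ for each j), where the division $\mathbf{n}/\mathbf{N}$, defined as
$\mathbf{n}/\mathbf{N} = (n_1/N_1, \ldots, n_d/N_d)$, is performed element-wise.
Equivalently, it is simply the composition of a sequence of d sets of one-dimensional
DFTs, performed along one dimension at a time (in any order).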

This compositional viewpoint immediately provides the simplest and most common
multidimensional DFT algorithm, known as the row-column algorithm (after the
two-dimensional case, below). That is, one simply performs a sequence of d one-dimensional
FFTs (by any of the above algorithms): first transform along the n1 dimension, then
along the n2 dimension, and so on (any ordering will work). This method is
easily shown to have the usual O(N log N) complexity, where $N = N_1 \cdot N_2 \cdots N_d$
is the total number of data points transformed. In particular, there are N / N1 transforms
of size N1, etcetera, so the complexity of the sequence of FFTs is:

$$\frac{N}{N_1} O(N_1 \log N_1) + \cdots + \frac{N}{N_d} O(N_d \log N_d) = O\left(N \left[\log N_1 + \cdots + \log N_d\right]\right) = O(N \log N).$$

In two dimensions, the input array can be viewed as an N1 × N2 matrix, and this algorithm
corresponds to first performing the FFT of all the rows and then of all the columns (or vice
versa), hence the name.
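
The two-dimensional row-column procedure can be sketched in a few lines. The example assumes the matrix is stored as a list of equal-length rows and uses a direct-summation `dft` as a placeholder for any one-dimensional FFT; the function names are illustrative.

```python
import cmath

def dft(x):
    """One-dimensional DFT by direct summation (placeholder for any 1-D FFT)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def dft2_row_column(a, dft1d=dft):
    """2-D DFT of a list of rows: transform every row, then every column."""
    rows = [dft1d(list(row)) for row in a]           # along the second index
    cols = [dft1d(list(col)) for col in zip(*rows)]  # along the first index
    return [list(r) for r in zip(*cols)]             # transpose back to row-major

grid = [[1, 2, 3], [4, 5, 6]]
G = dft2_row_column(grid)
```

Transforming the columns first and the rows second gives the same result, since the two sets of one-dimensional transforms commute.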

In more than two dimensions, it is often advantageous for cache locality to group the
dimensions recursively. For example, a three-dimensional FFT might first perform
two-dimensional FFTs of each planar "slice" for each fixed n1, and then perform the
one-dimensional FFTs along the n1 direction. More generally, an asymptotically optimal
cache-oblivious algorithm consists of recursively dividing the dimensions into two
groups $(n_1, \ldots, n_{d/2})$ and $(n_{d/2+1}, \ldots, n_d)$ that are transformed
recursively (rounding if d is not even) (see Frigo and
Johnson, 2005). Still, this remains a straightforward variation of the row-column
algorithm that ultimately requires only a one-dimensional FFT algorithm as the base case,
and still has O(N log N) complexity. Yet another variation is to perform matrix
transpositions in between transforming subsequent dimensions, so that the transforms
operate on contiguous data; this is especially important for out-of-core and distributed
memory situations where accessing non-contiguous data is extremely time-consuming.

There are other multidimensional FFT algorithms that are distinct from the row-column
algorithm, although all of them have O(N log N) complexity. Perhaps the simplest
non-row-column FFT is the vector-radix FFT algorithm, which is a generalization of the
ordinary Cooley-Tukey algorithm where one divides the transform dimensions by a
vector of radices at each step. (This may also have cache benefits.) The simplest case of
vector-radix is where all of the radices are equal (e.g. vector-radix-2 divides all of the
dimensions by two), but this is not necessary. Vector radix with only a single non-unit
radix at a time, i.e. $\mathbf{r} = (1, \ldots, 1, r, 1, \ldots, 1)$, is essentially a
row-column algorithm. Other, more complicated,
methods include polynomial transform algorithms due to Nussbaumer (1977), which
view the transform in terms of convolutions and polynomial products. See Duhamel and
Vetterli (1990) for more information and references.

• The FFT
Algorithms for computing the DFT which are more
computationally efficient than the direct method are called Fast
Fourier Transforms.
Generally, we use FFT to refer to algorithms which work by
breaking the DFT of a long sequence into smaller and smaller
chunks.

GOERTZEL ALGORITHM
[Figure: flow graph for the Goertzel algorithm]

The Goertzel algorithm helps by about a factor of 2. (Unless we
need only certain X[k]s; then it can help a lot.)
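
A minimal sketch of the Goertzel recurrence for a single output X[k] is given below. It assumes the standard second-order form s[n] = x[n] + 2cos(w) s[n-1] - s[n-2] with w = 2πk/N; the function name `goertzel` is illustrative.

```python
import cmath
import math

def goertzel(x, k):
    """Compute the single DFT output X[k] with Goertzel's recurrence."""
    N = len(x)
    w = 2 * math.pi * k / N
    coeff = 2 * math.cos(w)      # the only multiplier used inside the loop
    s1 = s2 = 0.0                # s[n-1] and s[n-2]
    for sample in x:
        s0 = sample + coeff * s1 - s2
        s2, s1 = s1, s0
    # One complex multiplication at the end recovers X[k] from the last two states.
    return cmath.exp(1j * w) * s1 - s2

bin1 = goertzel([1, 2, 3, 4], 1)
```

For real input the loop uses one real multiplication per sample, which is where the saving over the direct sum comes from, and computing only a few bins costs O(N) per bin.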
We can reduce the time complexity from O(N^2) (compute time
proportional to N^2) to O(N log N) (proportional to N times log N),
if we use FFTs. There are two main methods:
Decimation-in-time
Decimation-in-frequency

DECIMATION IN TIME
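
For reference, the standard decimation-in-time split separates the even- and odd-indexed samples. The notation W_N = e^{-2\pi i/N}, E_k and O_k is assumed here for illustration and is not taken from the slides.

```latex
\begin{aligned}
X_k &= \sum_{n=0}^{N-1} x_n W_N^{nk}
     = \sum_{m=0}^{N/2-1} x_{2m} W_{N/2}^{mk}
       + W_N^{k} \sum_{m=0}^{N/2-1} x_{2m+1} W_{N/2}^{mk}
     = E_k + W_N^{k} O_k, \\
X_{k+N/2} &= E_k - W_N^{k} O_k, \qquad k = 0, \ldots, N/2 - 1.
\end{aligned}
```

Here E_k and O_k are the half-size DFTs of the even and odd samples, and the second line follows from their period of N/2 together with W_N^{k+N/2} = -W_N^{k}.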
Resulting Improvement
Original direct DFT:
2N^2 steps (N^2 multiplications, N^2 additions)
New way:
N/2 DFT + N/2 DFT + N mult + N add.
= 2(N/2)^2 + 2(N/2)^2 + N + N
= N^2 + 2N steps
(N^2/2 + N multiplications, N^2/2 + N additions)
This is faster if 2N^2 > N^2 + 2N, i.e. N^2 > 2N
True as long as N > 2

Basic Underlying Idea:


Use 2 half-size DFTs instead of 1 full-size DFT.
Continuing the idea: We could split the half-size DFTs in half again, and
keep splitting the pieces in half until the number of points left in each block
is down to 2. (Then we just do those directly.)
Side note: We could just as easily split into 3 pieces instead of 2. The math
works out to break a DFT into any set of equal-size pieces.

[Figure panels: 8-point DIT FFT; 2-point DFT; resulting 8-point FFT]

As long as N can be broken down into a bunch of small factors, we can use
this method.
Usually we use powers of 2 (N = 2^x), so that 2 is the biggest prime factor. This gives the
most improvement.
The time complexity becomes proportional to how many times we have to
subdivide N before we get
to the smallest block, i.e. the number of FFT stages.
For powers of 2, this is log2(N).
Overall time becomes O(N log2 N) instead of O(N^2).

Significance
Change from N^2 to N log2 N is big:
N          N^2            N log2 N
64         4,096          384
256        65,536         2,048
1024       1,048,576      10,240
4096       16,777,216     49,152
16384      268,435,456    229,376
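
The figures above can be reproduced with a few lines. This is a small sketch that assumes power-of-two sizes, so that log2(N) is an exact integer; `operation_counts` is an illustrative name.

```python
def operation_counts(sizes=(64, 256, 1024, 4096, 16384)):
    """Return (N, N^2, N*log2(N)) for each power-of-two size."""
    rows = []
    for N in sizes:
        stages = N.bit_length() - 1   # exact log2(N) when N is a power of two
        rows.append((N, N * N, N * stages))
    return rows

for N, direct, fast in operation_counts():
    print(f"N={N:>5}: {direct:>11,} -> {fast:>7,}")
```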
But we can still get a little more improvement through symmetry.
The Butterfly
If we use N = 2^x, then all the FFT operations are 2-input 2-output blocks,
similar to 2-point DFTs.
This building block is called a butterfly.
We can re-draw the butterfly operation, reducing each block to a single
multiplication (2x improvement).

[Figure: resulting 8-point FFT]


In-place computation
Each stage of the FFT process has N inputs and N outputs, so we need
exactly N storage locations at any one point in the calculations.
It is possible to re-use the same storage locations at each stage to reduce
memory overhead.
Any algorithm which uses the same memory to store successive iterations of a calculation
is called an in-place algorithm.
Computation must be done in a specific order.
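
An in-place radix-2 FFT can be sketched as follows. This version assumes a decimation-in-time ordering: the input is first permuted into bit-reversed order, after which each stage overwrites pairs of entries with butterfly outputs. The name `fft_in_place` is illustrative.

```python
import cmath

def fft_in_place(a):
    """Overwrite list `a` (power-of-two length) with its DFT using N storage locations."""
    N = len(a)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    # Bit-reversal permutation: swap a[i] with a[bit-reversed i].
    j = 0
    for i in range(1, N):
        bit = N >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(N) stages of butterflies on blocks of size 2, 4, ..., N.
    size = 2
    while size <= N:
        half = size // 2
        twiddles = [cmath.exp(-2j * cmath.pi * k / size) for k in range(half)]
        for start in range(0, N, size):
            for k in range(half):
                u = a[start + k]
                t = twiddles[k] * a[start + k + half]
                a[start + k] = u + t          # butterfly outputs replace
                a[start + k + half] = u - t   # the two inputs they came from
        size *= 2
    return a

data = [complex(v) for v in range(1, 9)]
fft_in_place(data)
```

Each butterfly reads two locations and writes back to the same two, which is why no second buffer is needed.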

DECIMATION IN FREQUENCY
Again, we can compute the DFT using 2 half-size DFTs, but this time we
combine x[n] terms first, then do the DFT.
We still
Have a sequence of butterfly elements
Can use symmetry to reduce computation
Can do in-place computation
[Figures: butterfly element; resulting 8-point FFT]
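
The decimation-in-frequency idea can be sketched recursively. This is a minimal illustration that assumes a power-of-two length: the two halves of the input are combined first, and the half-size DFTs then yield the even- and odd-indexed outputs. `fft_dif` is an illustrative name.

```python
import cmath

def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT for power-of-two lengths."""
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return list(x)
    half = N // 2
    # Combine x[n] terms first: sums feed the even outputs, twiddled differences the odd ones.
    sums = [x[n] + x[n + half] for n in range(half)]
    diffs = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
             for n in range(half)]
    out = [0j] * N
    out[0::2] = fft_dif(sums)    # X[0], X[2], X[4], ...
    out[1::2] = fft_dif(diffs)   # X[1], X[3], X[5], ...
    return out

dif_spectrum = fft_dif([1, 2, 3, 4, 5, 6, 7, 8])
```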
