Henri J. Nussbaumer
Fast Fourier Transform and Convolution Algorithms
Springer Series in Information Sciences, Volume 2
Editor: T. S. Huang
Series Editors: King Sun Fu, Thomas S. Huang, Manfred R. Schroeder
With 34 Figures
Library of Congress Cataloging in Publication Data. Nussbaumer, Henri J. 1931- Fast Fourier transform and convolution algorithms. (Springer series in information sciences; v. 2). Bibliography: p. Includes index. 1. Fourier transformations - Data processing. 2. Convolutions (Mathematics) - Data processing. 3. Digital filters (Mathematics). I. Title. II. Series. QA403.5.N87 515.7'23 80-18096
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically those of translation, reprinting, reuse of illustrations, broadcasting, reproduction by photocopying machine or similar means, and storage in data banks. Under § 54 of the German Copyright Law where copies are made for other than private use, a fee is payable to "Verwertungsgesellschaft Wort", Munich.
Preface

This book presents in a unified way the various fast algorithms that are used
for the implementation of digital filters and the evaluation of discrete Fourier
transforms.
The book consists of eight chapters. The first two chapters are devoted to
background information and to introductory material on number theory and
polynomial algebra. This section is limited to the basic concepts as they apply
to other parts of the book. Thus, we have restricted our discussion of number
theory to congruences, primitive roots, quadratic residues, and to the
properties of Mersenne and Fermat numbers. The section on polynomial
algebra deals primarily with the divisibility and congruence properties of
polynomials and with algebraic computational complexity.
The rest of the book is focused directly on fast digital filtering and
discrete Fourier transform algorithms. We have attempted to present these
techniques in a unified way by using polynomial algebra as extensively as
possible. This objective has led us to reformulate many of the algorithms which
are discussed in the book. It has been our experience that such a presentation
serves to clarify the relationship between the algorithms and often provides
clues to improved computation techniques.
Chapter 3 reviews the fast digital filtering algorithms, with emphasis on
algebraic methods and on the evaluation of one-dimensional circular
convolutions.
Chapters 4 and 5 present the fast Fourier transform and the Winograd
Fourier transform algorithm.
We introduce in Chaps. 6 and 7 the concept of polynomial transforms and
we show that these transforms are an important tool for the understanding of
the structure of multidimensional convolutions and discrete Fourier trans-
forms and for the design of improved algorithms. In Chap. 8, we extend these
concepts to the computation of one-dimensional convolutions by replacing
finite fields of polynomials by finite fields of numbers. This facilitates the intro-
duction of number theoretic transforms which are useful for the fast com-
putation of convolutions via modular arithmetic.
Convolutions and discrete Fourier transforms have many uses in physics
and it is our hope that this book will prompt some additional research in
these areas and will help potential users to evaluate and apply these techniques.
We also feel that some of the methods presented here are quite general and
might someday find new unexpected applications.
Contents
1. Introduction
   1.1 Introductory Remarks
   1.2 Notations
   1.3 The Structure of the Book
References
Subject Index
1. Introduction
other than convolution and DFT. It is likely, for instance, that polynomial
transforms will appear as a very general tool for mapping multidimensional
problems into one-dimensional problems.
The matter of comparing different algorithms which perform the same func-
tions is pervasive throughout this book. In many cases, we have used the number
of arithmetic operations required to execute an algorithm as a measure of the
computational complexity. While there is some rough relationship between the
overall complexity of an algorithm and its algebraic complexity, the practical
value of a computation method depends upon a number of factors. Apart from
the number of arithmetic operations, the efficiency of an algorithm is related to
many parameters such as the number of data moves, the cost of ancillary operations, the overall structural complexity, the performance capabilities of the computer on which the algorithm is executed, and the skill of the programmer. Therefore, ranking different algorithms as a function of actual efficiency expressed in terms of computer execution times is a difficult art, so that comparisons based on the number of arithmetic operations must be weighted as a function of the particular implementation.
1.2 Notations
For transforms, we use the notation X_k which, for a DFT, has the form

X_k = Σ_{n=0}^{N-1} x_n W^{nk}.   (1.2)
We have also sometimes adopted Rader's notation (x)_p for the residue of x modulo p.
2. Elements of Number Theory and Polynomial Algebra
Many new digital signal processing algorithms are derived from elementary
number theory or polynomial algebra, and some knowledge of these topics is
necessary to understand these algorithms and to use them in practical applica-
tions.
This chapter introduces the necessary background required to understand
these algorithms in a simple, intuitive way, with the intent of familiarizing
engineers with the mathematical principles that are most frequently used in this
book. We have made here no attempt to give a complete rigorous mathematical
treatment but rather to provide, as concisely as possible, some mathematical tools
with the hope that this will prompt some readers to study further, with some
of the many excellent books that have been published on the subject [2.1-4].
The material covered in this chapter is divided into two main parts: ele-
mentary number theory and polynomial algebra. In elementary number theory,
the most important topics for digital signal processing applications are the
Chinese remainder theorem and primitive roots. The Chinese remainder the-
orem, which yields an unusual number representation, is used for number
theoretic transforms (NIT) and for index manipulations which serve to map
one-dimensional problems into multidimensional problems. The primitive roots
playa key role in the definition of NITs and are also used to convert discrete
Fourier transforms (OFT) into convolutions, which is an important step in the
development of the Winograd Fourier transform algorithm.
In the polynomial algebra section, we introduce briefly the concepts of rings
and fields that are pervasive throughout this book. We show how polynomial
algebra relates to familiar signal processing operations such as convolution and
correlation. We introduce the Chinese remainder theorem for polynomials and
we present some complexity theory results which apply to convolutions and
correlations.
a = bq + r,   0 ≤ r < b,   (2.1)
where q is called the quotient and r is called the remainder. When r = 0, b and q are factors or divisors of a, and b is said to divide a, this operation being denoted by b | a. When a has no divisors other than 1 and a, a is a prime. In all other cases, a is composite.
When a is composite, it can always be factorized into a product of powers of prime numbers p_i^{c_i}, where c_i is a positive integer, with

a = Π_i p_i^{c_i}.   (2.2)
When d = (a, b) = 1, a and b have no common factors other than 1 and they are said to be mutually prime or relatively prime.
The GCD can be found easily by a division algorithm known as the Euclidean
algorithm. In discussing this algorithm, we shall assume that a and b are positive
integers. This is done without loss of generality, since (a, b) = (- a, b) = (a,
-b) = (-a, -b). Dividing a by b yields
a = bq_1 + r_1.   (2.4)

Repeating the division on each successive divisor and remainder gives

b = r_1 q_2 + r_2,
r_1 = r_2 q_3 + r_3,
...
r_{k-2} = r_{k-1} q_k + r_k,
r_{k-1} = r_k q_{k+1}.   (2.5)
Since r_1 > r_2 > r_3 ..., the last remainder is zero. Thus, by the last equation, r_k | r_{k-1}. The preceding equation implies that r_k | r_{k-2}, since r_k | r_{k-1}. Finally, we obtain r_k | b and r_k | a. Hence, r_k is a divisor of a and b. Suppose now that c is any divisor of a and b. By (2.4), c also divides r_1. Then, (2.5) implies that c divides r_2, r_3, ..., r_k. Thus, any divisor c of a and b divides r_k and therefore c ≤ r_k. Hence, r_k is the GCD of a and b.
An important consequence of Euclid's algorithm is that the GCD of two
integers a and b is a linear combination of a and b. This can be seen by rewriting
(2.4) and (2.5) as
r_1 = a - bq_1
r_2 = b - r_1 q_2
...
r_k = r_{k-2} - r_{k-1} q_k.   (2.6)

The first equation shows that r_1 is a linear combination of a and b. The second equation shows that r_2 is a linear combination of b and r_1 and therefore of both a and b. Finally, the last equation implies that r_k is a linear combination of a and b. Since r_k = (a, b), we have

d = (a, b) = ma + nb,   (2.7)
where m and n are integers. When a and b are mutually prime, (2.7) reduces
to Bezout's relation
1 = ma + nb. (2.8)
We now change our point of view by considering a linear equation with integer
coefficients a, b, and c
ax + by = c (2.9)
where x and y are a pair of integers which are the solution of this Diophantine equation. Such an equation has a solution if and only if (a, b) | c. To demonstrate this point, we note the following. It is obvious from (2.9) that for a = 0 or b = 0, we must have b | c or a | c.
For a ≠ 0, b ≠ 0, it is apparent that if (2.9) holds for integers x and y, then d = (a, b) is such that d | c. Conversely, if d | c, c = c_1 d and (2.7) implies the existence of two integers m and n such that d = ma + nb. Hence c = c_1 d = c_1 ma + c_1 nb, and the solutions of the Diophantine equation are given by x = c_1 m, y = c_1 n. Thus, for (a, b) | c, the solution of the Diophantine equation
is given by the Euclidean algorithm. The solution of the Diophantine equation
is not unique, however. This can be seen by considering a particular solution
c = ax_0 + by_0. Assuming x, y is another solution, we have

a(x - x_0) + b(y - y_0) = 0   (2.10)

and, dividing by d = (a, b),

(a/d)(x - x_0) = -(b/d)(y - y_0).   (2.11)
Since [(a/d), (b/d)] = 1, this implies that (b/d) | (x - x_0) and x = x_0 + (b/d)s, where s is an integer. Substituting into (2.11), we obtain

y = y_0 - (a/d)s
x = x_0 + (b/d)s.   (2.12)
This defines a class of linearly related solutions for (2.9) which depend upon the
integer s.
As a numerical example, consider the equation
15x + 9y = 21.
We first use Euclid's algorithm to determine the GCD d with a = 15 and
b = 9,
15 = 9·1 + 6
9 = 6·1 + 3
6 = 3·2.

Thus, d = 3 and, rewriting these equations as in (2.6),

6 = 15 - 9·1
3 = 9 - 6·1 = -15 + 2·9.
Thus, m = -1 and n = 2. Dividing c = 21 by d = 3 yields c_1 = 7. This gives a particular solution x_0 = -7, y_0 = 14. If we divide a = 15 and b = 9 by d = 3, we obtain (a/d) = 5 and (b/d) = 3. Hence, the general solution to the
Diophantine equation becomes
y = 14 - 5s
x = - 7 + 3s,
where s is any integer.
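The calculation above is easy to reproduce programmatically. The following short Python sketch is ours, not the book's; it implements the extended form of Euclid's algorithm and solves 15x + 9y = 21 as in the example.

# A minimal sketch (not from the book) of Euclid's algorithm in its extended form (2.6, 7)
# and its use on the Diophantine equation of the example.
def extended_gcd(a, b):
    """Return (d, m, n) with d = (a, b) and d = m*a + n*b."""
    if b == 0:
        return a, 1, 0
    d, m, n = extended_gcd(b, a % b)
    return d, n, m - (a // b) * n

a, b, c = 15, 9, 21
d, m, n = extended_gcd(a, b)          # d = 3, m = -1, n = 2
assert c % d == 0                     # solvable only if (a, b) | c
c1 = c // d
x0, y0 = c1 * m, c1 * n               # particular solution x0 = -7, y0 = 14
for s in range(-3, 4):                # the general solution (2.12)
    x, y = x0 + (b // d) * s, y0 - (a // d) * s
    assert a * x + b * y == c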
a_1 ≡ a_2 modulo b.   (2.13)
(2.14)
Underlying the concept of congruence is the fact that, in many physical prob-
lems, one is primarily interested in measuring relative values within a given
range. This is apparent, for instance, when measuring angles. In this case, the
angles are defined from 0 to 359° and two angles that differ by a multiple of 360° are considered to be equal. Hence angles are defined modulo 360.
Thus, in congruences, we are interested only in the remainder r of the division
of a by b. This remainder is usually called the residue and is denoted by
r ≡ a modulo b.   (2.15)
(2.16)
where the subscript is omitted when there is no ambiguity on the nature of the
modulus b.
It follows directly from the definition of residues given by (2.14) that addi-
tions and multiplications can be performed directly on residues
Consider now the linear congruence

ax ≡ c modulo b.   (2.18)

By (2.9) and (2.12), when d = (a, b) divides c, the solutions of (2.18) are given by

x ≡ x_0 + (b/d)s modulo b,   (2.19)

where x_0 is a particular solution and s can be any integer smaller than b. However, there are only d distinct solutions, since (b/d)s has only d distinct values modulo b. An important consequence of this point is that the linear congruence ax ≡ c modulo b always has a unique solution when (a, b) = 1. Thus, when (a, b) | c, the linear congruence ax ≡ c modulo b can be solved and Euclid's algorithm provides a method for computing the values of x which satisfy this relation. We shall see later that Euler's theorem gives a more elegant solution to the linear congruence (2.18) when (a, b) = 1.
We consider now the problem of solving a set of simultaneous linear
congruences with different moduli. Changing our notation, we want to find the
integer x which satisfies simultaneously the k linear congruences
The solution to this problem plays a major role in many signal processing algori-
thms and is given by the Chinese remainder theorem
Theorem 2.1: Let m_i be k positive integers greater than 1 and relatively prime in pairs. The set of linear congruences x ≡ r_i modulo m_i has a unique solution modulo M, with M = Π_{i=1}^{k} m_i, given by

x ≡ Σ_{i=1}^{k} (M/m_i) T_i r_i modulo M   (2.21)

(M/m_i) T_i ≡ 1 modulo m_i.   (2.22)
Equation (2.22) defines k linear congruences. Since the m_i are mutually prime, [m_i, (M/m_i)] = 1 and each of these congruences has a unique solution T_i, which can be computed by Euclid's algorithm or Euler's theorem (theorem 2.3). Let us now reduce x in (2.21) modulo m_j, one of the moduli m_i. Except for M/m_j, all the expressions M/m_i contain m_j as a factor and are therefore equal to zero modulo m_j. Hence, (2.21) reduces to

x ≡ (M/m_j) T_j r_j modulo m_j   (2.23)

and, since (2.22) implies that (M/m_j)T_j ≡ 1 modulo m_j, (2.23) becomes

x ≡ r_j modulo m_j.   (2.24)

It is seen easily that this operation can be repeated for all moduli m_i and therefore that (2.21) is the solution of the k linear congruences x ≡ r_i modulo m_i.
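The reconstruction (2.21, 22) is straightforward to program. The sketch below is a hypothetical Python illustration, not from the book; it relies on Python's built-in modular inverse.

# A small sketch (not from the book) of the Chinese remainder reconstruction.
from math import prod

def chinese_remainder(residues, moduli):
    """Solve x = r_i modulo m_i for pairwise relatively prime m_i."""
    M = prod(moduli)
    x = 0
    for r_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        T_i = pow(M_i, -1, m_i)      # inverse of M/m_i modulo m_i, i.e. (2.22)
        x += M_i * T_i * r_i         # accumulate the terms of (2.21)
    return x % M

# example: x = 1 mod 2, x = 2 mod 3, x = 3 mod 5  ->  x = 23 modulo 30
assert chinese_remainder([1, 2, 3], [2, 3, 5]) == 23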
As a simple application of the Chinese remainder theorem, let us find the
solution to the simultaneous congruences
(2.25)
In this system, the addition or the multiplication of two integers a and b is done by adding or multiplying separately their various residues a_i and b_i, without any carry from one residue to another. Thus, if M = Π_i m_i is chosen to be the product of many small relatively prime moduli m_i, the computation can accom-
modate large numbers although actual calculations are performed on a large set
of small residues, without carry propagation. Hence, residue number systems
are quite effective for high-speed multiplications and additions. Unfortunately,
this advantage is usually offset by many practical difficulties related to the cost
of translating from conventional number systems to RNS, the lack of a division
operation, and the increased word length required for unambiguous operation
in a modular system. Because of these limitations, the RNS is rarely used. We
shall see, however, in Chap. 8 that modular arithmetic and the Chinese re-
mainder theorem play an important role in the definition of number theoretic
transforms and may have significant applications in these areas.
The Chinese remainder theorem is also often used to map an M-point one-
dimensional data sequence Xn into a k-dimensional data array. This is done by
noting that if n is defined modulo M, with n = 0, ... , M - 1, we can redefine n
by the Chinese remainder theorem as
n ≡ Σ_{i=1}^{k} (M/m_i) n_i T_i modulo M,   (2.26)

where the index n_i along dimension i takes the values 0, ..., m_i - 1. This mapping, which is possible only when M is the product of relatively prime factors m_i, is very important for the computation of discrete Fourier transforms and
convolutions, as will be seen in the following chapters.
We now introduce the concept of permutation. Let us consider again the set of M integers n, with n = 0, ..., M - 1. If we multiply modulo M each element n_i of n by an integer a, we obtain a set of M numbers b_i defined by

b_i ≡ a n_i modulo M.   (2.27)

The n_i are all distinct. We would like the b_i to be also all distinct in such a way that when the n_i span the M values 0, ..., M - 1, the b_i span the same values, although in a different order.
Each equation (2.27) is a linear congruence and we already know from (2.19) that the solution of this congruence is unique if (a, M) = 1. Let us assume that (a, M) = 1 and consider two distinct values n_i and n_j pertaining to the set of the M integers n. Since (a, M) = 1, the linear congruence (2.27) defines two integers b_i and b_j corresponding, respectively, to n_i and n_j. Subtracting b_j from b_i yields

b_i - b_j ≡ a(n_i - n_j) modulo M.   (2.28)

If b_i = b_j, this implies that a(n_i - n_j) ≡ 0 modulo M. This is impossible because a is relatively prime with M and n_i - n_j < M. Thus, for (a, M) = 1, the permutation defined by (2.27) maps all possible values of n.
As an example, consider the permutation defined by b ≡ 5n modulo 6. 5 and 6 are mutually prime. When n takes successively the values 0, 1, 2, 3, 4, 5, the integers b take the corresponding values 0, 5, 4, 3, 2, 1.
We shall see in the following chapters that permutations are often used in
signal processing to reorder a set of data samples. At this point, we return to
the one-dimensional to multidimensional mapping using the Chinese remainder
theorem to show that this method can be simplified by permutation. When M
is the product of two mutually prime factors m_1 and m_2, (2.26) becomes

n ≡ m_2 n_1 T_1 + m_1 n_2 T_2 modulo M.   (2.29)

Since T_1 and T_2 are mutually prime with m_1 and m_2, respectively, m_2 n_1 T_1 and m_1 n_2 T_2 can be viewed as the two permutations n_1 T_1 modulo m_1 and n_2 T_2 modulo m_2 of two sets of m_1 points and m_2 points, respectively. Hence the mapping defined by (2.29) can be replaced by the simpler mapping

n ≡ m_2 n_1 + m_1 n_2 modulo M.   (2.30)

The advantage of (2.30) over (2.26) is that the computation of the inverses T_i is no longer required.
As an example, consider M = 6, with m_1 = 2 and m_2 = 3. The sequence n is given by {0, 1, 2, 3, 4, 5}. Since T_1 = 1 and T_2 = 2, (2.26) yields n ≡ 3n_1 + 4n_2 while (2.30) gives n ≡ 3n_1 + 2n_2. When the pair n_1, n_2 takes successively the values {(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)}, the sequence n becomes {0, 3, 4, 1, 2, 5} for the first equation and {0, 3, 2, 5, 4, 1} for the second equation. Thus, both approaches span the complete set of values of n, although in a different order.
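This small example is easy to verify by machine. The following Python fragment is ours, not the book's; it reproduces both orderings.

# A small sketch checking the mappings (2.26) and (2.30) for M = 6.
M, m1, m2 = 6, 2, 3
T1, T2 = 1, 2            # (M/m1)*T1 = 3*1 = 1 mod 2,  (M/m2)*T2 = 2*2 = 1 mod 3

crt_map    = [((M // m1) * n1 * T1 + (M // m2) * n2 * T2) % M for n2 in range(m2) for n1 in range(m1)]
simple_map = [(m2 * n1 + m1 * n2) % M                         for n2 in range(m2) for n1 in range(m1)]

print(crt_map)     # [0, 3, 4, 1, 2, 5]  -- matches n = 3*n1 + 4*n2 modulo 6
print(simple_map)  # [0, 3, 2, 5, 4, 1]  -- matches n = 3*n1 + 2*n2 modulo 6
assert sorted(crt_map) == sorted(simple_map) == list(range(M))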
We have seen that defining integers modulo an integer m partitions these integers into m equivalence classes. Among these classes, those corresponding to integers which are relatively prime to m play a particularly important role and we shall often need to know how many integers are smaller than m and relatively prime to m. This quantity is usually denoted by φ(m) and called Euler's totient function.
We may observe that φ(1) = 1 since (1, 1) = 1. When m is a prime, with m = p, all integers smaller than p are relatively prime to p. Thus,
φ(p) = p - 1.   (2.32)
If m = p^c, the only numbers less than m and not prime to p are the multiples of p. Therefore,

φ(p^c) = p^c - p^{c-1} = p^c (1 - 1/p).   (2.33)
In order to find φ(m) for any integer m, we first establish that Euler's totient function is multiplicative.
Theorem 2.2: If a and b are two mutually prime integers, φ(a·b) = φ(a)φ(b).
The theorem is proved by considering all integers u smaller than a·b and defined by u = aq + r, r = 0, 1, ..., a - 1 and q = 0, 1, ..., b - 1. It is seen that u is relatively prime to a if r is one of the φ(a) integers r_1 relatively prime to a. Thus, the bφ(a) integers u_1 given by (r_1), (a + r_1), (2a + r_1), ..., [(b - 1)a + r_1] are prime to a. If q is chosen among the φ(b) integers smaller than b and mutually prime with b, the corresponding integers u_1 are relatively prime to b, since no factor of b can divide a or q. Thus, there are φ(a)φ(b) integers relatively prime to a and b and therefore relatively prime to a·b.
An immediate corollary of theorem 2.2 is that, if an integer N is given by its prime factorization N = p_1^{c_1} p_2^{c_2} ... p_k^{c_k}, then φ(N) becomes

φ(N) = N Π_{i=1}^{k} (1 - 1/p_i).   (2.34)
Another important property of Euler's totient function is

Σ_{d|N} φ(d) = N.   (2.35)

This property follows from the fact that N/d is a divisor of N when d | N. Thus,

N = Σ_{d|N} φ(N/d).   (2.37)
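A short Python sketch (not part of the text) of the factorization formula (2.34), together with a numerical check of property (2.35):

# A sketch of Euler's totient via (2.34), plus a check of the divisor-sum property (2.35).
def totient(n):
    phi, m, p = n, n, 2
    while p * p <= m:
        if m % p == 0:
            while m % p == 0:
                m //= p
            phi -= phi // p          # multiply phi by (1 - 1/p)
        p += 1
    if m > 1:
        phi -= phi // m
    return phi

N = 36
assert totient(N) == 12              # 36 * (1 - 1/2) * (1 - 1/3) = 12
assert sum(totient(d) for d in range(1, N + 1) if N % d == 0) == N   # property (2.35)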
x_n ≡ a^n modulo m.   (2.38)

Since there is only a finite number of distinct residues modulo m, there exist two integers r and i, with r > i, such that x_r ≡ x_i. Then, by definition, x_{r+1} ≡ a x_r and x_{i+1} ≡ a x_i. Thus, x_{r+1} ≡ x_{i+1} and, for any n ≥ i, x_{n+(r-i)} ≡ x_n. This means that, for n ≥ i, the sequence x_n repeats itself cyclically with a period of r - i elements. When i = 0, the sequence repeats itself from the beginning (a^0 = 1) and the cyclic group defined by (2.40) contains all the possible values of x_n corresponding to a and m. The conditions for this important case are given by
Euler's theorem.
Theorem 2.3: If (a, m) = 1, then

a^{φ(m)} ≡ 1 modulo m.   (2.41)

The theorem is proved by considering the φ(m) integers n which are smaller than m and relatively prime to m. Since (a, m) = 1, the integers an modulo m are a permutation of these integers n. Thus, taking the product over all such n,

a^{φ(m)} Π n ≡ Π n modulo m.   (2.42)

We can cancel the product of n on both sides of the congruence because the various integers n are relatively prime to m. Thus, (2.42) yields (2.41) and the proof is completed. When m is a prime, with m = p, then (a, p) = 1 if p does not divide a and we have φ(p) = p - 1. In this case, Euler's theorem reduces to Fermat's theorem.
Theorem 2.4: If p is a prime, then, for every integer a not divisible by p,

a^{p-1} ≡ 1 modulo p   (2.43)

or

a^p ≡ a modulo p.   (2.44)
Theorems 2.3 and 2.4 give a simple alternative to Euclid's algorithm for
solving linear congruences when (a, m) = 1. Consider the congruence
ax ≡ c modulo m.   (2.45)

Multiplying both sides by a^{φ(m)-1} and using (2.41), we obtain the solution

x ≡ c a^{φ(m)-1} modulo m.   (2.46)
Theorem 2.5: If g is a root of order r modulo m, the r integers g^0, g^1, ..., g^{r-1} are incongruent modulo m.
This theorem is proved by assuming that g^{r_1} ≡ g^{r_2} for two distinct values r_1 and r_2 such that r_2 < r_1 < r. If this were the case, we would have

g^{r_1 - r_2} ≡ 1 modulo m.

This is impossible since r_1 - r_2 < r and r is, by definition, the smallest integer such that g^r ≡ 1.
Theorem 2.6: If (g, m) = 1 and g^b ≡ 1 modulo m, the order r of the integer g must divide b.
If g^b ≡ 1 modulo m, m | (g^b - 1). Let us assume that g is of order r. This means that r is the smallest integer such that g^r ≡ 1 modulo m. Since m | (g^r - 1), m | (g^d - 1), where d = (r, b). Since d is the GCD of r and b, d ≤ r. However, d cannot be less than r, which is by definition the smallest integer such that g^r ≡ 1. Thus, d = r. Since d | b, r | b.
Theorem 2.7: If (g, m) = 1, the order r of the integer g must divide φ(m).
This theorem follows directly from theorem 2.6 and Euler's theorem, since g^{φ(m)} ≡ 1 modulo m if (g, m) = 1.
It can also be shown that primitive roots exist only for m = p^c or m = 2p^c, with p an odd prime. When p = 2, primitive roots exist only for m = 2 and m = 4. When m = p, with p an odd prime, the following theorem, first introduced by Gauss, specifies the number of roots of a given order.
Theorem 2.8: If r | p - 1, with p an odd prime, there are φ(r) incongruent integers which have order r modulo p.
Suppose that g has order r modulo p. Then, by theorem 2.5, the r integers g^0, g^1, ..., g^{r-1} are incongruent modulo p and satisfy the equation x^r ≡ 1 modulo p. Thus, the sequence g^n modulo p is periodic and n is defined modulo r, with n = 0, 1, ..., r - 1. Then, for (b, r) = 1, g^{bn} modulo p is a simple permutation of the sequence g^n modulo p. When (b, r) ≠ 1, the sequence b·n modulo r will contain repetitions and therefore the corresponding integers g^b will be of order less than r. Thus, we have either zero or φ(r) incongruent roots of order r. Since all integers in the set 1, ..., p - 1 have some order, the total number of roots, for all divisors of p - 1, is equal to p - 1. We note, with (2.35), that

Σ_{r | p-1} φ(r) = p - 1.   (2.49)

Thus, there are φ(r) roots of order r for each divisor r of p - 1 and this completes the proof of the theorem.
The theory of primitive roots is quite complex and a complete treatment can be found in [2.1-3]. In practice, primitive roots modulo primes less than 10000 are given in [2.5]. When m = p^c or m = 2p^c, for p an odd prime, the primitive roots are of order r, with r = φ(m). Thus, an integer g will be a primitive root if g^n ≢ 1 modulo m for n < r. Moreover, if r = q_1^{c_1} q_2^{c_2} ... q_s^{c_s} is the prime factorization of r, any integer which is not a primitive root will be a root of order r_i smaller than r, where r_i is a factor of r. Thus, a primitive root g satisfies the condition

g^{r/q_i} ≢ 1 modulo m,   i = 1, ..., s.   (2.50)

The use of (2.50) greatly simplifies the search for primitive roots, as can be seen with the following example corresponding to m = 41. Since 41 is a prime, r = φ(41) = 40 = 5·2^3. A straightforward approach to checking whether an integer x is a primitive root would be to compute x^n modulo 41 for n = 1, 2, ..., 39 and to check that x^n ≢ 1 for all these values of n. When m is large, this method becomes rapidly impracticable and it is much simpler to check that x^n ≢ 1 modulo m only for the values of n given by (2.50). In our case, we note that 2^20 ≡ 1, 3^8 ≡ 1, 4^20 ≡ 1, and 5^20 ≡ 1. Thus 2, 3, 4, and 5 are ruled out as primitive roots. We note however that 6^8 ≡ 10 and 6^20 ≡ 40 and, therefore, 6 is a primitive root modulo 41.
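The search described above is easily mechanized. The following Python sketch is ours, not the book's; it applies the test (2.50) to m = 41 and reproduces the results quoted in the text.

# A sketch of the primitive-root test (2.50) for m = 41.
def is_primitive_root(g, p, prime_factors_of_p_minus_1):
    return all(pow(g, (p - 1) // q, p) != 1 for q in prime_factors_of_p_minus_1)

p = 41                                # phi(41) = 40 = 2^3 * 5
factors = [2, 5]
for g in range(2, p):
    if is_primitive_root(g, p, factors):
        print(g)                      # prints 6, the smallest primitive root modulo 41
        break

assert pow(2, 20, 41) == 1 and pow(3, 8, 41) == 1    # 2 and 3 are ruled out
assert pow(6, 8, 41) == 10 and pow(6, 20, 41) == 40  # 6 passes the test of (2.50)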
Once a primitive root g has been found, any root of order r_1, where r_1 | r, can easily be found by raising g to the power r/r_1. Moreover, when m is composite, with m = m_1 m_2 ... m_k, we know by the Chinese remainder theorem that any root of order r_1 modulo m must also be a root of order r_1 modulo m_1, m_2, ..., m_k. Thus, these roots are easily found once the primitive roots modulo p^c and modulo 2p^c are known.
Primitive roots play a very important role in digital signal processing. We shall see in Chap. 8 that they may be used to define number theoretic transforms. Another key application of primitive roots concerns the mapping of DFTs into circular correlations, which is a crucial step in the development of the Winograd Fourier transform algorithm. We shall discuss this point in detail in Chap. 5, but we give here the essence of the technique in the simple case of a p-point DFT, with p a prime,

X_k = Σ_{n=0}^{p-1} x_n W^{nk},   k = 0, ..., p - 1   (2.51)

W = e^{-2jπ/p}.   (2.52)
The exponents and indices in (2.51) are defined modulo p. Thus, for k ≠ 0 and n ≠ 0, we can change the variables with

n ≡ g^u modulo p
k ≡ g^v modulo p,   u, v = 0, ..., p - 2,   (2.53)

where g is a primitive root modulo p. Then, for k ≠ 0, (2.51) becomes
X_{g^v} = x_0 + Σ_{u=0}^{p-2} x_{g^u} W^{g^{u+v}}.   (2.54)

This demonstrates that the main part of the computation of X_k is a circular correlation of the sequence x_{g^u} with the sequence W^{g^u}.
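The following Python sketch (ours, not the book's; numpy supplies the complex arithmetic) verifies this correlation form of the DFT for p = 5 with the primitive root g = 2.

# Mapping a p-point DFT onto a circular correlation via a primitive root, as in (2.53, 54).
import numpy as np

p, g = 5, 2                          # prime length and a primitive root modulo p
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
W = np.exp(-2j * np.pi / p)

# direct DFT
X = np.array([sum(x[n] * W**(n * k) for n in range(p)) for k in range(p)])

# re-order the nonzero indices as powers of g
perm = [pow(g, u, p) for u in range(p - 1)]             # g^0, g^1, ..., g^{p-2} modulo p
a = x[perm]                                             # x_{g^u}
b = np.array([W**pow(g, u, p) for u in range(p - 1)])   # W^{g^u}

# circular correlation: X_{g^v} = x_0 + sum_u x_{g^u} W^{g^{u+v}}
for v in range(p - 1):
    corr = sum(a[u] * b[(u + v) % (p - 1)] for u in range(p - 1))
    assert np.isclose(X[pow(g, v, p)], x[0] + corr)
assert np.isclose(X[0], x.sum())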
We have seen in the preceding section that primitive roots were closely analogous
to exponentials. We shall discuss here the concept of quadratic residues. It is
notable that this class of residues can be viewed as the equivalent of square roots
defined in the set of integers modulo an integer m.
If (a, m) = 1, a is said to be a quadratic residue of m if x^2 ≡ a modulo m has a solution. If this congruence has no solution, a is a quadratic nonresidue of m.
It is obvious, from the Chinese remainder theorem, that if a is a quadratic residue of m and if m is composite, a must be a quadratic residue of each mutually prime factor of m. Furthermore, it can be shown that a is a quadratic residue of p^c, with p an odd prime, if and only if a is a quadratic residue of p. When m = 2^c and a is an odd integer, a is a quadratic residue of 2. Moreover, a is a quadratic residue of 4 if and only if a ≡ 1 modulo 4 and a is a quadratic residue of 2^k, k ≥ 3, if and only if a ≡ 1 modulo 8. Thus, we can restrict our discussion to odd prime moduli since all other cases are deduced easily from this particular case.
In the following, we shall determine the number of distinct quadratic residues
of p and show how to check integers for quadratic residue properties. We first
establish the two following theorems.
Theorem 2.9: If p is an odd prime, the number Q(p) of distinct quadratic residues
is given by
x_1^2 ≡ x_2^2 modulo p.   (2.56)

However, x_1 + x_2 ≤ p - 1 since x_1, x_2 ≤ (p - 1)/2. Thus, x_1 + x_2 ≢ 0 modulo p and, since p is prime, (2.56) would imply that x_1 ≡ x_2. Under these conditions, all solutions are distinct and we have (p - 1)/2 distinct quadratic residues different from zero.
(a/p) = 1 if a is a quadratic residue of p
(a/p) = -1 if a is a quadratic nonresidue of p.   (2.57)

We can note immediately that the definition implies that (a/p) = 1 if p | (x^2 - a). Hence (1/p) = 1 and (a^2/p) = 1, since p | (a^2 - a^2).
In order to use Legendre's symbol for the determination of quadratic re-
sidues, we shall use a criterion introduced by Euler. We define this criterion here
without proof.
Theorem 2.11: If p is an odd prime and a is an integer, then

(a/p) ≡ a^{(p-1)/2} modulo p.   (2.58)
(53/3) = (2/3)

and, finally

Thus, (11/53) = 1. Since we have already shown that (3/53) = -1, (33/53) = (3/53)(11/53) = -1 and 33 is a quadratic nonresidue of 53. This means that it is impossible to find any integer x such that x^2 ≡ 33 modulo 53.
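Euler's criterion of theorem 2.11 gives a one-line test. The Python sketch below, not from the book, reproduces the conclusions of this example.

# Legendre's symbol computed by Euler's criterion (theorem 2.11): (a/p) = a^((p-1)/2) modulo p.
def legendre(a, p):
    r = pow(a, (p - 1) // 2, p)
    return r - p if r == p - 1 else r      # map p - 1 to -1

assert legendre(11, 53) == 1     # 11 is a quadratic residue of 53
assert legendre(3, 53)  == -1    # 3 is a quadratic nonresidue of 53
assert legendre(33, 53) == -1    # hence 33 = 3*11 is a nonresidue, as in the text
assert all(pow(x, 2, 53) != 33 for x in range(53))   # no x satisfies x^2 = 33 modulo 53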
Mersenne numbers are defined by

M_p = 2^p - 1,   p prime,   (2.63)

and Fermat numbers by

F_t = 2^{2^t} + 1,   (2.64)
where t is any positive integer. These numbers are important in digital signal
processing because arithmetic operations modulo Mersenne and Fermat num-
bers can be implemented relatively simply in digital hardware. This stems from
the fact that the machine representation of numbers, usually given in binary
notation by a B-bit number
a = Σ_{i=0}^{B-1} a_i 2^i,   a_i ∈ {0, 1}.   (2.65)
q = 2kp + 1. (2.67)
(2.69)
(2.71)
Hence, we have established (2.68) by induction. Suppose now that two Fermat numbers are not relatively prime. Then, (F_m, F_k) = d, with d ≠ 1, and we would have d | F_m and d | F_k. In this case, (2.71) would imply that d | 2. This is impossible because d would have to be even and thus could not divide any Fermat number. Hence d = 1 and all Fermat numbers are mutually prime.
Theorem 2.17: 3 is a primitive root of all prime Fermat numbers.
Any primitive root g must be a quadratic nonresidue because if it were a quadratic residue, some powers of g would not be distinct. By theorem 2.9, the number Q(F_t) of distinct quadratic nonresidues is equal to 2^{2^t - 1}. We also know, by theorem 2.8, that there are φ(F_t - 1) = 2^{2^t - 1} distinct primitive roots modulo F_t. Since Q(F_t) = φ(F_t - 1), all quadratic nonresidues are primitive roots and we need only to show that 3 is a quadratic nonresidue to prove the theorem. In order to show that 3 is a quadratic nonresidue for all Fermat numbers, we first note, by direct verification, that 3 is a primitive root modulo F_1 = 5, since 3 is a root of order 4 modulo 5. We then show, by induction, that for any Fermat number with t ≥ 1,

F_t = 12k + 5.   (2.72)
This can be seen by noting that, if F_t = 12k + 5, then F_{t+1} = (F_t - 1)^2 + 1 = (12k + 4)^2 + 1 = 12k_1 + 5. Thus, we can check whether 3 is a quadratic nonresidue by computing Legendre's symbol [3/(12k + 5)]. We have

[3/(12k + 5)] = [(12k + 5)/3] = (2/3) = -1.   (2.73)
Thus, in the ring of integers modulo F_t, 2^{2^{t-2}}(1 + 2^{2^{t-1}}) is congruent to √-2 and is therefore a root of order 2^{t+2}.
We also note that, since (2^{2^{t-1}})^2 ≡ -1 modulo F_t, -1 is a quadratic residue of F_t. This means that j = √-1 is real in the ring of integers modulo F_t, with j ≡ 2^{2^{t-1}}. We shall use this property in Chap. 8 to simplify the computation of complex convolutions.
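These properties are easy to check numerically. The following Python sketch (ours, not the book's) verifies them for the prime Fermat number F_2 = 17.

# Checking the Fermat-number properties above for F_2 = 17 (t = 2).
t = 2
F = 2**(2**t) + 1                      # F_2 = 17, a prime Fermat number

# 3 is a primitive root: its order is phi(F) = F - 1
order = next(k for k in range(1, F) if pow(3, k, F) == 1)
assert order == F - 1

# j = 2^(2^(t-1)) plays the role of sqrt(-1)
j = pow(2, 2**(t - 1), F)
assert (j * j) % F == F - 1            # j^2 = -1 modulo F_t

# 2^(2^(t-2)) * (1 + 2^(2^(t-1))) is a square root of -2, hence of order 2^(t+2)
s = (pow(2, 2**(t - 2), F) * (1 + pow(2, 2**(t - 1), F))) % F
assert (s * s) % F == F - 2
assert next(k for k in range(1, F) if pow(s, k, F) == 1) == 2**(t + 2)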
Mersenne and Fermat numbers have many other interesting properties that
cannot be discussed in detail here. Some of these properties can be found in [2.7].
Now suppose that the N elements of h_n and x_m are assigned to be the coefficients of polynomials H(z) and X(z) of degree N - 1 in z, z being the polynomial variable. Hence we have

H(z) = Σ_{n=0}^{N-1} h_n z^n   (2.75)

X(z) = Σ_{m=0}^{N-1} x_m z^m.   (2.76)
(2.78)
(2.79)
This means that the convolution of two sequences can be treated as the multiplication of two polynomials. Moreover, if the convolution defined by (2.74) is cyclic, the indices l, m, and n are defined modulo N. Thus, in N-term cyclic convolutions, we have N ≡ 0. This implies that z^N ≡ 1 and therefore that a cyclic convolution can be viewed as the product of two polynomials modulo the polynomial z^N - 1.
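A small Python sketch (ours, not the book's; numpy's convolve supplies the polynomial product) illustrating that the product modulo z^N - 1 coincides with the cyclic convolution:

# Cyclic convolution as a polynomial product reduced modulo z^N - 1.
import numpy as np

h = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([5.0, 6.0, 7.0, 8.0])
N = len(h)

# direct cyclic convolution y_n = sum_m h_m x_{(n-m) mod N}
y_direct = np.array([sum(h[m] * x[(n - m) % N] for m in range(N)) for n in range(N)])

# polynomial product H(z)X(z), then reduction modulo z^N - 1 (z^N = 1: fold the upper half back)
prod = np.convolve(h, x)               # aperiodic product, degree 2N - 2
y_poly = prod[:N].copy()
y_poly[:N - 1] += prod[N:]             # coefficient of z^(N+k) is added to that of z^k
assert np.allclose(y_direct, y_poly)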
Thus, in order to deal with convolutions analytically, one must define various
operations where the usual number sets are replaced by sets of polynomials.
These operations on polynomials bear a strong relationship to operations on in-
tegers and can be treated in a unified way by using the concepts of groups,
rings, and fields. In the following, we shall give only the flavor of these concepts,
since full details are available in any textbook on modern algebra [2.8].
2.2.1 Groups
Consider a set A of N elements a, b, c, .... These elements could be, for instance,
positive integers or polynomials. Now suppose that we can relate elements in
the set by an operation which is denoted ⊕. Again, this operation is quite general and could be, for example, an addition or a logical OR operation, the only constraint at this stage being that a, b, and c pertain to the set A, with

a ⊕ b = c.
Then, any set which satisfies the following conditions is called a group:
- Associative law: a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ c.
- Identity element. There is an element e of the group which, for any element a of the group, is such that e ⊕ a = a.
- Inverse. Every element a of A has an inverse ā which is an element of the group: a ⊕ ā = ā ⊕ a = e.
When the operation is commutative, with a ⊕ b = b ⊕ a, the group is called Abelian. The order of a group is the number of elements of this group.
Now consider a group having a finite number of elements, and the successive operations a ⊕ a, a ⊕ a ⊕ a, a ⊕ a ⊕ a ⊕ a, .... Each of these operations produces an element of the group. Since the group is finite, the sequence will necessarily repeat itself with a period r. r is called the order of the element a. If the order of an element g is the same as the order of the group, all elements of the group are generated by g with the operations g, g ⊕ g, g ⊕ g ⊕ g, .... In this case, g is called a generator and the group is called a cyclic group.
In order to illustrate these concepts, let us consider the set A of N integers
0, 1, ..., N - 1. For addition modulo N, a + b ≡ c, with a, b, c ∈ A. Moreover, a + (b + c) ≡ (a + b) + c, 0 + a ≡ a, and a + (N - a) ≡ 0. Thus, A is a group with respect to addition modulo N. This group is Abelian, since a + b ≡ b + a. It is also cyclic with the integer 1 as generator, since all elements of the
group are generated by adding 1 to the preceding element. We now consider the
set B of N - 1 integers, 1, 2, ... , N - 1 with identity element 1 and with the
addition modulo N replaced by the multiplication modulo N. B is generally
not a group with respect to multiplication modulo N, because some elements of
the set have no inverse. For instance, if N = 6, only 1 and 5 have inverses. Thus,
the set of integers 1, 2, 3, 4, 5 is not a group with respect to multiplication modulo
6. Note however that, when N is a prime, then B becomes a cyclic group. For
instance, if N = 5, the inverses of 1, 2, 3, 4 are, respectively, 1, 3, 2, 4 and
therefore the set of integers 1,2,3,4 is a group, which is cyclic with generators
2 and 3. It can be seen that the group of the N integers 0, 1, ..., N - 1 with addition modulo N has the same structure as the group of the N integers 1, 2, ..., N with multiplication modulo (N + 1), N + 1 being a prime. Such a relation be-
tween two groups is called isomorphism.
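The isomorphism can be checked directly; the Python fragment below (ours, not the book's) does so for N = 4.

# The additive group modulo 4 and the multiplicative group modulo 5 are isomorphic.
N = 4
# 1 generates {0, 1, 2, 3} under addition modulo N
assert sorted((1 * k) % N for k in range(N)) == list(range(N))
# 2 generates {1, 2, 3, 4} under multiplication modulo N + 1 = 5 (a prime)
assert sorted(pow(2, k, N + 1) for k in range(N)) == list(range(1, N + 1))
# the map k -> 2^k modulo 5 carries addition modulo 4 onto multiplication modulo 5
for a in range(N):
    for b in range(N):
        assert pow(2, (a + b) % N, N + 1) == (pow(2, a, N + 1) * pow(2, b, N + 1)) % (N + 1)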
A set A is a ring with respect to the two operations ⊕ and ⊗ if the following conditions are fulfilled:
- (A, ⊕) is an Abelian group.
- If c = a ⊗ b, for a, b ∈ A, then c ∈ A.
- Associative law: a ⊗ (b ⊗ c) = (a ⊗ b) ⊗ c.
- Distributive law: a ⊗ (b ⊕ c) = a ⊗ b ⊕ a ⊗ c and (b ⊕ c) ⊗ a = b ⊗ a ⊕ c ⊗ a.
The ring is commutative if the law ⊗ is commutative and it is a unit ring if there is one (and only one) identity element u for the law ⊗.
It can be verified easily that the set of integers is a ring with respect to addition and multiplication.
If we now require that the operation ⊗ satisfies the additional condition that every element a has one (and only one) inverse (a ⊗ ā = u), then a unit ring becomes a field. It can be verified easily that, for any prime p, the set of integers 0, 1, ..., p - 1 form a field with addition and multiplication modulo p. This field is called a Galois field and denoted GF(p).
We shall give here several important results concerning fields.
Theorem 2.19: If a, b, c are elements of a field, the condition a ⊗ c = b ⊗ c implies that a = b.
We have a ⊗ c = b ⊗ c. Thus, if ā is the inverse of b, we have ā ⊗ a ⊗ c = ā ⊗ b ⊗ c. This implies that ā ⊗ a ⊗ c = c and therefore that b = a.
A consequence of this theorem is that, if we consider the set S of the n distinct elements a_1, a_2, ..., a_n of a finite field, then the n elements a_1 ⊗ a_1, a_1 ⊗ a_2, ..., a_1 ⊗ a_n are all distinct. Since the result of the operation ⊗ is, by definition, an element of the field, the n elements a_1 ⊗ a_1, a_1 ⊗ a_2, ..., a_1 ⊗ a_n are the set S. This generalizes to any field the concept of permutation that has been introduced by (2.27) for fields of integers modulo a prime p. By using an approach quite similar to that used for fields of integers, it is also possible to show that, for all finite fields, there are primitive roots g which generate all field elements, except e, by successive operations g ⊗ g, g ⊗ g ⊗ g, ....
Another important property is that all finite fields have a number of elements which is p^d, where p is a prime. These fields are denoted GF(p^d).
In the rest of this chapter, we shall restrict our discussion to rings and fields of polynomials. In these cases, the operations ⊕ and ⊗ usually reduce to additions and multiplications modulo polynomials. In order to simplify the notation, we shall replace the special symbols ⊕ and ⊗ with the notation that has been defined for residue arithmetic. Using this notation, we first introduce residue polynomials and the Chinese remainder theorem.
where the degree of R(z) is less than the degree of P(z). This representation is unique. All polynomials having the same residue when divided by P(z) are said to be congruent modulo P(z) and the relation is denoted by

R(z) ≡ H(z) modulo P(z).   (2.83)
At this point, it is worth noting that when we deal with polynomials, we are mainly interested in the coefficients of the polynomials. Thus, if we have a set of N elements a_0, a_1, ..., a_{N-1}, arranging these elements in the form of a polynomial H(z) = a_0 + a_1 z + a_2 z^2 + ... + a_{N-1} z^{N-1} of the dummy variable z is essentially a convenient way of tagging the position of an element a_i relative to the others. This feature is very important in digital signal processing because each polynomial coefficient represents a sample of an analog signal stream and therefore defines its location and intensity.
Returning to the congruence relation (2.83), we see that two polynomials
which differ only by a multiplicative constant are congruent. Thus, residue
polynomials deal with the relative values of coefficients rather than with their
absolute values. Equation (2.83) defines equivalence classes of polynomials
modulo a polynomial P(z). It can be verified easily, by referring to the definitions
in the preceding section, that the set of polynomials defined with addition and
multiplication modulo P(z) is a ring and reduces to a field when P(z) is irreduci-
ble.
When P(z) is not irreducible, it can always be factorized uniquely into powers
of irreducible polynomials. Note however that the factorization depends on the
field of coefficients: z^2 + 1 is irreducible for coefficients in the field of rational numbers. If the coefficients are defined in the field of complex numbers, then z^2 + 1 = (z - j)(z + j), j = √-1.
Now suppose that P(z) is the product of d polynomials P_i(z) having no common factors (these polynomials are usually called relatively prime polynomials by analogy with relatively prime numbers)

P(z) = Π_{i=1}^{d} P_i(z).   (2.84)
Since each of these polynomials P_i(z) is relatively prime with all the other polynomials P_j(z), it has an inverse modulo every other polynomial. This means that we can extend the Chinese remainder theorem to the ring of polynomials modulo P(z) and therefore express uniquely H(z) as a function of the polynomials H_i(z) obtained by reducing H(z) modulo the various polynomials P_i(z). The Chinese remainder theorem is then expressed as

H(z) ≡ Σ_{i=1}^{d} S_i(z) H_i(z) modulo P(z),   (2.85)
with

S_i(z) ≡ 1 modulo P_i(z)
S_i(z) ≡ 0 modulo P_j(z),   j ≠ i   (2.86)

S_i(z) ≡ T_i(z) Π_{j=1, j≠i}^{d} P_j(z) modulo P(z)   (2.87)

T_i(z) Π_{j≠i} P_j(z) ≡ 1 modulo P_i(z).   (2.88)
Note that (2.88) implies that T_i(z) ≢ 0 modulo P_i(z). Thus, when S_i(z) is reduced modulo the various polynomials P_j(z), we obtain (2.86). Therefore, when H(z), defined by (2.85), is reduced modulo P_i(z), we obtain H_i(z) ≡ H(z) modulo P_i(z), which completes the proof of the theorem.
When computing H(z) from the various residues H_i(z) by the Chinese remainder theorem, one must determine the various polynomials S_i(z). For a given P(z), these polynomials are computed once and for all by (2.87). The most difficult part of calculating S_i(z) relates to the evaluation of the inverses T_i(z) defined by (2.88). This is done by using Euclid's algorithm, as described in Sect. 2.1, but with integers replaced by polynomials. The polynomials S_i(z) can also be computed very simply by using computer programs for symbolic mathematical manipulation [2.9-10].
Theorem 2.21: The aperiodic convolution of two sequences h_n and x_m of L_1 and L_2 points can be computed with a minimum of L_1 + L_2 - 1 general multiplications.
The two sequences are represented by the polynomials

H(z) = Σ_{n=0}^{L_1-1} h_n z^n   (2.89)

X(z) = Σ_{m=0}^{L_2-1} x_m z^m   (2.90)

and their aperiodic convolution corresponds to the polynomial product

Y(z) = H(z)X(z) = Σ_{l=0}^{L_1+L_2-2} y_l z^l.   (2.91)
Since Y(z) is of degree L_1 + L_2 - 2, Y(z) is unchanged if it is defined modulo any polynomial P(z) of degree equal to L_1 + L_2 - 1

Y(z) ≡ H(z)X(z) modulo P(z).   (2.92)

We now assume that P(z) is chosen to be the product of L_1 + L_2 - 1 first-degree relatively prime polynomials

P(z) = Π_{i=1}^{L_1+L_2-1} (z - a_i),   (2.93)
where the a_i are L_1 + L_2 - 1 distinct numbers in the field F of coefficients. Since P(z) is the product of L_1 + L_2 - 1 relatively prime polynomials, we can apply the Chinese remainder theorem to the computation of (2.92). This is done by reducing the polynomials H(z) and X(z) modulo (z - a_i), performing L_1 + L_2 - 1 polynomial multiplications H_i(z)X_i(z) on the reduced polynomials, and reconstructing Y(z) by the Chinese remainder theorem. We note however that the reductions modulo (z - a_i) are equivalent to substitutions of a_i for z in H(z) and X(z). Thus, the reduced polynomials H_i(z) and X_i(z) are the simple scalars H(a_i) and X(a_i) so that the polynomial multiplications reduce to L_1 + L_2 - 1 scalar multiplications H(a_i)X(a_i). This completes the proof of the theorem.
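A hypothetical Python illustration of this interpolation procedure, not the book's program; numpy's polyval and polyfit stand in for the reductions modulo (z - a_i) and the Chinese remainder (Lagrange) reconstruction.

# Aperiodic convolution with L1 + L2 - 1 general multiplications, as in theorem 2.21.
import numpy as np

h = np.array([1.0, 2.0])                    # L1 = 2
x = np.array([3.0, 4.0, 5.0])               # L2 = 3
L = len(h) + len(x) - 1                     # L1 + L2 - 1 = 4 interpolation points
points = np.array([0.0, 1.0, -1.0, 2.0])    # simple distinct values a_i

# one general multiplication per point: Y(a_i) = H(a_i) X(a_i)
H_vals = np.polyval(h[::-1], points)
X_vals = np.polyval(x[::-1], points)
Y_vals = H_vals * X_vals

# Lagrange interpolation recovers the product polynomial Y(z)
y = np.polyfit(points, Y_vals, L - 1)[::-1]
assert np.allclose(y, np.convolve(h, x))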
Note that this theorem provides not only a lower bound on the number of
general multiplications, but also a practical algorithm for achieving this lower
bound. However, the bound concerns only the number of general multiplica-
tions, that is to say, the multiplications where the two factors depend on the
data. The bound does not include multiplications by constant factors which
occur in the reductions modulo (z - a_i) and in the Chinese remainder reconstruction. For short convolutions, the L_1 + L_2 - 1 distinct a_i can be chosen to be simple integers such as 0, +1, -1 so that these multiplications are either trivial or reduced to a few additions. For longer convolutions, the a_i must be chosen among a larger set of distinct values. In this case, some of the a_i are no longer simple so that multiplications in the reductions and Chinese remainder
operation are unavoidable. This means that the Cook-Toom algorithm is
practical only for short convolutions. For longer convolutions, better algorithms
can be obtained by using a transform approach which we now show to be
closely related to the Cook-Toom algorithm and Lagrange interpolation.
This can be seen by noting that Y(z) is reconstructed by the Chinese remainder theorem from the L_1 + L_2 - 1 scalars Y(a_i) obtained by substituting a_i for z in Y(z). Thus, the Cook-Toom algorithm expresses a Lagrange inter-
polation process [2.9]. Since the field F of coefficients and the interpolation values can be chosen at will, we can select the a_i to be the L_1 + L_2 - 1 successive powers of a number W, provided that all these numbers are distinct in the field F. In this case, a_i = W^i and the reductions modulo (z - a_i) are expressed by
H(W^i) = Σ_{n=0}^{L_1-1} h_n W^{ni},   (2.94)

with similar relations for X(W^i). Thus, with this particular choice of a_i, the Cook-Toom algorithm reduces to computing aperiodic convolutions with transforms having the DFT structure. In particular, if W = 2 and if F is the field of integers modulo a Mersenne number (2^p - 1, p prime) or a Fermat number (2^v + 1, v = 2^t), the Cook-Toom algorithm defines a Mersenne or a Fermat transform (Chap. 8).
When W = e^{-2jπ/(L_1+L_2-1)}, j = √-1, the Cook-Toom algorithm can be viewed as the computation of an aperiodic convolution by DFTs. In this case, P(z) becomes

P(z) = z^{L_1+L_2-1} - 1.   (2.95)
(2.96)
we have
(2.97)
(2.98)
(2.99)
where φ(N_i) is Euler's totient function (Sect. 2.1.3). Thus, for circular convolu-
tions with coefficients in the field of rationals, theorem 2.21 reduces to theorem
2.22.
Theorem 2.22: An N-point circular convolution is computed with 2N-d general
multiplications, where d is the number of divisors of N, including 1 and N.
Theorems 2.21 and 2.22 provide an efficient way of computing circular convolu-
tions because the coefficients of the cyclotomic polynomials are simple integers
and can be simply 0, +1, -1, except for very large cyclotomic polynomials. When N is a prime, for instance, z^N - 1 = (z - 1)(z^{N-1} + z^{N-2} + ... + 1). Thus, the reductions and Chinese remainder reconstruction are implemented with a small number of additions and, usually, without multiplications. In order to illustrate this computation procedure, consider, for instance, a circular convolution of 3 points. Since 3 is a prime, we have z^3 - 1 = (z - 1)(z^2 + z + 1). Reducing X(z) modulo (z - 1) is done with 2 additions by simply substituting 1
for z in X(z). For the reduction modulo (z^2 + z + 1), we note that z^2 ≡ -z - 1. Thus, this reduction is also done with 2 additions by subtracting the coefficient of z^2 in X(z) from the coefficients of z^0 and z^1. When the sequence H(z) is fixed, the Chinese remainder reconstruction can be considered as the inverse of the reductions and is done with a total of 4 additions. Moreover, the polynomial multiplication modulo (z - 1) is a simple scalar multiplication and the polynomial multiplication modulo (z^2 + z + 1) is done with 3 multiplications and 3 additions as shown in Sect. 3.7.2. Thus, when H(z) is fixed, a 3-point circular convolution is computed with 4 multiplications and 11 additions as opposed to 9 multiplications and 6 additions for direct computation. We shall see in Chap. 3 that a systematic application of the methods defined by theorems 2.20-2.22 allows one to design very efficient convolution algorithms.
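The 3-point algorithm just described can be written out as follows. This is our own Python sketch, not one of the book's listings in Sect. 3.7; with h fixed, the divisions by 3 would be folded into precomputed constants so that only the four products marked m0 to m3 are general multiplications.

# 3-point circular convolution through the factorization z^3 - 1 = (z - 1)(z^2 + z + 1).
def cyclic3(h, x):
    h0, h1, h2 = h
    x0, x1, x2 = x
    # reductions modulo (z - 1) and (z^2 + z + 1)
    H1, X1 = h0 + h1 + h2, x0 + x1 + x2
    a0, a1 = h0 - h2, h1 - h2
    b0, b1 = x0 - x2, x1 - x2
    # polynomial product modulo z^2 + z + 1 with 3 multiplications
    m1, m2, m3 = a0 * b0, a1 * b1, (a0 + a1) * (b0 + b1)
    c0, c1 = m1 - m2, m3 - m1 - 2 * m2
    m0 = H1 * X1
    # Chinese remainder reconstruction (the 1/3 factors go into the fixed h side in practice)
    y0 = (m0 + 2 * c0 - c1) / 3
    y1 = (m0 - c0 + 2 * c1) / 3
    y2 = (m0 - c0 - c1) / 3
    return [y0, y1, y2]

assert cyclic3([1, 2, 3], [4, 5, 6]) == [31.0, 31.0, 28.0]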
3. Fast Convolution Algorithms
The main objective of this chapter is to focus attention on fast algorithms for
the summation of lagged products. Such problems are very common in physics
and are usually related to the computation of digital filtering processes, con-
volutions, and correlations. Correlations differ from convolutions only by virtue
of a simple inversion of one of the input sequences. Thus, although the develop-
ments in this chapter refer to convolutions, they apply equally well to correlations.
The direct calculation of the convolution of two N-point sequences requires a number of arithmetic operations which is of the order of N^2. For large convolutions, the corresponding processing load becomes rapidly excessive and, therefore, considerable effort has been devoted to devising faster computation methods. The conventional approach for speeding up the calculation of convolutions is based on the fast Fourier transform (FFT) and will be discussed in Chap. 4. With this approach, the number of operations is of the order of N log_2 N when N is a power of two.
The speed advantage offered by the FFT algorithm can be very large for long
convolutions and the method is, by far, the most commonly used for the fast
computation of convolutions. However, there are several drawbacks to the FFT,
which relate mainly to the use of sines and cosines and to the need for complex
arithmetic, even if the convolutions are real.
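For reference, a minimal Python sketch (ours, not the book's) of the FFT approach to a cyclic convolution:

# Cyclic convolution as the inverse DFT of the product of the two DFTs.
import numpy as np

N = 8
h = np.random.rand(N)
x = np.random.rand(N)

y_fft = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)))
y_direct = np.array([sum(h[n] * x[(m - n) % N] for n in range(N)) for m in range(N)])
assert np.allclose(y_fft, y_direct)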
In order to overcome the limitations of the FFT method, many other fast
algorithms have been proposed. In fact, the number of such algorithms is so
large that an exhaustive presentation would be almost impossible. Moreover,
many seemingly different algorithms are essentially identical and differ only in
the formalism used to develop a description. In this chapter, we shall attempt to
unify our presentation of these methods by organizing them into algebraic and
arithmetic methods. We shall show that most algebraic methods reduce to
various forms of nesting and yield computational loads that are often equal to
and sometimes less than the FFT method while eliminating some of its limita-
tions. We shall then present arithmetic methods which can be used alone or in
combination with algebraic methods and which allow significant processing
efficiency gains when implemented in special purpose hardware.
3.1 Digital Filtering Using Cyclic Convolutions

Most fast convolution algorithms, such as those based on the FFT, apply only
to periodic functions and therefore compute only cyclic convolutions. However,
(3.1)
The overlap-add algorithm, as an initial step, sections the input sequence x_m into contiguous blocks x_{u+vN_2} of equal length N_2, with m = u + vN_2, u = 0, ..., N_2 - 1, and v = 0, 1, 2, ... for the successive blocks. The aperiodic convolution of each of these blocks x_{u+vN_2} with the sequence h_n is then computed and yields

(3.2)

with

(3.3)

(3.4)

(3.5)
The overlap-save algorithm sections the input data sequence into overlapping blocks x_{u+vN_2} of equal length N, with m = u + vN_2, u = 0, ..., N - 1, and v taking the values 0, 1, 2, ... for successive blocks. In this method, each data block has a length N, instead of N_2 for the overlap-add algorithm, and overlaps the preceding block by N - N_2 samples. The output of the digital filter is constructed by computing the successive length-N circular convolutions of the blocks x_{u+vN_2} with the block of length N obtained by appending N - N_1 zero-valued samples to h_n. Hence, the output y_{u+vN_2} of each circular convolution is given by

y_{u+vN_2} = Σ_{n=0}^{N-1} h_n x_{⟨u-n⟩+vN_2},   u = 0, ..., N - 1,   v = 0, 1, ...   (3.6)
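The overlap-add sectioning described above can be sketched as follows. This is our own Python illustration, not the book's notation; np.convolve stands in for whichever fast aperiodic-convolution routine is applied to the blocks.

# Overlap-add: cut x into blocks of N2 samples, convolve each with the N1-point filter,
# and add the (N1 - 1)-sample tails into the following output block.
import numpy as np

def overlap_add(h, x, N2):
    N1 = len(h)
    y = np.zeros(len(x) + N1 - 1)
    for start in range(0, len(x), N2):
        block = x[start:start + N2]
        y[start:start + len(block) + N1 - 1] += np.convolve(h, block)
    return y

h = np.random.rand(5)          # N1 = 5 filter taps
x = np.random.rand(100)
assert np.allclose(overlap_add(h, x, N2=16), np.convolve(h, x))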
(3.7)
where h_n and x_m are the two input sequences of length N, can be viewed as a polynomial product modulo (z^N - 1). In polynomial notation, we have

H(z) = Σ_{n=0}^{N-1} h_n z^n   (3.9)

X(z) = Σ_{m=0}^{N-1} x_m z^m.   (3.10)
(3.12)
Y(z) is computed by first reducing the input polynomials H(z) and X(z) modulo P_i(z)
(3.16)
[Figure: block diagram of the computation of an N-term circular convolution: reductions of X(z) modulo (z - 1) and (z^N - 1)/(z - 1), a scalar multiplication and a polynomial multiplication modulo (z^N - 1)/(z - 1) by the corresponding reduced terms of H(z), followed by Chinese remainder reconstruction]
and
Except for large values of N, the cyclotomic polynomials P_i(z) are particularly simple, since the coefficients of z can only be 0 or ±1. This means that the reductions modulo P_i(z) and the Chinese remainder reconstruction are done without multiplications and with only a limited number of additions. When N is a prime, for instance, d is equal to 2, and z^N - 1 is the product of the two cyclotomic polynomials P_1(z) = z - 1 and P_2(z) = z^{N-1} + z^{N-2} + ... + 1. Thus, X_1 and X_2(z) are computed with N - 1 additions by
(3.19)
(3.20)
It can be seen that the Chinese remainder operation requires the same
number of additions as the reductions modulo the various cyclotomic poly-
nomials. This result is quite general and applies to any circular convolution, with
one of the sequences being fixed. Hence, the reductions and Chinese recon-
structions are implemented very simply and the main problem associated with
the computation of convolutions relates to the evaluation of polynomial pro-
ducts modulo the cyclotomic polynomials P_i(z).
We note first that, since the polynomials P_i(z) are irreducible, the products modulo P_i(z) can always be computed by interpolation with 2D - 1 general multiplications, D being the degree of P_i(z) (theorem 2.21). Using this method for multiplications modulo (z^2 + 1) and (z^3 - 1)/(z - 1) yields algorithms with 3 multiplications and 3 additions, as shown in Sect. 3.7.2.
For longer polynomial products, this method is not practical because it requires 2D - 1 distinct polynomials z - a_i. The four simplest interpolation polynomials are z, 1/z, (z - 1), and (z + 1). Thus, when the degree D of P_i(z) is
larger than 2, one must use integers a_i different from 0 and ±1, which implies that the corresponding reductions modulo (z - a_i) and the Chinese remainder reconstructions use multiplications by powers of a_i which have to be implemented either with scalar multiplications or by a large number of successive additions. Thus one is led to depart somewhat from the interpolation method using real integers, in order to design algorithms with a reasonable balance between the number of additions and the number of multiplications.
One such technique consists in using complex interpolation polynomials, such as (z^2 + 1), which are computed with more multiplications than poly-
nomials with real roots but for which the reductions and Chinese remainder
operations remain simple.
Another approach consists in converting one-dimensional polynomial
products into multidimensional polynomial products. If we first assume that we
want to compute the aperiodic convolution y_l of two length-N sequences h_n and x_m, this corresponds to the simple polynomial product Y(z) defined by

Y(z) = Σ_{l=0}^{2N-2} y_l z^l   (3.23)

H(z) = Σ_{n=0}^{N-1} h_n z^n   (3.24)

X(z) = Σ_{m=0}^{N-1} x_m z^m.   (3.25)

The polynomials H(z) and X(z) are of degree N - 1 and have N terms. If N is composite with N = N_1 N_2, H(z) and X(z) can be converted into two-dimensional polynomials by
(3.26)
(3.27)
(3.28)
(3.29)
(3.30)
with
(3.31)
(3.33)
(3.34)
This approach can be used recursively to cover the case of more than two dimen-
sions and it has the advantage of breaking down the computation of large poly-
nomial products into that of smaller polynomial products. Hence, a polynomial
product modulo a cyclotomic polynomial P(z) of degree N can be computed with
a multidimensional aperiodic convolution followed by a reduction modulo P(z).
In many instances, P(z) can be converted easily into a multidimensional
polynomial by simple transformations such as P(z) = Pz(ZI), Zl = ZN,. A cyclo-
tomic polynomial P(z) = (Z9 - I)J(z3 - 1) = Z6 + Z3 + 1 can be viewed, for ex-
ample, as a polynomial Pz(ZI) = zI + Zl + 1 in which Z3 is substituted to Zl.
In these cases, the multidimensional approach can be refined by calculating the polynomial product modulo P(z) as a two-dimensional polynomial product modulo (z^{N_1} - z_1), P_2(z_1). With this method, a polynomial multiplication modulo (z^4 + 1) is implemented with 9 multiplications and 15 additions (Sect. 3.7.2) by computing a polynomial product modulo (z^2 - z_1) on polynomials defined modulo (z_1^2 + 1). This is a significant improvement over the direct computation by interpolation of the same polynomial multiplication modulo (z^4 + 1), which requires 7 multiplications and 41 additions.
The main advantage of this approach stems from the fact that the polynomial multiplications modulo (z^{N_1} - z_1) can be computed by interpolation on powers of z_1. More precisely, since (z^{N_1} - z_1) is an irreducible polynomial of degree N_1, a polynomial multiplication modulo (z^{N_1} - z_1) can be evaluated with 2N_1 - 1 general multiplications by interpolation on 2N_1 - 1 distinct points (theorem 2.21). The two simplest interpolation points are z = 0 and 1/z = 0. For the 2N_1 - 3 remaining interpolation points, we note that, if the degree N_2 of the polynomial P_2(z_1) is larger than N_1, the 2N_1 points ±1, ±z_1, ..., ±z_1^{N_1-1} are all distinct. Thus, for
(3.35)
(3.36)
(3.37)
with similar relations for B_k(z) corresponding to H(z). Hence, Y(z) is computed by evaluating A_k(z), B_k(z), calculating the 2N_1 - 1 polynomial multiplications A_k(z)B_k(z) modulo P_2(z), and reconstructing Y(z) by the Chinese remainder theorem. In (3.35), the multiplications by powers of z_1 correspond to simple shifts of the input polynomials, followed by reductions modulo P_2(z). When P_2(z) = z_1^{N_2} + 1, these operations reduce to a rotation by (mk) modulo 2^u words of the input polynomials, followed by a simple sign inversion of the overflow words. Thus, (3.35) is calculated without multiplications and, when N_1 is composite, the number of additions can be reduced by use of an FFT-type algorithm. In particular, if N_1 is a power of 2 and P_2(z) = z_1^{N_2} + 1, with N_2 = 2^u, the reductions corresponding to the computation of z^{N_1} - z_1 can be computed with only N_2(2N_1 log_2 N_1 - 5) additions. We shall see now that, when one of the input sequences is fixed, the Chinese remainder reconstruction can be done with approximately the same number of additions.
H = Eh   (3.38)

X = Fx   (3.39)

Y = H ⊙ X   (3.40)

y = GY,   (3.41)

where h and x are column vectors of the input sequences h_n and x_m, E and F are the input matrices of size M × N, ⊙ denotes the element-by-element product, G is the output matrix of size N × M, and y is a column vector of the output sequence y_l. When the algorithm is designed by an interpolation method, the matrices E and F correspond to the various reductions modulo (z - a_i) while the
3.2 Computation of Short Convolutions and Polynomial Products 41
(3.42)
where en,k,fm,k, and gk,l are, respectively, the elements of matrices E, F, and G.
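This bilinear structure can be made concrete with a small numerical sketch. The matrices below correspond to the 2-term cyclic convolution algorithm of Sect. 3.7.1 (two general multiplications); the Python fragment is only an added illustration of the general form y = G[(Eh) ⊙ (Fx)] and is not taken from the book:

```python
import numpy as np

# General bilinear form of a short convolution algorithm: y = G((E h) * (F x)),
# where * is the element-by-element product of (3.40).  E, F, G below encode the
# 2-term cyclic convolution of Sect. 3.7.1 (M = 2 general multiplications).
E = np.array([[0.5, 0.5],
              [0.5, -0.5]])     # applied to the (fixed) sequence h
F = np.array([[1.0, 1.0],
              [1.0, -1.0]])     # applied to the data sequence x
G = np.array([[1.0, 1.0],
              [1.0, -1.0]])     # output reconstruction matrix

def bilinear_convolution(h, x):
    return G @ ((E @ h) * (F @ x))

print(bilinear_convolution(np.array([3.0, 5.0]), np.array([2.0, 7.0])))
# [41. 31.]  =  [h0*x0 + h1*x1,  h0*x1 + h1*x0]
```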
For a circular convolution algorithm modulo P(z) = z^N − 1, we must satisfy the
condition

S = Σ_{k=0}^{M−1} e_{n,k} f_{m,k} g_{k,l} = 1   if m + n − l ≡ 0 modulo N
S = 0   if m + n − l ≢ 0 modulo N,   (3.43)

while for a polynomial product modulo (z^N + 1) or (z^N − z₁) the corresponding condition becomes

S = 1   if m + n − l = 0
S = −1 or z₁   if m + n − l = N
S = 0   if m + n − l ≢ 0 modulo N.   (3.44)
We now replace the matrices E and G by matrices E′ and G′, with elements,
respectively, g_{k,N−m} and e_{N−l,k}. Subsequently, S becomes S′ and (3.43) obviously
implies

S′ = Σ_{k=0}^{M−1} e_{N−l,k} f_{n,k} g_{k,N−m} = 1   if m + n − l ≡ 0 modulo N.
Thus, as pointed out in [3.3], the convolution property still holds when the
matrices E and G are exchanged with simple transposition and rearrangement of
lines and columns. The same general approach can also be used for polynomial
products modulo (z^N + 1) or (z^N − z₁). However, in these cases, the conditions
S = 1 and S = −1 or z₁ in (3.44) are exchanged for m or l = 0. Thus, the
Table 3.1. Number of multiplications and additions for short cyclic convolution algorithms.

Convolution length N    Multiplications    Additions
2      2     4
3      4     11
4      5     15
5      10    31
5      8     62
7      16    70
8      14    46
8      12    72
9      19    74
16     35    155
16     33    181
Table 3.2. Number of arithmetic operations for various polynomial products modulo P(z).

P(z)                                  Degree    Multiplications    Additions
z² + 1                                2         3      3
(z³ − 1)/(z − 1)                      2         3      3
z⁴ + 1                                4         9      15
z⁴ + 1                                4         7      41
(z⁵ − 1)/(z − 1)                      4         9      16
(z⁵ − 1)/(z − 1)                      4         7      46
(z⁷ − 1)/(z − 1)                      6         15     53
(z⁹ − 1)/(z³ − 1)                     6         15     39
z⁸ + 1                                8         21     77
(z₁² + z₁ + 1)(z₂⁵ − 1)/(z₂ − 1)      8         21     83
(z₁⁴ + 1)(z₂² + z₂ + 1)               8         21     76
(z₁² + 1)(z₂⁵ − 1)/(z₂ − 1)           8         21     76
z¹⁶ + 1                               16        63     205
(z²⁷ − 1)/(z⁹ − 1)                    18        75     267
z³² + 1                               32        147    739
z⁶⁴ + 1                               64        315    1899
We list the details of a number of frequently used algorithms for the short
convolutions and polynomial products in Sects. 3.7.1 and 3.7.2. Tables 3.1 and
3.2 summarize the corresponding number of operations for these algorithms and
others. We have optimized these algorithms to favor a reduction of the number
of multiplications. Thus, the algorithms lend themselves to efficient implementa-
tion on computers in which the multiplication execution time is much greater
than that for addition and subtraction. When multiply execution time is about
the same as addition, it is preferable to use other polynomial product algorithms
in which the number of additions is reduced at the expense of an increased num-
ber of multiplications.
3.3 Computation of Large Convolutions by Nesting of Small Convolutions

The Agarwal-Cooley method requires that the length N of the cyclic convolution
y_l must be a composite number with mutually prime factors. In the following,
we shall assume that N = N₁N₂ with (N₁, N₂) = 1. The convolution y_l is given
by the familiar expression
y_l = Σ_{n=0}^{N−1} h_n x_{l−n},   l = 0, …, N − 1.   (3.46)
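For reference, (3.46) can also be evaluated directly; the short Python fragment below is an added illustration (not part of the original text) that is convenient for checking the nested algorithms discussed next against the defining formula:

```python
def cyclic_convolution(h, x):
    # Direct evaluation of (3.46): y_l = sum_n h_n x_{(l-n) mod N}
    N = len(h)
    return [sum(h[n] * x[(l - n) % N] for n in range(N)) for l in range(N)]

print(cyclic_convolution([1, 2, 3], [4, 5, 6]))   # [31, 31, 28]
```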
Since N₁ and N₂ are mutually prime and the indices l and n are defined modulo
N₁N₂, a direct consequence of the Chinese remainder theorem (theorem 2.1) is
that l and n can be mapped into two sets of indices l₁, n₁ and l₂, n₂, with
(3.49)
X_{m₁}(z) = Σ_{m₂=0}^{N₂−1} x_{N₂m₁+N₁m₂} z^{m₂}   (3.50)
(3.51)
(3.52)
(3.53)
(3.54)
By permuting the role of N₁ and N₂, the same convolution could have been com-
puted as a convolution of N₂ terms in which the scalars would have been replaced
by polynomials of N₁ terms. In this case, the number of multiplications would be
the same, but the number of additions would be A₂N₁ + M₂A₁. Thus, the
number of additions for the nesting algorithm is dependent upon the order of the
operations. If the first arrangement gives fewer additions, we must have
(3.55)
or,
(3.56)
Therefore, the convolution to perform first is the one for which the quantity
(Mᵢ − Nᵢ)/Aᵢ is smaller.
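As a numerical illustration of this rule (added here, using the operation counts of Table 3.1), consider N = 15 = 3 × 5, where the 3-point algorithm uses M₁ = 4 multiplications and A₁ = 11 additions, and the 5-point algorithm uses M₂ = 10 and A₂ = 31:

```python
# Nesting order for N = 15 = 3 x 5, with the counts of Table 3.1.
M1, A1, N1 = 4, 11, 3      # 3-point cyclic convolution
M2, A2, N2 = 10, 31, 5     # 5-point cyclic convolution
print(A1 * N2 + M1 * A2)   # 179 additions when the 3-point convolution is nested first
print(A2 * N1 + M2 * A1)   # 203 additions when the 5-point convolution is nested first
print((M1 - N1) / A1 < (M2 - N2) / A2)   # True: the 3-point stage should indeed go first
```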
When N is the product of more than two relatively prime factors N₁, N₂,
N₃, …, N_d, the same nesting approach can be used recursively by converting the
convolution of length N into a convolution of size N₁ × N₂ × N₃ × ⋯ × N_d and
(3.57)
Convolution size    Multiplications    Additions    Multiplications per point    Additions per point
3 x 3        16       77        1.78    8.56
4 x 4        25       135       1.56    8.44
5 x 5        100      465       4.00    18.60
7 x 7        256      1610      5.22    32.86
8 x 8        196      1012      3.06    15.81
9 x 9        361      2072      4.46    25.58
12 x 12      400      3140      2.78    21.81
16 x 16      1225     7905      4.79    30.88
20 x 20      2500     15000     6.25    37.50
30 x 30      6400     41060     7.11    45.62
36 x 36      9025     62735     6.96    48.41
40 x 40      19600    116440    12.25   72.77
60 x 60      40000    264500    11.11   73.47
72 x 72      70756    488084    13.65   94.15
80 x 80      122500   767250    19.14   119.88
120 x 120    313600   1986240   21.78   137.93
As indicated in the previous section, the nesting method has many desirable
features for the evaluation of convolutions. The method is particularly attractive
for short- and medium-length convolutions.
The main drawbacks of the nesting approach relate to the use of several rela-
tively prime moduli for indexing the data, the excessive number of additions for
large convolutions, and the amount of memory required for storing the short
convolution algorithm. The first point is intrinsic to the nesting method, since
one-dimensional convolutions are converted into multidimensional convolutions
only if N is the product of several relatively prime factors. If N does not satisfy
this condition, a one-dimensional to multidimensional mapping is feasible only
at the expense of a length increase of the input sequences which translates into an
increased number of arithmetic operations, as will be shown in Sect. 3.4. Thus,
one cannot hope to eliminate relatively prime moduli in the computation of one-
dimensional convolutions by the Agarwal-Cooley nesting approach.
However, the impact of the other limitations concerning storage require-
ments and the number of additions can be relieved by replacing the convolution
nesting process with a nesting of polynomial products. In this method, which is
called split nesting [3.6], the short convolutions of the Agarwal-Cooley method
are computed as polynomial products. We shall restrict our discussion of this
method to convolutions of length N₁N₂, with N₁ and N₂ distinct odd primes,
since all other cases can be deduced easily from this simple example.
Since N₁ is prime, z^{N₁} − 1 factors into the two cyclotomic polynomials (z − 1)
and P₂(z) = z^{N₁−1} + z^{N₁−2} + ⋯ + 1. The cyclic convolution y_l of two N₁-
point sequences h_n and x_m can be computed as a polynomial product modulo
(z^{N₁} − 1) with
(3.59)
(3.60)
Y(z) = Σ_{l=0}^{N₁−1} y_l z^l ≡ H(z)X(z) modulo (z^{N₁} − 1).   (3.61)
Using the Chinese remainder theorem, Y(z) is calculated as shown in Fig. 3.1
by reducing H(z) and X(z) modulo P₂(z) and (z − 1) to H₂(z), H₁ and X₂(z), X₁,
respectively, computing the polynomial products H₂(z)X₂(z) modulo P₂(z) and
H₁X₁ modulo (z − 1), and reconstructing Y(z) by
The reductions and the Chinese remainder operations always have the same
structure, regardless of the particular numerical value of N₁. Therefore, these
operations, when implemented in a computer, need not be stored as individual
procedures for each value of N₁, N₂, …, but can be defined by a single program
structure. In particular, the reductions modulo P₂(z) and (z − 1) are computed
with N₁ − 1 additions by
(3.64)
(3.65)
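The two reductions have a very simple form; the sketch below is an added illustration for a prime length N₁, consistent with the description above but not a reproduction of the book's equations:

```python
def reduce_mod_z_minus_1(x):
    # X(z) modulo (z - 1): substitute z = 1, i.e. sum the N1 coefficients
    return sum(x)

def reduce_mod_P2(x):
    # X(z) modulo P2(z) = 1 + z + ... + z^(N1-1):
    # z^(N1-1) = -(1 + z + ... + z^(N1-2)), so subtract the last coefficient from the others
    return [xi - x[-1] for xi in x[:-1]]

x = [7, 1, 4, 2, 9]               # a 5-point sequence
print(reduce_mod_z_minus_1(x))    # 23
print(reduce_mod_P2(x))           # [-2, -8, -5, -7]
```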
Thus, the only part of each convolution algorithm which needs to be indivi-
dually stored is that corresponding to the polynomial products. With this
method, the savings in storage can be quite significant. If we consider for
example a simple convolution of 15 terms, the calculation can be performed by
nesting a convolution of 3 terms with a convolution of length 5. Since the 3-point
and 5-point convolutions are computed, respectively, with 11 additions and 31
additions, a typical computer program would require about 42 instructions to
implement the short convolution algorithms in the conventional nesting method.
Alternatively, if the 3-point and 5-point convolutions are computed as polyno-
mial products, the calculation breaks down into scalar multiplications and poly-
nomial multiplications modulo (z³ − 1)/(z − 1) and modulo (z⁵ − 1)/(z − 1). Since
these two polynomial products are calculated, respectively, with 3 additions and
16 additions, a program for the split nesting approach would require a general
purpose program for reductions and Chinese remainder reconstruction modulo
a prime number, plus about 19 instructions to implement the two polynomial
products.
The implementation of a convolution of length N₁N₂ by split nesting can be
conveniently described using a polynomial representation. As a first step (similar
to that used with conventional nesting), the convolution y_l of length N₁N₂ is con-
(3.66)
(3.67)
Since N₁ and N₂ are primes, z₁^{N₁} − 1 and z₂^{N₂} − 1 factor, respectively, into the
cyclotomic polynomials (z₁ − 1), P₂(z₁) = z₁^{N₁−1} + z₁^{N₁−2} + ⋯ + 1 and (z₂ − 1),
P₂(z₂) = z₂^{N₂−1} + z₂^{N₂−2} + ⋯ + 1. Y(z₁, z₂) can therefore be computed by a
Chinese remainder reconstruction from polynomial products defined modulo
these cyclotomic polynomials with
where
and similar relations for H_{t,k}(z₁, z₂). A detailed representation of the convolution
of N₁N₂ points is given in Fig. 3.2. As shown, the procedure includes a succes-
sion of reductions modulo (z₁ − 1), (z₂ − 1), P₂(z₁), P₂(z₂) followed by one
scalar multiplication, two polynomial multiplications modulo P₂(z₁) and modulo
P₂(z₂), and one two-dimensional polynomial multiplication modulo P₂(z₁),
P₂(z₂). The convolution product is then computed from these polynomial prod-
ucts by a series of Chinese remainder reconstructions. In this approach the
polynomial product modulo P₂(z₁), P₂(z₂) is computed by a nesting technique
identical to that used for ordinary two-dimensional convolutions, with a poly-
nomial multiplication modulo P₂(z₁) in which all scalars are replaced by a poly-
nomial of N₂ − 1 terms and in which all scalar multiplications are replaced by
polynomial multiplications modulo P₂(z₂).
[Fig. 3.2. Split nesting computation of a convolution of length N₁N₂ with N₁, N₂ odd prime: ordering of the input array of N₁ × N₂ terms, reductions modulo (z₂^{N₂} − 1)/(z₂ − 1), (z₂ − 1), (z₁^{N₁} − 1)/(z₁ − 1), (z₁ − 1), the polynomial multiplications, and Chinese remainder reconstructions with reordering.]
We shall see now that the split nesting method reduces the number of ad-
ditions. Assuming that the short convolutions of lengths N₁ and N₂ are computed,
respectively, with A₁ additions, M₁ multiplications and A₂ additions, M₂ multi-
plications, the polynomial products modulo P₂(z₁) and P₂(z₂) are calculated with
M₁ − 1 and M₂ − 1 multiplications while the number of additions breaks down
into

A₁ = A₁,₁ + A₂,₁
A₂ = A₁,₂ + A₂,₂,   (3.72)
(3.73)
The computation of the polynomial product modulo P₂(z₁), P₂(z₂) is done with
(N₂ − 1)A₂,₁ + (M₁ − 1)A₂,₂ additions. Hence, the total number of additions A
for the convolution of length N₁N₂ computed by split nesting reduces to
(3.74)
Since M₁ > N₁ and N₁A₁,₂ + M₁A₂,₂ < M₁(A₁,₂ + A₂,₂) = M₁A₂, it can be
seen, by comparison with (3.54), that splitting the computations saves (M₁ − N₁)
A₁,₂ additions over the conventional nesting method, while using the same
number of multiplications.
The split nesting technique can be applied recursively to larger convolutions
of length N₁N₂N₃ ⋯ N_d and, with slight modifications, to factors that are powers
of primes. In Table 3.5, we give the number of arithmetic operations for one-
dimensional convolutions computed by split nesting of the polynomial product
algorithms corresponding to Table 3.2. It is seen, by comparing with conven-
tional nesting (Table 3.3), that the split nesting method reduces the number of
additions by about 25% for large convolutions.
In the split nesting method, the computation breaks down into the calcula-
tion of one-dimensional and multidimensional polynomial products, where the
latter products are evaluated by nesting. We have shown previously in Sect. 3.2.2
that such multidimensional polynomial products can be computed more ef-
ficiently by multidimensional interpolation than by nesting. Thus, additional
computational savings are possible, at the expense of added storage require-
ments, if the split nesting multidimensional polynomial product algorithms are
designed by interpolation.
Table 3.5. Number of arithmetic operations for cyclic convolutions computed by the split
nesting algorithm.
where H_n(z), X_m(z), and Y_l(z) are the polynomials corresponding to the real
parts of the input and output sequences, and H′_n(z), X′_m(z), and Y′_l(z) are the
polynomials corresponding to the imaginary parts of the input and output
sequences with
(3.76)
(3.77)
(3.78)
Since N₂ = 2^t, z^{N₂} − 1 = (z − 1) ∏_{v=0}^{t−1} (z^{2^v} + 1), and Y_l(z) + jY′_l(z) can be
computed by the Chinese remainder reconstruction from the various reduced
polynomials Y_{v,l}(z) + jY′_{v,l}(z) defined by
(3.81)
(3.82)
where
same concept can also be applied to other rings. For example, in a field modulo
P(z) = (z³ − 1)/(z − 1) = z² + z + 1, we have [(2z + 1)/√3]² ≡ −1 modulo
P(z). Thus, in this case, j ≡ (2z + 1)/√3 modulo P(z). Note, however, that this
approach is less attractive than when j is defined modulo (z^q + 1), q even, since
multiplications by j = (2z + 1)/√3 cannot be implemented with simple poly-
nomial rotations and additions.
(3.85)
Similarly, A₂(N, N₁), the number of additions for the digital filter, is given as a
function of A₁(N), the number of additions per output point for the cyclic con-
volutions of length N, by
(3.86)
Table 3.6. Optimum block sizes and number of operations for digital filters computed by
circular convolutions and split nesting.

Number of taps    Multiplications per point    Additions per point    Block length N
2      2.23     10.88    18
4      2.53     12.46    18
8      3.29     14.77    24
16     4.44     21.76    60
32     6.04     34.25    84
64     8.12     37.98    180
128    9.78     51.36    360
256    12.10    72.17    1260
512    14.53    94.91    2520
3.4 Digital Filtering by Multidimensional Techniques

We have seen in the preceding sections that a filtering process can be computed
as a sequence of cyclic convolutions which are evaluated by nesting. The nesting
method maps one-dimensional cyclic convolutions of length N into a multidi-
mensional cyclic convolution of size N₁ × N₂ × ⋯ × N_d provided N₁, N₂, …, N_d,
the factors of N, are relatively prime.
We shall now present a method introduced by Agarwal and Burrus [3.8]
which maps directly the one-dimensional aperiodic convolution of two length-N
sequences into a multidimensional aperiodic convolution of arrays of dimension
N₁ × N₂ × ⋯ × N_d, where N₁, N₂, …, N_d are factors of N which need not be re-
latively prime. The aperiodic convolution of two length-N sequences h_n and x_m
is given by
y_l = Σ_{n=0}^{N−1} h_n x_{l−n},   l = 0, …, 2N − 2,   (3.87)
where the output sequence y_l is of length 2N − 1 and the sequences h_n, x_m, and
y_l are defined to be zero outside their definition length. In polynomial notation,
this convolution can be considered as the product of two polynomials of degree
N − 1 with
H(z) = Σ_{n=0}^{N−1} h_n z^n   (3.89)

X(z) = Σ_{m=0}^{N−1} x_m z^m   (3.90)
Y(z) = Σ_{l=0}^{2N−2} y_l z^l.   (3.91)
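For checking purposes, the defining sum (3.87) can be evaluated directly; this Python fragment is an added illustration (not from the original text) that serves as the reference against which the multidimensional mappings below can be verified:

```python
def aperiodic_convolution(h, x):
    # Direct evaluation of (3.87): y_l = sum_n h_n x_{l-n}, l = 0, ..., 2N-2,
    # with both sequences taken as zero outside their definition length.
    N = len(h)
    y = [0] * (2 * N - 1)
    for n in range(N):
        for m in range(N):
            y[n + m] += h[n] * x[m]
    return y

print(aperiodic_convolution([1, 2], [3, 4]))   # [3, 10, 8]
```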
(3.92)
(3.93)
(3.94)
(3.95)
(3.96)
(3.97)
The various samples of y_l are then obtained as the coefficients of z in Y(z, z₁),
after setting z₁ = z^{N₁}. Hence, the aperiodic convolution y_l can be considered as
a two-dimensional convolution of arrays of size N₁ × N₂. More precisely, this
two-dimensional convolution is a one-dimensional convolution of length 2N₂ −
1 where the N₂ input samples are replaced by N₂ polynomials of N₁ terms and all
multiplications are replaced by aperiodic convolutions of two length-N₁ se-
quences
Y_{s₁}(z) = Σ_{n₁=0}^{N₂−1} H_{n₁}(z) X_{s₁−n₁}(z).   (3.98)
(3.99)
The number of additions is slightly more difficult to evaluate than for con-
ventional nesting because here the nested convolutions are noncyclic. Hence, the
input additions corresponding to the length 2N₂ − 1 convolution algorithm are
performed on polynomials of N₁ terms, while the output additions are done on
polynomials of 2N₁ − 1 terms. Moreover, since each polynomial Y_{s₁}(z) of
2N₁ − 1 terms is multiplied by z₁^{s₁} = z^{N₁s₁}, the various polynomials Y_{s₁}(z) overlap by
N₁ − 1 samples. Let A₁,₂ and A₂,₂ be the number of input and output additions
required for a length-(2N₂ − 1) aperiodic convolution and A₁ be the total
number of additions corresponding to the aperiodic convolution of length
2N₁ − 1. Then, A, the total number of additions for the aperiodic convolution
y_l, is given by
(3.100)
When N is the product of more than two factors, the same formulation can be
used recursively. Since the factors of N need not be relatively prime, N can be
chosen to be a power of a prime, usually 2 or 3. In this case, with N = 2^t or
N = 3^t, the aperiodic convolution is computed by t identical stages calculating
aperiodic convolutions of length-2 or length-3 sequences. Only one short convolution algorithm
needs to be stored and computer implementation is greatly simplified by use of the
regular structure of the radix-2 or radix-3 nesting algorithm.
We give, in Sect. 3.7.3, short algorithms which compute aperiodic convolu-
tions of sequences of lengths 2 and 3. For the first algorithm, the noncyclic con-
volution of length 2 is calculated with 3 multiplications when one of the
sequences is fixed. Using this algorithm, the aperiodic convolution of two se-
quences of length N = 2^t is computed with M multiplications, M being given by

M = 3^t.   (3.101)
The number of multiplications per input point is therefore equal to (1.5)^t. If the
same aperiodic convolution is calculated by FFT, with 2 real sequences per FFT
of dimension 2^{t+1}, as shown in Sect. 4.6, the number of real multiplications per
input point is 3(2 + t). Since 3(2 + t) increases much more slowly with t than
(1.5)^t, the FFT approach is preferred for long sequences. However, for small
values of t, up to t = 8, (1.5)^t is smaller than 3(2 + t) and the multidimensional
noncyclic nesting method yields a lower number of multiplications than the
conventional radix-2 FFT approach. This fact restricts the region of preferred
applicability for the algorithm to aperiodic convolution lengths of 256 input
samples or less.
The radix-2 nesting algorithm can also be redefined to be a radix-3 nesting
process which uses the short algorithm in Sect. 3.7.3. In this case, the aperiodic
convolution of two sequences of length 3 is calculated with 5 multiplications and
an aperiodic convolution of N = 3^{t₁} input samples is computed in t₁ stages
with M = 5^{t₁} multiplications. Thus, the number of multiplications per input
point is equal to (5/3)^{t₁}. This result can be compared, for convolutions of equal
length, to that of the radix-2 algorithm by setting 3^{t₁} = 2^t which yields t₁ = t
log₃ 2. Consequently, the number of multiplications per input point reduces to
(5/3)^{t log₃ 2} = (1.38)^t for an aperiodic convolution computed with the radix-3
nesting algorithm, as opposed to (1.5)^t for a convolution of equal length calcu-
lated by a radix-2 algorithm. Thus, the number of multiplications increases less
rapidly with convolution size for a radix-3 algorithm than for a radix-2 algo-
rithm, as can be seen in Table 3.7 which lists the number of operations for
various convolutions computed by radix-2 and radix-3 algorithms.
Table 3.7. Number of arithmetic operations for aperiodic convolutions computed by radix-2
and radix-3 one-dimensional to multidimensional mapping.

N      Multiplications    Additions    Multiplications per point    Additions per point
2      3       3       1.50    1.50
3      5       20      1.67    6.67
4      9       19      2.25    4.75
8      27      81      3.37    10.12
9      25      194     2.78    21.56
16     81      295     5.06    18.44
27     125     1286    4.63    47.63
32     243     993     7.59    31.03
64     729     3199    11.39   49.98
81     625     7412    7.72    91.51
128    2187    10041   17.09   78.45
243    3125    40040   12.86   164.77
256    6561    31015   25.63   121.15
It can be seen, for instance, that a convolution of about 256 input terms is
computed by a radix-3 algorithm with approximately half the number of multi-
plications corresponding to a radix-2 algorithm. Unfortunately, this saving is
achieved at the expense of an increased number of additions so that the
advantage of the radix-3 algorithm over the radix-2 algorithm is debatable.
A comparison with cyclic nesting techniques can be made by noting that the
output of an N-tap digital filter can be computed by sectioning the input data
sequence into successive blocks of N samples, calculating a series of aperiodic
convolutions of these blocks with the sequence of tap values, and adding the
overlapping output samples of these convolutions. With this method, the
number of multiplications per output sample of the digital filter is the same as
the number of multiplications per input point of the aperiodic convolutions,
while the numbers of additions per point differ only by (N - 1)/N. Therefore, it
can be seen that the implementation of a digital filtering process by noncyclic
nesting methods usually requires a larger number of arithmetic operations than
when cyclic methods are used. This difference becomes very significant in favor
of cyclic nesting for large convolution lengths. Thus, while an aperiodic nesting
method using a set of identical computation stages is inherently simpler to
implement than the mixed radix method used with cyclic nesting, this ad-
vantage is offset by an increase in number of arithmetic operations.
The aperiodic nesting method described above can be considered as a
generalization of the overlap-add algorithm. An analogous multidimensional
formulation can also be developed for the overlap-save technique [3.8]. This
yields aperiodic nesting algorithms that are very similar to those derived from
the overlap-add algorithm and give about the same number of operations.
3.5 Computation of Convolutions by Recursive Nesting of Polynomials

X(z) = Σ_{m=0}^{N−1} x_m z^m   (3.105)

Y(z) = Σ_{l=0}^{N−1} y_l z^l   (3.106)
Thus, the computation of Y(z) can be accomplished via the Chinese remainder
theorem. In this case, the main part of the calculation consists in evaluating the
t + 1 polynomial products modulo (z^{2^{t−1}} + 1), …, (z + 1), (z − 1). For the
polynomial products modulo (z + 1), (z − 1), however, we have z ≡ ±1 and
the polynomial multiplications reduce to simple scalar multiplications.
We shall show now that, for higher order polynomials, the process can be
greatly simplified by using a recursive nesting technique. We have seen in Sect.
3.2.2 that if N = N₁N₂, a polynomial product modulo (z^N + 1) could be com-
puted as a polynomial product modulo (z^{N₁} − z₁) in which the scalars are re-
placed by polynomials of N₂ terms evaluated modulo (z₁^{N₂} + 1). If N = 2^t and
N₁ ≥ N₂, the polynomial product modulo (z^{N₁} − z₁) can be computed by inter-
polation
(3.108)
(3.109)
mials of N₁^{d−1}N₂ terms, in the second stage by polynomials of N₁^{d−2}N₂ terms, and
in the last stage by polynomials of N₂ terms.
In this case, the total number of multiplications M for the polynomial pro-
duct of dimension N = 2^t = N₁^d N₂, computed by a radix-N₁ algorithm, is
(3.110)
or, with dt₁ + t₂ = t,
(3.111)
is the number of multiplications for the polynomial product modulo (z^{N₂} + 1).
Therefore, the total number of multiplications for the polynomial product mod-
ulo (z^N + 1) is

M = 3 ∏_{v=0}^{d−1} (2^{2^v+1} − 1).   (3.112)
Thus,
(3.113)
and, since N = 2^{2^d}, the number of multiplications per output sample, M/N,
satisfies the inequality
Thus, with this approach, the number of multiplications increases much more
slowly with N than for other methods which are based upon the use of constant
radices, or conventional cyclic or noncyclic nesting. Note that, except for the
constant multiplicative factor 9/8, the law defined by (3.114) is essentially the
same as that for convolution via the FFT method.
The mixed radix nesting technique is not limited to dimensions N such that
N = 2^t, t = 2^d. Vector lengths with t ≠ 2^d can be accommodated with a slight
loss of efficiency by using an initial polynomial other than (z² + 1) and a set
of increasing radices that are an approximation of an exponential law. We list
in Table 3.8 the arithmetic operation count for various polynomial products that
can be computed by this technique.
Table 3.8. Number of arithmetic operations for polynomial products modulo (z^N + 1),
N = 2^t, computed by mixed radix nesting.

Ring          Radices           Number of multiplications    Number of additions
z² + 1        2                 3        3
z⁴ + 1        2, 2              9        15
z⁸ + 1        2, 2, 2           27       57
z¹⁶ + 1       2, 2, 4           63       205
z³² + 1       2, 2, 2, 4        189      599
z⁶⁴ + 1       2, 2, 2, 8        405      1599
z¹²⁸ + 1      2, 2, 4, 8        945      4563
z²⁵⁶ + 1      2, 2, 4, 16       1953     10531
z⁵¹² + 1      2, 2, 2, 4, 16    5859     26921
z¹⁰²⁴ + 1     2, 2, 2, 4, 32    11907    58889
z²⁰⁴⁸ + 1     2, 2, 2, 8, 32    25515    143041
z⁴⁰⁹⁶ + 1     2, 2, 2, 8, 64    51435    304769
Table 3.9. Optimum block size and number of operations for digital filters computed by
circular convolutions of N terms, N = 2^t, and mixed radix nesting of polynomials.

Number of taps    Multiplications per point    Additions per point    Block length N
2      2.00     4.43     8
4      2.80     9.80     8
8      4.16     16.43    32
16     5.98     23.39    64
32     7.19     31.11    128
64     8.00     43.83    512
128    9.34     51.28    512
256    11.91    62.37    2048
512    12.80    76.09    8192
3.6 Distributed Arithmetic

We assume now that each word of the input sequences is binary coded with B
bits
(3.118)
(3.120)
This shows that y_l is obtained by summing along dimension b₁ the B convolu-
tions y_{l,b₁} defined by

y_{l,b₁} = Σ_{n=0}^{N−1} h_{n,b₁} ( Σ_{b₂=0}^{B−1} x_{l−n,b₂} 2^{b₂} ).   (3.121)
Each word y_{l,b₁} is a length-N convolution of the N-bit sequence h_{n,b₁} with the
N-word sequence

Σ_{b₂=0}^{B−1} x_{l−n,b₂} 2^{b₂} = x_{l−n}.

Thus, each word of y_{l,b₁} is obtained
by multiplying x_{l−n} by h_{n,b₁} and summing along dimension n. Since h_{n,b₁} can only
be 0 or 1, multiplication by h_{n,b₁} is greatly simplified and can be considered as
a simple addition of N words, where some of the words are zero.
When the sequence x_m is fixed, large savings in the number of arithmetic oper-
ations can be achieved by precomputing and storing all possible combinations of
y_{l,b₁}. Since h_{n,b₁} can only be 0 or 1, the number of such combinations is equal to
2^N. Thus, by storing the 2^N possible combinations y_{l,b₁}, the various values of
y_{l,b₁} are obtained by a simple table look-up addressed by the N bits h_{n,b₁}, and the
computation of y_l reduces to a simple shift-addition of B words. This is equi-
valent in hardware complexity to a conventional binary multiplier, except for
the memory required to store the 2^N combinations of y_{l,b₁}.
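The table look-up organization just described can be sketched as follows for a circular convolution. The Python fragment is an added illustration with unsigned B-bit words for h_n (the ±1 recoding of h_n discussed below would halve the table); it is not a reproduction of a particular hardware design:

```python
# Distributed-arithmetic circular convolution y_l = sum_n h_n x_{(l-n) mod N},
# with the sequence x fixed.  A single table of 2^N partial sums of x is precomputed;
# each output then costs B table look-ups and shift-adds, with no multiplications.
def make_table(x):
    N = len(x)
    return [sum(x[m] for m in range(N) if (pattern >> m) & 1)
            for pattern in range(1 << N)]

def da_cyclic_convolution(h, x, B, table):
    N = len(x)
    y = []
    for l in range(N):
        acc = 0
        for b in range(B):
            # bit m of the address is bit b of the word h_{(l-m) mod N}
            address = sum(((h[(l - m) % N] >> b) & 1) << m for m in range(N))
            acc += table[address] << b
        y.append(acc)
    return y

x = [3, 1, 4]; h = [2, 5, 6]; B = 3
print(da_cyclic_convolution(h, x, B, make_table(x)))   # [32, 41, 31]
```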
Thus, the hardware required to implement a short length convolution in dis-
tributed arithmetic is not particularly more complex than that corresponding to
an ordinary multiplication. In practice, the amount of memory required to store
the precomputed partial products y_{l,b₁} can be halved by coding the words h_n
with bit values equal to ±1 instead of 0, 1. In this case, the 2^N partial products
y_{l,b₁} divide into 2^{N−1} amplitude values which can take opposite signs, and only
the 2^{N−1} amplitude values need to be stored, provided that the sign inversion is
3.7 Short Convolution and Polynomial Product Algorithms

Convolution of 2 Terms
2 multiplications, 4 additions
a₀ = x₀ + x₁        b₀ = (h₀ + h₁)/2
a₁ = x₀ − x₁        b₁ = (h₀ − h₁)/2
m_k = a_k b_k,  k = 0, 1
y₀ = m₀ + m₁
y₁ = m₀ − m₁
Convolution of 3 Terms
4 multiplications, 11 additions
a₀ = x₀ + x₁ + x₂        b₀ = (h₀ + h₁ + h₂)/3
b₁ = h₀ − h₂
b₂ = h₁ − h₂
a₃ = a₁ + a₂             b₃ = (b₁ + b₂)/3
m_k = a_k b_k,  k = 0, …, 3
y₀ = m₀ + u₀
y₁ = m₀ − u₀ − u₁
y₂ = m₀ + u₁
Convolution of 4 Terms
5 multiplications, 15 additions
a₀ = x₀ + x₂        b₀ = h₀ + h₂
a₁ = x₁ + x₃        b₁ = h₁ + h₃
a₂ = a₀ + a₁        b₂ = (b₀ + b₁)/4
a₃ = a₀ − a₁        b₃ = (b₀ − b₁)/4
a₄ = x₀ − x₂        b₄ = [(h₀ − h₂) − (h₁ − h₃)]/2
a₅ = x₁ − x₃        b₅ = [(h₀ − h₂) + (h₁ − h₃)]/2
a₆ = a₄ + a₅        b₆ = (h₀ − h₂)/2
m_k = a_{k+2} b_{k+2},  k = 0, …, 4
u₀ = m₀ + m₁        u₂ = m₄ − m₃
y₀ = u₀ + u₂
y₁ = u₁ + u₃
y₂ = u₀ − u₂
Convolution of 5 Terms
10 multiplications, 31 additions
bo = ho - hz + h3 - h4
bl = hI - hz + h3 - h4
bz = ( - 2ho - 2hl + 3hz - 2h3 + 3h4)/5
h3 = - ho + hI - hz + h3
a4 = X3 - X4 b4 = - ho + hi - h2 + h4
as = a3 + a4 bs = (3h o - 2hl + 3h 2 - 2h3 - 2h4)/5
~=~-~ ~=-~+~
a7 = al -a4 b7 = hi - h2
as = a2 - as bs = (- ho - hi + 4h2 - h3 - h4)/5
a9 = Xo + XI + X2 + X3 + X4 b9 = (ho + hi + h2 + h3 + h4)/5
mk = akbk> k = 0, ... , 9
Uo = mo + m2 U3 = m4 + ms
UI = ml + m2 U4 = m6 + ms
U2 = m3 + ms Us = m7 + ms
Yo = Uo - U4 + m9
YI = - Uo - UI - U2 - U3 + m9
Y2 = U3 + Us + m9
Y3 = U2 + U4 + m9
Y4 = UI - Us + m9
Convolution of 7 Terms
16 multiplications, 70 additions
a3 = Xs - X6
a4 = ao + al
as = - ao + al
a6 = a2 + a3
a7 = - a2 + a3
bo = (-h6 - 2hs + 3h4 - h3 - 2h2 + hi + 2ho)/2
bl = (lOh6 +3h s - Ilh4 + IOh3 + 3h2 - Ilhl
- 4ho)/14
alO = as + as b2 = (- 2h6 + 3h s - h4 - 2h3 + 3h2 - hl )/6
all = a9 + a4 + a4 + as b3 = (- h6 + h4 - h3 + hl)/6
au = al b4 = 2h6 - hs - 2h4 + 3h3 - h2 - 2hl + ho
al3 = X3 - X6 bs = (- 2h6 + hs + 2h4 - h3 - 2h2 + 3hl
- ho)/2
b l4 = 2h3 - h2 - 2hl + ho
a23 = al 9+ (X2 + XI + Xo) + (X2 + XI + Xo) + X6
bls = (h6 + hs + h4 + h3 + h2 + hi + ho)J7
k = 0, ... , 15
Uo = mo + mlO UI2 = Uo + Ull
UI7 = U6 + U8
UI8 = +
UI7 U7
U21 = U20 - U7
UIO = UI + U3 U22 = (U20 + U8 + U8 + u + U79)
Yo = Ul2 + mlS
YI = UI6 + U23 + mlS
Y2 = U22 + mlS
Y3 = U21 + mlS
Y4 = UI9 + mlS
Ys = UIS + mlS
Y6 = UI4 + mlS
Convolution of 8 Terms
14 multiplications, 46 additions
ao = Xo + X4 bo = ho + h4
al = XI + Xs b l = hi + hs
a2 = X2 + X6 b2 = hz + h6
a3 = X3 + X7 b 3 = h3 + h7
a4 = a o + az b = bo + bz
4
as = al + a3 bs = bl + b3
a6 = Xo - X4 b6 = {[- (h o - h4) + (hz - h6)] - [(hi - h s)
- (h3 - h7)]} /2
a7 = XI - Xs b7 = {[- (h o - h4) + (hz - h6)] + [(hi - hs)
+ (h3 - h7)]} /2
as = Xz - X6 b s = {[(h o - h4) + (hz - h6)] + [(hi - h s) + (h3
- h7)]} /2
a9 = X3 - X7 b9 = {[(h o - h4) + (hz - h6)] + [(hi - h s) - (h3
- h7)]} /2
alO = a o - az blO = [- (b o - b z) + (b l - b3)]/4
all = al - a3 b ll = [(b o - bz) + (b l - b3)]/4
al2 = a4 + as b12 = (b4 + bs)/8
al3 = a4 - as b l3 = (b 4 - b s)/8
al4 = a7 + a9 b l4 = [(h o - h4) - (h3 - h7)]/2
alS = a6 + as b ls = [(h o - h4) + (hi - h s)]f2
al6 = al S - al 4 b l6 = (h o - h4)/2
a)7 = as - a9 b l7 = [(h o - h4) + (hz - h6)]/2
alS = a6 - a7 b ls = [- (h o - h4) + (hz - h6)]/2
al 9 = al O + all b l9 = (b o - bz)/4
mk = ak+6 bk+6• k = O•...• 13
Uz = m3 + mll Ull = U6 + Us
U3 = mll - mz UIZ = UI + U3
U4 = ml + mlZ UI3 = U7 + U9
Us = mo - m12 UI4 = Uo + U4
U6 = ml 3 - ms UIS = - U6 + Us
U, = ml 3 + m4 UI6 = UI + Us
Us = m, + m6 UI' = - U, + U9
Yo = + Ull
UIO
YI = U 12 + U13
Y2 = U14 + UIS
Y3 = UI6 + UI'
Y4 = - UIO + Ull
Ys = - UI2 + UI3
Y6 = - U14 + UIS
y, = - UI6 + UI'
Convolution of 9 Terms
19 multiplications, 74 additions
bo = - + 2h6
ho - h3
b = - hI - h4 + 2h,
l
b2 = - h2 - hs + 2hs
b3 = ho - 2h3 + h6
b4 = hI - 2h4 + h,
as = Xs - Xs bs = h2 - 2hs + hs
a6 = Xo + X3 + X6 b6 = ho + h3 + h6
a, = XI + X + X,
4 b, = hI + h4 + h,
as = X2 + Xs + Xs bs = h2 + hs + hs
a9 = + a2
ao
alO = a 3 + as
all = a 6 + a, + as b9 = (b 6 + b, + bs)/9
a12 = alO + a 4 blO = (b o + 3b + 2b 2b l 3b 2 - 3 - 4 - bsV18
al 3 = a9 + al b = (b o - b + b + 3b + 2b s)/18
ll 2 3 4
b = blO + b
l2 ll
b13 = ( - bo + b b + bs)/6 l - 4
b l4 = (b o - b2 - b3 + b4 )/6
bls = b13 +b l4
b l6 = (2b o + b l - b2 - 2b 3 + bs)/3
bl , = (2b o - b2 + b4 )f3
a20 = aO b ls = b l7 - b l6
a21 = as b l9 = (b o - bl - 2b 2 + b4)/3
a22 = a2 - as b 20 = ( - b l +b 3 - 2b s)/3
a 23 = a2 b21 = b20 - b l9
a 24 = - a 22 +ao - a4 b22 = (b o - b2 - 2b 3 + 2b s)f9
a 25 = al 9 + as - al b23 = (- bo + b2 - b3 + bs)f9
a 26 = - a 2S + a 24 b24 = b23 - b22
a 27 = a6 - as b2S = (b 6 - bs)/3
a2S = a7 - as b26 = (b 7 - bs)f3
a 29 = a27 + a 2S b27 = (b 2S + b26)/3
mk = ak+1l bk + 9 , k = 0, ... , 18
Uo = ml + m2 Ull = Us + mil + U 9
UI = m 4 + ms Ul2 = U4 - Us + U2
U2 = ml 4 + ml S UI3 = U7 + Us + m9 + U6
U3 = Uo + U I U14 = U3 + m l2 + U9 + U2
U4 = m + m3 l UIS = Uo - U + U6 I
Us = m 4 + m6 UI6 = ml 6 - ml S
U6 = ml3 + ml 5 UI7 = ml 7 - ml S
U7 = - U3 + m7 UIS = mo + UI6
Us = U4 + Us UI9 = mo - UI6 - UI7
YI = UI - m3 - m7
Y2 = Uo - m4 + m6
YJ = UI + ms + m6 + ms
Polynomial Product Modulo (Z9 - 1)/(z3 - 1)
15 multiplications, 39 additions
ao = Xo + X2
al = X3 + Xs
a2 = al + X4 b2 = (h o + 3h l + 2h2 - 2h3 - 3h4 - h s)/6
a3 = ao + XI b3 = (h o - h2 + h3 + 3h4 + 2h s)j6
b4 = b2 + b3
bs = ( - ho + hi - h4 + hs)/2
b6 = (h o - h2 - h3 + h4)/2
b7 = bs + b6
bs = 2ho + hi - h2 - 2h3 + hs
b9 = 2ho - h2 + h4
blO = b9 - bs
b ll = ho - hi - 2h2 + h4
b12 = - hi + h3 - 2hs
b13 = b l2 - b ll
U3 = Uo + UI
u4 = mo + m2
Us = m3 + ms
u6 = ml 2 + m l4
Yo = m7 + u2 + u7
YI = Us + mlo + u 9
Y2 = u4 - Us + U2
Y3 = u7 + Us + ms + U6
Y4 = u + ml1 + u9 + u2
3
Ys = U o- UI + u6
Polynomial Product Modulo (Z7 - 1)/(z - 1)
15 multiplications, 53 additions
ao = Xo + X2 bo = (- 2hs + 3h4 - h3 - 2h2 + hi + 2ho)/2
al = ao + XI bl = (3h s - llh4 + 10h 3 + 3h 2 - llhl - 4h o)/14
a2 = al + X2 b 2 = (3h s - h4 - 2h3 + 3h2 - h l )/6
a3 = X3 + Xs b3 = (h 4 - h3 + h )/6
l
b13 = (- h3 + h )/6
l
b l4 = 2h3 - hz - 2hl + ho
U4 = m9 + m 4
Us = mlO + mO
U6 = mil + ml
U7 = ml2 + mZ
Us = ml 3 + m3
U9 = ml + m 4 4
Ull = UIO + Uz
Ul2 = U o + Ull
Yo = UI S Y3 = Ul 4 + UZI
Yl = Ul6 + Ul 9 Y4 = Ul2 + UZZ
Yz = UI S + UZ3 Ys = Uzo
aZ = Xo - Xz
as = Xs + X7
as = a O + al bo = (ho + hi + h2 - h3 + h4 + hs + h6 + h7)/4
a9 = a4 + as bl = (- ho - hi - h2 - h3 + h4 + hs + h6 - h7)/4
alO = as +a 9 b 2 = (h3 - h4 - hs - h6)/4
b3 = (5ho - 5h l + 5h2 - 7h3 + 5h4 - 5h s + 5h6
- h7 )/20
b4 = ( - 5h o + 5h l - 5h 2 + h3 + 5h4 - 5h s + 5h6
-7h7 )/20
al3 = all + a12 bs = (3h3 - 5h4 + 5h s - 5h6 + 4h7)/20
al4 = a2 + a3 b6 = (h o + hi - h2 - h3 + h4 - hs - h6 + 3h7)/4
GIS = G6 +a 7 b7 = ( - ho + hi + h2 - 3h3 + h4 + hs - h6 - h7)/4
U2 = m3 + ms
u3 = m4 + ms
U4 = m6 + ms
U7 = mID + mll
Us = m20 + m l9
U9 = + Us
ml 2 U 2S = Us + Us
UID = U + ml
9 3 U26 = U2S + U2S
UI2 = Uo - U2
3 multiplications, 3 additions
(1 input addition, 2 output additions)
ao = Xo bo = ho
bl = ho + hI
b2 = hI
k = 0, ... ,2
Yo = mo
5 multiplications, 20 additions
(7 input additions, 13 output additions)
ao = XI + X2
a2 = Xo bo = h o/2
a3 = Xo + ao bl = (h o + hi + h 2)/2
a4 = Xo + al b2 = (h o - hi + h 2)/6
a6 = X2 b4 = h2
U2 = mo + mo Us = ml + m2
Yo = U2 Y3 = - U4 - US
YI = UI - U3+ U4 Y4 = m4
Y2 = - Uz + U3 + Us - m 4
4. The Fast Fourier Transform
The object of this chapter is to briefly summarize the main properties of the
discrete Fourier transform (DFT) and to present various fast DFT computation
techniques known collectively as the fast Fourier transform (FFT) algorithm.
The DFT plays a key role in physics because it can be used as a mathematical
tool to describe the relationship between the time domain and frequency do-
main representation of discrete signals. The use of DFT analysis methods has
increased dramatically since the introduction of the FFT in 1965 because the
FFT algorithm decreases by several orders of magnitude the number of arithme-
tic operations required for DFT computations. It has thereby provided a prac-
tical solution to many problems that otherwise would have been intractable.
4.1 The Discrete Fourier Transform

X_k = Σ_{m=0}^{N−1} x_m W^{mk},   k = 0, …, N − 1,   W = exp(−2πj/N).   (4.1)
It can easily be verified that (4.2) is the inverse of (4.1) by substituting Xk , given
by (4.1), into (4.2). This yields
y_l = Σ_{m=0}^{N−1} x_m (1/N) Σ_{k=0}^{N−1} W^{(m−l)k}.   (4.3)
y_l = Σ_{n=0}^{N−1} h_n x_{l−n},   l = 0, …, N − 1.   (4.4)
Using the same procedure as above, one finds that S = Σ_{k=0}^{N−1} W^{(m+n−l)k} be-
comes S = 0 for m + n − l ≢ 0 modulo N and S = N for m + n − l ≡ 0
modulo N. Thus, m ≡ l − n and y_l = c_l. Hence an N-point circular convolution
is calculated by three DFTs plus N multiplications. When the DFTs are com-
puted directly, this approach is not of practical value because each DFT is
computed with N² multiplications whereas the direct computation of the con-
volution requires N² multiplications. We shall see, however, that the method be-
comes very efficient when the DFTs are evaluated by a fast algorithm.
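This is the familiar transform method of computing convolutions; as an added illustration (using numpy's FFT routines rather than any particular algorithm of this book), the three-transform procedure can be written as:

```python
import numpy as np

# Circular convolution by three DFTs plus N pointwise multiplications.
def cyclic_convolution_fft(h, x):
    return np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)).real   # real input sequences assumed

h = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([5.0, 6.0, 7.0, 8.0])
print(np.round(cyclic_convolution_fft(h, x)))   # [66. 68. 66. 60.]
```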
(4.6)
Assuming a second sequence hn and its DFT Hk , we now establish the following
DFT properties.
Linearity
These properties follow directly from the definitions (4.1) and (4.2).
Symmetry
(4.9)
Time Shifting
A_k = Σ_{m=0}^{N−1} x_{m+1} W^{mk} = Σ_{m=0}^{N−1} x_{m+1} W^{(m+1−1)k}.   (4.12)
Hence
(4.13)
Frequency Shifting
This property follows directly from the proof given for the time shifting property
by replacing the direct DFT with an inverse DFT.
(4.15)
(4.17)
(4.18)
(4.19)
with
(4.20)
Since h_n is real, (4.20) therefore implies that Ĥ_{−k} = Ĥ_k*, where Ĥ_k* is the complex
conjugate of Ĥ_k.
Parseval's theorem
(4.21)
(4.22)
In many practical applications, the input sequence x_m is real. In this case, the
DFT X_k of x_m has special properties that can be found by rewriting (4.1) as

X_k = Σ_{m=0}^{N−1} x_m cos(2πmk/N) − j Σ_{m=0}^{N−1} x_m sin(2πmk/N).   (4.23)
Since x_m is real, (4.23) implies that the real part Re{X_k} of X_k is even and that the
imaginary part Im{X_k} of X_k is odd
Re{X_k} = Re{X_{−k}}   (4.24)
Re{X_k} = −Re{X_{−k}}   (4.27)
(4.30)
(4.31)
Hence
Then, by using the symmetry property of pure real and pure imaginary sequences
Ym = Xm wm + x!,. (4.40)
(4.41)
Since Xm and x! are conjugate symmetric sequences, their DFTs Xk and Xl are
such that Xk+l = X-k- 1 and Xl = X~k' This implies that
(4.42)
N-I
XJ = L:
m-O
x!,. (4.44)
(4.46)
Similar techniques can be used to compute the DFTs of two conjugate antisym-
metric sequences or of four real even or odd sequences in one transform step.
(4.47)
If N is the product of two factors, with N = N₁N₂, we can redefine the indices m
and k by
(4.48)
(4.49)
(4.50)
(4.51)
which shows that the DFT of length N₁N₂ can be viewed as a DFT of size N₁ ×
N₂, except for the introduction of the twiddle factors W^{m₁k₂}. Thus, the computa-
tion of X_k by (4.51) is done in three steps, with the first step corresponding to the
evaluation of the N₁ DFTs Y_{m₁,k₂} corresponding to the N₁ distinct values of m₁
(4.52)
(4.53)
Note that the computation procedure could have been organized in reverse
order, with the multiplications by the twiddle factors preceding the evaluation
of the first DFTs instead of being done after the calculation of these DFTs. In
this case,
(4.54)
Hence there are generally two different forms for the FFT algorithm, each being
equivalent in terms of computational complexity. It should be noted that, in
both procedures, the order of the input and output row-column indices are per-
muted. Thus, while the input sequence can be viewed as N₂ polynomials of N₁
terms, the output sequence is organized as N₁ polynomials of N₂ terms. This
implies that a permutation step must be added at the end of the three basic steps
described above to complete the FFT procedure.
The FFT algorithm derives its efficiency by replacing the computation of one
large DFT with that of several smaller DFTs. Since the number of operations
required to directly compute an N-point DFT is proportional to N², the number
of operations decreases rapidly when the computation structure is partitioned
into that of many small DFTs. In the simple case of a DFT of length N₁N₂, the
(4.55)
which is obviously less than N₁²N₂². In practice, the FFT algorithm is extremely
powerful because the procedure can be used iteratively when N is highly compo-
site. In such cases, and with the two-factor decomposition discussed above, N₁
and N₂ are composite and the DFTs of lengths N₁ and N₂ are again computed
by an FFT procedure. With this approach, each stage provides an additional
reduction in number of operations so that the algorithm is the most efficient
when N is highly composite. This feature, together with the need for a regular
computational structure, motivates application of the FFT algorithm to DFT
lengths which are a power of an integer. In most cases, N is chosen to be a
power of two, and this was the original form of the FFT algorithm [4.6].
X_k = Σ_{m=0}^{N/2−1} x_{2m} W^{2mk} + W^k Σ_{m=0}^{N/2−1} x_{2m+1} W^{2mk}   (4.56)

X_{k+N/2} = Σ_{m=0}^{N/2−1} x_{2m} W^{2mk} − W^k Σ_{m=0}^{N/2−1} x_{2m+1} W^{2mk},   k = 0, …, N/2 − 1.   (4.57)
M = (N/2) log₂ N   (4.58)
(4.59)
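A compact recursive form of the decimation in time relations (4.56), (4.57) is sketched below in Python as an added illustration (with W = exp(−2πj/N); the length of the input is assumed to be a power of two):

```python
import cmath

def fft_dit(x):
    # Radix-2 decimation in time FFT: split x into even- and odd-indexed samples,
    # transform each half, and combine with the twiddle factors W^k as in (4.56), (4.57).
    N = len(x)
    if N == 1:
        return list(x)
    even = fft_dit(x[0::2])
    odd = fft_dit(x[1::2])
    X = [0] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        X[k] = even[k] + t
        X[k + N // 2] = even[k] - t
    return X

print([round(abs(v), 3) for v in fft_dit([1, 1, 1, 1, 0, 0, 0, 0])])
# [4.0, 2.613, 0.0, 1.082, 0.0, 1.082, 0.0, 2.613]
```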
The decimation in time approach is illustrated in Fig. 4.1 for an 8-point DFT.
[Fig. 4.1. Signal flow graph of the 8-point decimation in time FFT (stages i = 1, 2, 3, nodes x₀, …, x₇).]
In
this signal flow graph, each node represents a variable and each arrow terminat-
ing at a node represents the additive contribution of the variable at the originat-
ing node of the arrow. Multiplications by a constant are represented by the
constant written near the arrowhead.
A second form of the FFT algorithm can be obtained by simply splitting the
N-point input sequence x_m into two (N/2)-point sequences x_m and x_{m+N/2} cor-
responding, respectively, to the N/2 first samples and to the N/2 last samples of
x_m. With this approach, called decimation in frequency, X_k becomes
X_k = Σ_{m=0}^{N/2−1} (x_m + W^{kN/2} x_{m+N/2}) W^{mk}.   (4.60)
X_{2k} = Σ_{m=0}^{N/2−1} (x_m + x_{m+N/2}) W^{2mk},   k = 0, …, N/2 − 1.   (4.61)
Thus X_k is computed by (4.61) and (4.62) in terms of two DFTs of length N/2,
but with a premultiplication by W^m of the input sequence in (4.62). Therefore,
the computation of a DFT of N terms is replaced by that of two DFTs of N/2
terms at the cost of N complex additions and N/2 complex multiplications. As
with the decimation in time algorithm, the same procedure can be used recursive-
ly to compute the DFT in log₂ N stages, each stage converting 2^i DFTs of length
2^{t−i} into 2^{i+1} DFTs of length 2^{t−i−1} at the cost of N additions and N/2 multipli-
cations. This means that the decimation in frequency algorithm requires the
same number of operations as the decimation in time algorithm. The compu-
tation structure for decimation in frequency is shown in Fig. 4.2 for N = 8. It
can be seen that the flow graph has the same geometry as the decimation in time
flow graph, but different coefficients.
Since the FFT algorithm computes a DFT with N log N operations instead of
N² for the direct approach, the practical reduction of the computation load can be
very large. In the case of a 1024-point DFT, for instance, we have N = 2¹⁰ and
the direct computation requires 2²⁰ complex multiplications. On the other hand,
the FFT algorithm computes the same DFT with only 5·2¹⁰ complex multipli-
cations, or about 200 times fewer multiplications. Significant additional reduction
can be obtained by noting that a number of the multiplications are trivial multi-
plications by ±1 or ±j.
In the case of a decimation in time algorithm, the twiddle factors in the first
stage are given by W^{kN/2} = (−1)^k. Thus all multiplications in the first stage are
trivial. The multiplications in the second stage are also trivial because the
twiddle factors are then defined by W^{kN/4} = (−j)^k. In the following stages, the
twiddle factors are given by W^{kN/8}, W^{kN/16}, … and the number of trivial multi-
plications is N/4, N/8, …. Under these conditions, the number of nontrivial
complex multiplications becomes (N/2)(log₂ N − 3) + 2 and, if the complex
multiplications are implemented with 4 real multiplications and 2 real additions,
the numbers of real multiplications M and real additions A required to imple-
ment the radix-2 FFT algorithm are
Hence significant additional reduction is obtained when full use is made of the
symmetries of the sine and cosine functions. In the case of a DFT of 1024 points,
for instance, the straightforward computation by (4.56) would require 20·2¹⁰
real multiplications, as opposed to about 10·2¹⁰ real multiplications when the
FFT is computed by the approach corresponding to (4.69).
(4.71)
and, since W^{N/4} = −j,
(4.72)
(4.73)
X_{k+3N/4} = Σ_{l=0}^{3} j^l W^{lk} Σ_{m=0}^{N/4−1} x_{4m+l} W^{4mk}
(4.76)
(4.77)
(4.78)
With regard to multiplications, we note that the twiddle factors in the successive
stages are given by W^{lk}, W^{4lk}, W^{16lk}, …. Thus, the twiddle factors take the values
W^{lk4^i} for i = 0, 1, …. Each stage i splits the computation of 4^i DFTs of length
N/4^i into that of 4^{i+1} DFTs of length N/4^{i+1}, with k = 0, …, N/4^{i+1} − 1. Since
the total number of multiplications by twiddle factors per stage is N, each stage
divides into 4^i groups of twiddle factors W^{lk4^i}, with k = 0, …, N/4^{i+1} − 1 and
l = 0, …, 3. For the last stage, the twiddle factors correspond to powers of W^{N/4} and are
computed by trivial multiplications by ±1, ±j. For the other stages and l = 1,
3, the only simple multiplications correspond to k = 0 and k = N/(2·4^{i+1}). These
cases correspond, respectively, to a multiplication by 1 and a multiplication by
W₈^l = [(1 − j)/√2]^l. For l = 0, we have W^{lk} = 1. For l = 2, the multiplications
by W^{lk4^i} are implemented with 2 trivial multiplications, 2 multiplications by an
odd power of W₈, and (N/4^{i+1}) − 4 complex multiplications. Since we have 4^i
groups per stage, this corresponds to (3N/4) − 8·4^i complex multiplications
and 4^{i+1} multiplications by odd powers of W₈ per stage. Moreover, N = 2^t =
4^{t/2}. We must therefore sum these numbers of multiplications over (t/2) − 1
stages. Thus, the number M₁ of nontrivial complex multiplications is given by
M₂ = (N − 4)/3.   (4.82)
M = (3N/2) log₂ N − 5N + 8   (4.83)
Table 4.1. Number of nontrivial real operations for radix-2 and radix-4 FFTs where the
complex multiplications are implemented with 3 real multiplications and 3 real additions and
where the symmetries of the trigonometric functions are fully used

N       Radix-2 multiplications    Radix-2 additions    Radix-4 multiplications    Radix-4 additions
4       0        16       0        16
16      24       152      20       148
64      264      1032     208      976
256     1800     5896     1392     5488
1024    10248    30728    7856     28336
4096    53256    151560   40624    138928
a one-stage radix-2 FFT. When properly designed, such mixed radix methods
can be optimum from the standpoint of the number of arithmetic operations,
but the additional computational savings are achieved at the expense of a some-
what more complex implementation.
(4.87)
(4.88)
where x_l^{i−1} and x_l^i are, respectively, the input and output data samples corre-
sponding to the ith stage. Since the input and output samples in (4.87, 88) have
the same indices, the computation may be executed in place, by writing the out-
put results over the input data. Thus, the FFT may be implemented with only N
complex storage locations, plus auxiliary storage registers to support the butter-
fly computation.
The complex value of W^d as a function of index l and stage i can be deter-
mined by using a bit-reversal method. This is done by writing l as a t-bit binary
number, scaling this number by t − i bits to the right, and reversing the order of the
bits. Thus, if we consider, for instance, the node x₅² corresponding to the second
stage of the 8-point DFT illustrated in Fig. 4.1, we have l = 5 and i = 2. In this
case, l is 101 in binary notation. Scaling by t − i = 1 bit to the right yields
010. Finally, d is obtained by reversing 010, which gives again 010, or the integer 2,
and we have x₅² = x₅¹ + W²x₇¹.
The bit-reversal process can also be implemented very simply by counting
in bit-reversed notation. For an 8-point DFT, a conventional 3-bit counter
yields the successive integers 0, 1,2,3,4,5,6,7. If the counter bit positions are
reversed, we have 0, 4, 2,6, 1,5,3,7, which gives the one-to-one correspondence
between the natural order sequence and the bit-reversed order sequence.
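The bit-reversed counting just described is easy to reproduce; the following added Python fragment generates the correspondence quoted above for an 8-point DFT:

```python
def bit_reverse(k, t):
    # Reverse the t-bit binary representation of k.
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

print([bit_reverse(k, 3) for k in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```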
The coefficients W^d may also be computed, in each stage, via a recursion
formula with
(4.89)
These coefficients may be precomputed and stored for each stage in order to
save computation time at the expense of increased memory.
The algorithm illustrated in Fig. 4.1 produces the DFT output samples Xk
in bit-reversed order. Thus, these samples must usually be reordered at the end
of the computation by performing a bit-reversal operation on the indices k. We
shall see in Sect. 4.6, however, that this operation is unnecessary when the DFTs
are used to compute convolutions.
The foregoing considerations also apply generally to algorithms using
radices greater than 2, and a Fortran program for these FFT forms can be found
in [4.8]. In practice, there are many variations of the basic FFT algorithm which
correspond to different trade-offs between speed of execution and memory re-
quirements. It is possible, for instance, to devise schemes with identical geometry
from stage to stage or with input data and output data in natural order. When
the FFT is programmed in a high-level language with sophisticated functions for
the manipulation of arrays such as APL [4.9], the implementation can be
strikingly simple. This is well illustrated in the radix-2 FFT program designed
by McAuliffe and reprinted in Fig. 4.3 with the kind permission of the author.
V Z+TF33 A,K,M,W,O,P,Q,R,S,V,N
[1] W+ 2 1 o.OO(.~(2.P)p(PpV+O).-O-V[-O-2xtP1HP.OpO+t1.0pS+2.N.OpR+
(M+1)p2.0pZ+A[,V+.(~tM)~«K+M+1+2.P+O.5xN)p2)ptN+-1tpA]
[2] +(O<K+K-1)/2.0pW+W[,.~(2.P)pIP]+O.OpZ+Sp(-/[O] WxZ).+/[O] WxeZ+S
p«O+K).«-K)~O.Mp1)/IM+1)~Rp( .+/[K+O] Z) •• -/[K+O] Z+RpZ
v
This APL program uses just two instructions, the first one for generating
coefficient values and the second for performing the actual data computation.
In this program, the N-point DFT is computed by executing
Z-TF33 A, (4.90)
where TF33 is the name of the FFT subroutine and A is an array of 2 lines and N
columns, the first line representing the real part of the input sequence and the
second line representing the imaginary part of the input sequence. The output
sequence is given by the array Z, which has the same structure as the input data
array A. The reader is cautioned, however, to note that this program actually
computes an inverse DFT rather than the direct DFT as defined by (4.1).
We give the execution times for various DFT lengths computed with this
program on an IBM 370/168 computer operating on APL under VM370 in
Table 4.2. These figures can be compared with the execution times for direct
computation of the same DFTs in APL with the same system. It can be seen
that the reduction in arithmetic load made possible by the FFT algorithm does
translate into a comparable reduction in execution time. This is quite apparent
for large DFTs and, for instance, a 1024-point DFT is calculated in only 791 ms
via the FFT program, as opposed to 165335 ms for direct computation.
Table 4.2. Comparative execution times in milliseconds for DFTs computed by the FFT prog-
ram of Fig. 4.3 and by direct computation. IBM 370j168-APL VM370
N       FFT program [ms]    Direct computation [ms]
4       17      16
8       24      32
16      32      80
32      43      234
64      63      776
128     109     2840
256     188     10765
512     368     41708
1024    791     165335
Since the FFT is implemented with finite precision arithmetic, the results of the
computation are affected by the roundoff noise incurred in the butterfly calcula-
tions, the scaling of the data, and the approximate representation of the coeffi-
cients Wd. These effects have been studied for fixed point and floating point com-
putations [4.10-12]. We shall restrict our discussion here solely to fixed point
radix-2 FFT algorithms.
Consider first the impact of scaling. At each stage, we must compute the
butterflies,
(4.91)
Thus, the magnitude of the signal samples tends to increase at each stage, the
upper bounds on the modulus of x_l^i being given by
Hence the signal magnitude increases by a maximum of one bit at each stage and
a scaling procedure is needed to avoid overflow. An especially efficient scaling
procedure would be to compute each stage without scaling, then to scale the
entire sequence by one bit, only if an overflow is detected. Alternatively, a
simpler, but less efficient method based upon systematic scaling by one bit at
each stage can also be employed. In this case, the implementation is simple, but
suboptimum. Nevertheless, an evaluation of the quantization effects using this
simple scheme provides an upper bound on quantization noise. Thus, in the
following analytical development, we shall assume that the data is scaled by one
bit at each stage.
It is well known [4.1] that if the product of two B-bit numbers is rounded to
B bits, the error variance is given by
(4.94)
Moreover, when two B-bit numbers are added together, the sum may be a
(B + I)-bit number. Thus, when there is an overflow, the sum must be scaled by
1/2 and one bit is lost. The variance of the corresponding error is
(4.95)
We shall now assume that errors are uncorrelated and that an overflow occurs
at each stage. Since the data input at the first stage of the transform is scaled by
1/2, the variance V(x,,) of x" is given by
(4.96)
(4.97)
where the factor of 4 accounts for the fact that the error caused by scaling at the
first stage is twice the error at the zeroth stage. Similarly, the second stage im-
plements multiplications by only ± 1, ±j and we have
(4.98)
(4.99)
In the third stage, half the butterfly operations are nontrivial. For these, we have
(4.100)
which yields
where the bars over the symbols represent here an average over the sequence.
Thus,
(4.102)
where the first term in (4.102) is the variance of the first term in (4.100) and
the two next terms in (4.102) correspond to the complex multiplication. The
terms 4³σ² derive from the rounding after addition and 4³·6σ² corresponds to
rescaling. We now define λ as the average squared modulus of the input sequence
(4.103)
(4.104)
However, since the second and third terms in (4.104) appear only when the
multiplications are nontrivial and, since half the multiplications in the third stage
are trivial, Vex!) reduces to
(4.105)
(4.106)
and, assuming a similar computation procedure for all other stages, we have, for
the last stage,
(4.107)
Since the mean square of the absolute values of the output sequence X k (we
delete here our usual bar sign on transforms in order to avoid confusion with
averaging) is 2^t λ, the ratio of rms noise output to rms signal output is, for large
DFTs,
which demonstrates that the error-to-signal ratio of the FFT process increases
as √N, or 1/2 bit per stage.
Another source of error is due to the use of truncated coefficients. Wein-
stein [4.12] has shown, by a simplified statistical analysis, that this effect tran-
slates into an error-to-signal ratio which increases very slowly with N. Experi-
mental results have tended to confirm this analysis result.
via a decimation in frequency radix-2 FFT form, which for k even, yields
Thus, the first stage of the decimation in frequency FFT decomposition replaces
one DFT of length N by two DFTs of length N/2 at the cost of N complex addi-
tions and N/2 complex multiplications. In order to simplify the calculation of
the DFT X_{2k+1}, we define the (N/2)-point auxiliary sequence a_m by
(4.114)
or
(4.115)
X_{2k+1} = A_k + A_{k+1} + V₀   for k even
X_{2k+1} = A_k + A_{k+1} + V₁   for k odd   (4.116)
with
(4.117)
(4.118)
Under these conditions, the N/2 complex multiplications by the twiddle factors
W^m in the first stage are replaced with (N/2) − 2 multiplications by the pure real
numbers 1/[2 cos(2πm/N)]. Note here that the contributions of x₀ − x_{N/2} and
x_{N/4} − x_{3N/4} must be treated separately, because cos(2πm/N) = 0 for m = N/4.
The same method is used recursively to compute the (N/2)-point transforms
X_{2k} and A_k, and then the transforms of dimensions N/4, N/8, … until complete
decomposition is achieved.
Since the multiplication of a complex number by a scalar value is imple-
mented with two real multiplications, each stage is computed with N − 4 nontrivial
real multiplications. We need also N complex additions for evaluating x_m +
x_{m+N/2} and x_m − x_{m+N/2}, plus N + 2 complex additions for calculating (4.116-118).
However, two complex additions are saved in the computation of A_k because
a₀ = 0 and a_{N/4} = 0. Thus, for each stage, the number of real multiplications
M and real additions A become
M = N − 4   (4.119)
A = 4N.   (4.120)
Table 4.3. Number of nontrivial real operations for complex DFTs computed by the Rader-
Brenner method
N     Multiplications   Additions   Mult. per point   Add. per point
8 4 52 0.50 6.50
16 20 148 1.25 9.25
32 68 424 2.12 13.25
64 196 1104 3.06 17.25
128 516 2720 4.03 21.25
256 1284 6464 5.02 25.25
512 3076 14976 6.01 29.25
1024 7172 34048 7.00 33.25
2048 16388 76288 8.00 37.25
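The per-stage counts (4.119, 120) translate into simple recursions for the totals. As a hedged illustration (ours, not from the book), the sketch below reproduces the multiplication column of Table 4.3 from M(N) = 2M(N/2) + N − 4 with M(8) = 4 taken from the table; the addition recursion A(N) = 2A(N/2) + 4N with the base values A(8) = 52 and A(16) = 148 is an assumption consistent with the tabulated figures, the shortest transforms being handled specially.

# Hedged sketch: Rader-Brenner operation counts built from the per-stage
# counts M = N - 4 and A = 4N; base values are read from Table 4.3.
def rader_brenner_counts(N, base_mults={8: 4}, base_adds={8: 52, 16: 148}):
    def M(n):
        return base_mults[n] if n in base_mults else 2 * M(n // 2) + n - 4
    def A(n):
        return base_adds[n] if n in base_adds else 2 * A(n // 2) + 4 * n
    return M(N), A(N)

for N in (8, 16, 32, 64, 128, 256, 512, 1024, 2048):
    m, a = rader_brenner_counts(N)
    print(N, m, a, round(m / N, 2), round(a / N, 2))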
Table 4.4. Number of nontrivial real operations for complex reduced DFTs computed by
the Rader-Brenner algorithm
N     Multiplications   Additions
8 16 64
16 48 212
32 128 552
64 320 1360
128 768 3232
256 1792 7488
512 4096 17024
1024 9216 38144
Y_k = Σ_{m=0}^{N/2−1} y_m W^m W^{2mk},   k = 0, ..., N/2 − 1   (4.121)
Such a modified DFT, which occurs naturally in the first stage of a decimation
in frequency FFT algorithm, is often called a reduced DFT [4.16] or an odd
DFT [4.17] and is used, for instance, in the computation of multidimensional
DFTs by polynomial transforms (Chap. 7). The Rader-Brenner algorithm
applies directly to the calculation of such reduced DFTs and we give, in Table
4.4, the number of nontrivial real operations for reduced DFTs computed via
this method.
(4.123)
(4.124)
(4.125)
This approach is often called the row-column method because it can be viewed
as equivalent to organizing the input data into sets of row and column vectors
in an array of size N_1 × N_2 and computing, in sequence, first the DFTs of the
columns and then the DFTs of the rows. With this technique, the two-dimensional
DFT is mapped, respectively, into N_1 DFTs of N_2 terms plus N_2 DFTs
of N_1 terms. If N_1 and N_2 are powers of two, the one-dimensional DFTs of
(4.126)
(4.127)
(4.128)
and, in particular,
when the DFTs of N terms are evaluated with a simple radix-2 FFT-type algo-
rithm. We shall see in Chap. 7 that the multidimensional to one-dimensional
DFT mapping obtained with the row-column method is suboptimal and that
better methods can be devised by using polynomial transforms. In order to
support a quantitative comparison of the computational complexities for the
two methods, we present in Table 4.5 the number of real operations for various
complex two-dimensional DFTs calculated by the row-column method and the
Rader-Brenner algorithm.
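As a brief illustration of the row-column mapping (ours, not the book's program), the sketch below computes a two-dimensional DFT of size N_1 × N_2 as N_2 DFTs of the columns followed by N_1 DFTs of the rows, using numpy's FFT for the one-dimensional transforms, and checks the result against a direct two-dimensional transform.

import numpy as np

def dft2_row_column(x):
    """Two-dimensional DFT by the row-column method: transform the columns
    (N2 DFTs of N1 terms), then the rows (N1 DFTs of N2 terms)."""
    cols = np.fft.fft(x, axis=0)      # N2 one-dimensional DFTs of length N1
    return np.fft.fft(cols, axis=1)   # N1 one-dimensional DFTs of length N2

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16)) + 1j * rng.standard_normal((8, 16))
assert np.allclose(dft2_row_column(x), np.fft.fft2(x))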
Table 4.5. Number of nontrivial real operations for complex DFTs of size N x N computed
by the row-column method and the Rader-Brenner algorithm
We shall now discuss an algorithm introduced by Bruun [4.18] which has both
theoretical and practical significance. The practical value of this algorithm
relates to the fact that the DFT of real data can be computed almost entirely
with real arithmetic, thereby simplifying the implementation of DFTs for real
data. We shall present here a modified version of the original algorithm which
will allow us to introduce a polynomial definition of the DFTs that will be used
in later parts of this book.
We consider again an N-point DFT, with N = 2^t,
X_k = Σ_{m=0}^{N−1} x_m W^{mk},   k = 0, ..., N − 1,   j = √−1.   (4.130)
Equations (4.131) and (4.132) are equivalent to (4.130) because the definition of
(4.132) modulo (z − W^k) means that we can replace z by W^k in (4.131). At this
point, the definition of (4.131) modulo (z^N − 1) is unnecessary. However, this
definition is valid because z^N ≡ W^{kN} = 1. We note that the N roots of z^N − 1
are given by W^k for k = 0, ..., N − 1, with
z^N − 1 = Π_{k=0}^{N−1} (z − W^k).   (4.133)
and
(4.135)
Hence, for k even, all the values of W^k correspond to the polynomial z^{N/2} − 1
and we can replace (4.131) and (4.132) with
X_1(z) = Σ_{m=0}^{N/2−1} (x_m + x_{m+N/2}) z^m ≡ X(z) modulo (z^{N/2} − 1)   (4.137)
X_k ≡ X_1(z) modulo (z − W^k),   k even.   (4.138)
Similarly, for k odd, all the values of W^k correspond to the polynomial z^{N/2} + 1
and X_k is computed by
X_2(z) = Σ_{m=0}^{N/2−1} (x_m − x_{m+N/2}) z^m ≡ X(z) modulo (z^{N/2} + 1)   (4.139)
X_k ≡ X_2(z) modulo (z − W^k),   k odd.   (4.140)
The form (4.137–140) can easily be recognized as equivalent to the first stage of a
decimation in frequency FFT decomposition, since (4.137, 138) represent a DFT
of N/2 terms while (4.139, 140) represent an odd DFT of N/2 terms. At this stage,
we depart from the conventional FFT decomposition by noting that any polynomial
of the form z^{4q} + a z^{2q} + 1 factors into two real polynomials,
z^{4q} + a z^{2q} + 1 = (z^{2q} + √(2 − a) z^q + 1)(z^{2q} − √(2 − a) z^q + 1),
with
where B_1 is the set of N/4 values of k_1 such that W^{2k_1+1} is a root of z^{N/4}
+ √2 z^{N/8} + 1 and B_2 is the set of the N/4 other values of k_1. Under these
conditions, the odd DFT represented by (4.139, 140) can be replaced by
(4.146)
and
(4.148)
In the first stage, the reductions modulo (z^{N/2} − 1) and modulo (z^{N/2} + 1) are
computed with N complex additions. In the second stage, the reductions modulo (z^{N/4} − 1) and
modulo (z^{N/4} + 1) are computed with N/2 complex additions. For the reductions
modulo (z^{N/4} + √2 z^{N/8} + 1), we have z^{N/4} ≡ −√2 z^{N/8} − 1 and
z^{3N/8} ≡ z^{N/8} + √2, and for the reductions modulo (z^{N/4} − √2 z^{N/8} + 1), z^{N/4}
≡ √2 z^{N/8} − 1 and z^{3N/8} ≡ z^{N/8} − √2. Since √2 is real, the complex
multiplications are implemented with two real multiplications and the two
reductions are implemented with N/2 real multiplications and 3N/2 additions.
The second stage corresponds to a = 0 in (4.141). In the following stages, a
takes successively the values ±√2, ±√(2 ± √2), ..., and the reductions proceed
similarly, with multiplications by the real factors a and √(2 − a), the only
difference with the second stage being that the multiplications by a are no longer
trivial.
In the last stage, we have two reductions with trivial multiplications by ±1,
±j and two reductions with multiplications by powers of W^{N/8} which require 2
real multiplications for each reduction. The N/2 − 4 other reductions correspond
to multiplications by W^k, W^{−k} which are implemented with 4 real multiplications
and 8 real additions. Hence, with the exception of the last stage, all multiplica-
tions are done with real factors. In practice, the original algorithm proposed by
Bruun uses aperiodic convolutions instead of reductions, and this original ap-
proach requires slightly fewer arithmetic operations than the method described
here.
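The factorization on which the Bruun decomposition rests can be checked numerically. The sketch below (our own illustration, not the book's algorithm) verifies that z^{4q} + a z^{2q} + 1 = (z^{2q} + √(2 − a) z^q + 1)(z^{2q} − √(2 − a) z^q + 1) for |a| ≤ 2, which is the identity quoted above.

import numpy as np

def bruun_factors(q, a):
    """Two real factors of z^(4q) + a*z^(2q) + 1, for |a| <= 2, as coefficient
    arrays in descending powers of z."""
    b = np.sqrt(2.0 - a)
    f1 = np.zeros(2 * q + 1); f1[[0, q, 2 * q]] = [1.0,  b, 1.0]   # z^2q + b z^q + 1
    f2 = np.zeros(2 * q + 1); f2[[0, q, 2 * q]] = [1.0, -b, 1.0]   # z^2q - b z^q + 1
    return f1, f2

q, a = 4, np.sqrt(2.0)            # e.g. a later-stage polynomial with a = sqrt(2)
f1, f2 = bruun_factors(q, a)
product = np.polymul(f1, f2)      # should equal z^(4q) + a z^(2q) + 1
target = np.zeros(4 * q + 1); target[[0, 2 * q, 4 * q]] = [1.0, a, 1.0]
assert np.allclose(product, target)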
The principal use of the Bruun algorithm is in the calculation of the DFT for
real data sequences. In this case, since the coefficients in the first t − 1 stages are
real, these stages are implemented in real arithmetic. Moreover, since the reductions
in the last stage correspond to multiplications of a real data sample by the
complex conjugate coefficients W^k and W^{−k}, the operations in the last stage can
also be viewed as implemented in real arithmetic, with 2 real multiplications and
1 real addition for each reduction modulo (z − W^k) and modulo (z − W^{−k}).
Thus, the Bruun algorithm provides a convenient way of computing the DFT of
a real data vector using only real arithmetic.
It should also be noted that the Bruun algorithm is closely related to the
Rader-Brenner algorithm, since
Hence the various coefficients used in the Bruun algorithm are identical to cor-
responding coefficients in the Rader-Brenner algorithm, and the main difference
relates to multiplications by W^k and W^{−k} in the last stage.
We have seen in Sect. 4.1 that the DFT has the convolution property. This
means that the circular convolution y_l of two sequences h_n and x_m can be computed
by multiplying the DFTs of the two sequences and taking the inverse DFT of the product.
Since a DFT can be computed by the FFT algorithm, this method requires a
number of operations proportional to N log N and, therefore, requires considerably
less computation than the direct method. More precisely, if the DFTs are
calculated via a simple radix-2 algorithm with one of the input sequences fixed,
the circular convolution of length N, with N = 2^t, requires the computation of
two FFTs and N complex multiplications. Consequently, the number of complex
multiplications M required to evaluate the convolution is
Table 4.6. Number of real operations for real circular convolutions computed by the Rader-
Brenner algorithm (2 real convolutions per DFT; one input sequence fixed)
N     Multiplications   Additions   Mult. per point   Add. per point
8 16 64 2.00 8.00
16 44 172 2.75 10.75
32 116 472 3.62 14.75
64 292 1200 4.56 18.75
128 708 2912 5.53 22.75
256 1668 6848 6.52 26.75
512 3844 15744 7.51 30.75
1024 8708 35584 8.50 34.75
2048 19460 79360 9.50 38.75
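As a small illustration of the procedure just described (with one input sequence fixed, so that its transform can be precomputed), the following sketch (ours) computes a circular convolution with numpy FFTs and compares it with the direct sum.

import numpy as np

def circular_convolution_fft(h, x):
    """Circular convolution y_l = sum_n h_n x_{l-n} via DFTs: two FFTs of the
    data plus N pointwise products (the FFT of h would be precomputed when
    h is a fixed filter)."""
    H = np.fft.fft(h)                       # precomputable when h is fixed
    return np.fft.ifft(H * np.fft.fft(x))

rng = np.random.default_rng(2)
h, x = rng.standard_normal(16), rng.standard_normal(16)
direct = np.array([sum(h[n] * x[(l - n) % 16] for n in range(16)) for l in range(16)])
assert np.allclose(circular_convolution_fft(h, x), direct)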
Table 4.7. Number of real operations for real circular convolutions of size N x N computed
by the Rader-Brenner algorithm. (2 real convolutions per DFT; one input sequence fixed)
convolutions simultaneously with separate hardware for the DFT and the inverse
DFT. This can be accommodated using an approach, proposed by McAuliffe
[4.19], which is based on the computation of the DFTs of two real sequences in
a single complex DFT step.
We have already seen in Sect. 4.1 that the DFTs X_k and X'_k of two real
N-point sequences x_m and x'_m can be evaluated as a single complex DFT by
computing the DFT Y_k of the auxiliary sequence x_m + jx'_m. The sequences X_k and
X'_k are then deduced from Y_k by
X_k = (Y_k + Y*_{−k})/2   (4.153)
X'_k = (Y_k − Y*_{−k})/2j   (4.154)
where Y*_{−k} is the complex conjugate of Y_{−k}. Following this procedure, the convolution
y_l of the two real sequences x_m and h_n is computed as shown in Fig. 4.5.
[Fig. 4.5: block diagram — one real sequence applied to the real input and the other to the imaginary input of a single FFT]
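The separation (4.153, 154) is easy to state in code. The following sketch (our own illustration, with our own function name) recovers the DFTs of two real sequences from the single complex DFT of x_m + jx'_m.

import numpy as np

def two_real_dfts(x, xp):
    """DFTs of two real N-point sequences from one complex FFT:
    Y_k = DFT(x + j*xp); X_k = (Y_k + Y*_{-k})/2, X'_k = (Y_k - Y*_{-k})/(2j)."""
    Y = np.fft.fft(x + 1j * xp)
    Yconj_rev = np.conj(np.roll(Y[::-1], 1))   # Y*_{-k}, indices taken modulo N
    return (Y + Yconj_rev) / 2, (Y - Yconj_rev) / (2j)

rng = np.random.default_rng(3)
x, xp = rng.standard_normal(8), rng.standard_normal(8)
X, Xp = two_real_dfts(x, xp)
assert np.allclose(X, np.fft.fft(x)) and np.allclose(Xp, np.fft.fft(xp))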
x'_l = (1/2N) Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} h_n x_m [W^{(m+n)l} + W^{−(m+n)l}]   (4.155)
The sequence x'_l is then used as the imaginary input to the FFT Y_k, and the
transform X'_l of x'_l is obtained by (4.154). Hence
X'_l = (1/2N) Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} h_n x_m Σ_{k=0}^{N−1} [W^{(m+n+l)k} + W^{−(m+n−l)k}]   (4.156)
The terms in the summation over k are different from zero only for m + n + l ≡ 0 modulo N and m + n − l ≡ 0 modulo N. Thus, we have
X'_l = Σ_{n=0}^{N−1} (h_n x_{−n−l} + h_n x_{l−n} − j h_n x_{−n−l} + j h_n x_{l−n})/2,   (4.157)
where a summation of the real and imaginary parts of X'_l obviously yields the
convolution y_l. Clearly, one must account for the FFT computation delay in the
process and, in practice, the imaginary input to the FFT hardware usually corresponds
to the block x_{m−N}, while the real input corresponds to the block x_m.
Hence real convolutions of dimension N can be computed with a single N-point
FFT hardware structure.
It should also be noted that some simplifications of the FFT process are
possible when used to compute convolutions. In particular, when a DFT is
computed by an FFT algorithm, since either the input sequence or the output
(5.1)
(5.2)
which shows that X_k may be computed by convolving the sequence x_n W^{n²/2} with
the sequence W^{−n²/2}, and postmultiplying by W^{k²/2}, as indicated in Fig. 5.1.
With this method, the DFT is computed by N complex premultiplications, N
complex postmultiplications, and one complex finite impulse response (FIR)
filter. The impulse response of the FIR filter is that of a chirp filter, well known in
radar signal processing; hence the name chirp z-transform given to this DFT
computation technique [5.1–3, 5].
[Fig. 5.1: chirp z-transform computation of the DFT — premultiplication of x_n by W^{n²/2}, convolution with W^{−n²/2}, and postmultiplication by W^{k²/2}]
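A compact numerical sketch of the chirp z-transform evaluation of the DFT (our own illustration of the structure of Fig. 5.1, not the book's program): premultiply by W^{n²/2}, convolve with the chirp W^{−n²/2}, postmultiply by W^{k²/2}. The chirp convolution is written here as a direct summation; in practice it would be an FIR filter or a padded FFT.

import numpy as np

def dft_chirp_z(x):
    """DFT via the chirp z-transform:
    X_k = W^{k^2/2} * sum_n (x_n W^{n^2/2}) W^{-(k-n)^2/2}, W = exp(-2j*pi/N)."""
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi / N)
    a = x * W ** (n * n / 2)                                         # premultiplication
    X = np.array([np.sum(a * W ** (-(k - n) ** 2 / 2)) for k in range(N)])  # chirp filter
    return X * W ** (n * n / 2)                                      # postmultiplication (index k)

x = np.random.default_rng(4).standard_normal(12) + 0j
assert np.allclose(dft_chirp_z(x), np.fft.fft(x))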
5.1.1 Real Time Computation of Convolutions and DFTs Using the Chirp z-Transform
y_l = Σ_{n=0}^{N−1} x_n h_{l−n}.   (5.4)
This is done via the chirp z-transform X_k of x_n given by (5.3) and the chirp z-transform
H_k of h_m given by
H_k = W^{k²/2} Σ_{m=0}^{N−1} h_m W^{m²/2} W^{−(k−m)²/2}.   (5.5)
y_l = (1/N) W^{−l²/2} Σ_{k=0}^{N−1} Y_k W^{−k²/2} W^{(l−k)²/2}.   (5.6)
When a DFT is evaluated by the chirp z-transform technique, most of the com-
putation occurs in the chirp filtering process. The z-transform, H(z) of the
impulse response of the chirp filter is given by
H(z) = Σ_{n=0}^{2N−1} W^{−n²/2} z^{−n}.   (5.7)
We assume now that N is a perfect square, with N = N_1², and we change the
index n with
n = n_1 + N_1 n_2,   n_1 = 0, ..., N_1 − 1,   n_2 = 0, ..., 2N_1 − 1.   (5.8)
Hence,
H(z) = Σ_{n_1=0}^{N_1−1} W^{−n_1²/2} z^{−n_1} Σ_{n_2=0}^{2N_1−1} W^{−N_1 n_1 n_2} (−1)^{n_2} z^{−N_1 n_2},   (5.9)
H(z) = Σ_{n_1=0}^{N_1−1} W^{−n_1²/2} z^{−n_1} [(1 − z^{−2N})/(1 + W^{−N_1 n_1} z^{−N_1})].   (5.10)
We return now to the transversal filter form of the chirp z-transform shown in
Fig. 5.1. The tap coefficients of this filter are given by W^{−n²/2} for n = 0, ..., N −
1. These N tap values cannot all be distinct because −n²/2 is defined modulo N,
and the congruence −n²/2 ≡ a modulo N has no solution for certain values of a.
Thus, the chirp filter can be implemented with less than N distinct multipliers
by adding, prior to multiplication, the data samples which correspond to the
same tap value. We note that the number of distinct taps is given by the number
of distinct quadratic residues modulo N [5.2, 7]. It is therefore possible to use the
results of Sect. 2.1.4 to find the number of distinct multipliers required to im-
plement a given chirp filter.
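As an illustration (ours, not the book's), the number of distinct multipliers in the chirp filter can simply be enumerated: the tap value W^{−n²/2} depends only on n² modulo 2N, so the count below lists the distinct residues n² mod 2N over n = 0, ..., N − 1 and checks them against the numerically distinct tap values.

import numpy as np

def distinct_chirp_taps(N):
    """Number of distinct tap values W^{-n^2/2}, n = 0..N-1, W = exp(-2j*pi/N);
    the tap depends only on n^2 modulo 2N."""
    residues = {(n * n) % (2 * N) for n in range(N)}
    taps = np.exp(1j * np.pi * np.arange(N) ** 2 / N)      # W^{-n^2/2}
    assert len(residues) == len({np.round(t, 6) for t in taps})
    return len(residues)

for N in (7, 11, 16, 30):
    print(N, distinct_chirp_taps(N))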
Consider first the case corresponding to N, an odd prime, with N = p. Then,
we know by theorem 2.9 that the number of distinct quadratic residues is given
by
M = (p - 1)/2. (5.12)
For N composite, the two following theorems can be used to find the number of
distinct quadratic residues:
Theorem 5.1: If N is composite, with N = N_1 N_2 ⋯ N_k and N_i = p_i^{t_i}, with the
p_i being distinct primes, the number Q(N) of quadratic residues modulo N is
given by
(5.13)
(5.15)
We have seen that any DFT can be converted into a convolution by the chirp
z-transform algorithm at the cost of 2N complex multiplications performed on the
input and output data samples. We shall see now that DFTs can also be con-
verted into circular convolutions by an entirely different method initially intro-
duced by Rader [5.3]. This method is, in some cases, computationally more
efficient than the chirp z-transform algorithm because it replaces the premulti-
k = 0, ... ,p-l
(5.16)
(5.17)
(5.18)
The indices n and k are defined modulo p. We have seen in Sect. 2.1.3 that, if u is
the set of integers 0, 1, ..., p − 2, there are always primitive roots g defined
modulo p such that g^u modulo p takes once and only once all the values
1, 2, ..., p − 1 when u takes successively the values 0, 1, ..., p − 2. Thus, for
n, k ≠ 0, we can replace n and k by u and v defined by
n ≡ g^u modulo p,   k ≡ g^v modulo p,   u, v = 0, ..., p − 2.   (5.19)
(5.20)
[Fig. 5.2: Rader's algorithm — input permutation of x_n, circular convolution, output permutation]
Let us now consider DFTs of size N = p^c, where p is an odd prime. We have
seen in Sect. 2.1.3 that primitive roots g modulo p^c always exist and that these
primitive roots are of order p^{c−1}(p − 1). Thus, we can expect to convert a DFT
of dimension p^c into a circular convolution of length p^{c−1}(p − 1) plus some
additional terms. To demonstrate this point, we first define a change of index
k_1 = 0, ..., p^{c−1} − 1,   k_2 = 0, ..., p − 1   (5.21)
(5.22)
n_1 = 0, ..., p − 1,   n_2 = 0, ..., p^{c−1} − 1.   (5.23)
(5.24)
k ≢ 0 modulo p   (5.26)
k ≢ 0 modulo p   (5.27)
for n ≡ 0 modulo p,
and
(5.30)
n ≡ g^u modulo p^c,   k ≡ g^v modulo p^c,   u, v = 0, ..., p^{c−1}(p − 1) − 1   (5.31)
and, by substituting the indices defined by (5.31) into (5.27), we obtain the correlation
of dimension p^{c−1}(p − 1)
(5.32)
Thus, the DFT of size p^c has been partitioned into two DFTs of size p^{c−1} and one
correlation of length p^{c−1}(p − 1). The same method can be used recursively to
convert the DFTs of size p^{c−1} into correlations. With this approach, a 9-point
DFT is evaluated with a 3-point DFT and a 6-point convolution, plus a 3-point
DFT where the first output term is not computed. When the 3-point DFTs are
also reduced to correlations, the 9-point DFT is computed with 1 multiplication
by W^0, 2 convolutions of 2 terms and one convolution of 6 terms.
When N is a power of two, the N-point DFT is partitioned into DFTs of size
N/2 by the same method, and the DFT terms corresponding to n and k odd are
computed as a correlation. However, there are no primitive roots for N > 4.
Thus, for N > 4, one uses a product of roots (−1)^{n_1} 3^{n_2}, with n_1 = 0, 1 and
n_2 = 0, ..., (N/4 − 1). These roots generate a two-dimensional correlation of
size 2 × (N/4).
Reducing a DFT into a set of convolutions may become very complex when N
is composite. We shall now introduce a polynomial representation of the DFT
[5.10] which greatly simplifies the formulation of Rader's algorithm. We begin
once again with the N-point DFT
X_k = Σ_{n=0}^{N−1} x_n W^{nk},   k = 0, ..., N − 1,   j = √−1   (5.33)
X(z) ≡ Σ_{n=0}^{N−1} x_n z^n modulo (z^N − 1).   (5.34)
Note that (5.34) does not need to be defined modulo (z^N − 1). This representation
is therefore superfluous at this stage. However, it is valid, since n is defined
modulo N. Equation (5.35) implies that X_k is obtained by substituting W^k for z in
X(z). A simple inspection shows that (5.34, 35) are a valid alternate representation
of (5.33).
We suppose now that N is an odd prime, with N = p. Since the only divisors
of p are 1 and p, z^p − 1 factors into two cyclotomic polynomials, with
z^p − 1 = (z − 1)P(z),   P(z) = z^{p−1} + z^{p−2} + ⋯ + 1.
Note also that the roots of z^p − 1 are given by z = W^k for k = 0, ..., p − 1.
(5.40)
, k ≠ 0.   (5.41)
(5.42)
with
Thus, for k ≠ 0, X_k is a reduced DFT of p terms in which the last input sample
is zero and the first output sample is not computed,
k ≠ 0.   (5.44)
The final result will not be changed if X(z) is multiplied by z^{p−1} modulo (z^p − 1)
and X_2(z) is multiplied by z modulo P(z). In this case, X_2(z) becomes
(5.45)
with
and X_k reduces to
X_k ≡ zX_2(z) modulo (z − W^k) = Σ_{n=1}^{p−1} b_{n−1} W^{nk}.   (5.47)
Fig. 5.3. Polynomial representation of Rader's algorithm for a p-point DFT, p odd prime
[Figure: decomposition of a 9-point DFT — reductions modulo (z^6 + z^3 + 1) and modulo (z^3 − 1); the reduced DFT for k ≢ 0 modulo 3 is computed as a 6-point convolution, and the 3-point DFT is in turn reduced to a 2-point convolution]
We have seen in Chap. 3 that short convolutions can be computed very efficiently
by interpolation techniques. Thus, Rader's algorithm yields efficient implementa-
tions for small DFTs. In practice, we shall not have to use Rader's algorithm for
large DFTs because there are several other methods, to be discussed in the
following sections, which allow one to construct a large DFT from a limited set
of small DFTs. Thus, we shall be concerned here only with the efficient imple-
mentation of Rader's algorithm for small DFTs.
In practice, the convolutions derived from Rader's method are computed by
using the same techniques as those described in Chap. 3. However, some ad-
ditional simplification is possible because here the sequence of coefficients, we',
has special properties.
Consider first the case of a p-point DFT, where p is an odd prime. Then, the
convolution is of length d = p - 1, with d even. Since gft modulo p generates a
cyclic group of order p - 1, we have gP-1 == 1 modulo p. Therefore, we have
g(p-Il 12 == -1 modulo p and
(5.48)
Moreover, since
(5.49)
we have
(5.50)
which indicates an even symmetry about the midpoint for the real coefficients and an
odd symmetry for the imaginary coefficients. Thus, when the coefficient polynomial
Σ_{u=0}^{d−1} W^{g^u} z^u is reduced modulo (z^{d/2} − 1) and modulo (z^{d/2} + 1), all the coefficients
in the reduced polynomials become pure real numbers and pure imaginary
numbers, respectively. This means that all complex multiplications reduce to the
multiplication of a complex number by either a pure real or a pure imaginary
number and are therefore implemented with only two real multiplications. This
feature is common to all convolutions derived by partitioning a DFT via Rader's
algorithm.
When N = p^c, where p is an odd prime, some additional simplification is
possible [5.10]. We give in Sect. 5.5 the most frequently used small DFT algorithms
and, in Table 5.1, the corresponding number of complex arithmetic operations.
Table 5.1. Number of complex operations for short DFTs computed by Rader's algorithm.
Trivial multiplications by ±1, ±j are given between parentheses. The number of real opera-
tions is twice the number of operations given in this table
N     Multiplications   Additions
2 2 (2) 2
3 3 (1) 6
4 4 (4) 8
5 6 (1) 17
7 9 (1) 36
8 8 (6) 26
9 11 (1) 44
16 18 (8) 74
an odd prime, this is done by using the property Σ_{n=0}^{p−2} W^{g^n} = −1, and this gives an
algorithm with a scaling factor equal to p − 1 and with two trivial multiplications
by ±1 instead of one. The corresponding numbers of operations are given
in Sect. 5.5.
For large DFTs, the derivation of Rader's algorithm becomes cumbersome and
computationally inefficient. In this section, we shall discuss an alternative com-
putation technique which allows one to compute a large DFT of size N by
combining several small DFTs of sizes N_1, N_2, ..., N_d which are relatively prime
factors of N. This technique, which is known today as the prime factor FFT, was
proposed by Good [5.11, 12] prior to the introduction of the FFT and has both
theoretical and practical significance. Its main theoretical contribution is in
showing how a one-dimensional DFT can be mapped by simple permutations
into a multidimensional DFT. This approach has also been shown recently
[5.13, 14] to be of practical interest when it is combined with Rader's algorithm.
Furthermore, Good's algorithm provides one of the foundations on which the
very efficient Winograd Fourier transform algorithm is based [5.4].
We first consider the simple case of a DFT X_k of size N, where N is the product
of two mutually prime factors N_1 and N_2
X_k = Σ_{n=0}^{N−1} x_n W^{nk},   k = 0, ..., N − 1   (5.51)
(5.52)
Note that this definition is valid only for (N_1, N_2) = 1. Now, since N_1N_2 ≡ 0
modulo N, substituting n and k defined by (5.53) into (5.51) yields
(5.54)
with
(5.55)
(5.56)
Then, Xk reduces to
(5.57)
(5.58)
k ≡ Σ_{i=1}^{d} (N/N_i) t_i k_i modulo N,   k_i = 0, ..., N_i − 1,   (5.60)
where t_i is given by
(5.61)
It can be verified easily that, in the product nk modulo N, with n and k defined
by (5.59) and (5.60), all cross-products n_i k_u for i ≠ u cancel, so that
nk ≡ Σ_{i=1}^{d} (N/N_i) n_i k_i modulo N,   (5.62)
k_1 = 0, ..., N_1 − 1
k_2 = 0, ..., N_2 − 1.   (5.63)
(5.64)
This illustrates that X_{k_1,k_2} can be evaluated by first computing one DFT of N_1
terms for each value of n_2. This gives N_1 sets of N_2 points X_{n_2,k_1} which are the
input sequences to N_1 DFTs of N_2 points. Thus, with this method, X_{k_1,k_2} is
calculated with N_2 DFTs of length N_1 plus N_1 DFTs of length N_2. A detailed
representation of the computation process is shown in Fig. 5.5 for a 12-point
DFT using the 3-point and 4-point DFT algorithms of Sects. 5.5.2 and 5.5.3.
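A hedged sketch (ours) of the index mappings and the row-column evaluation for N = N_1·N_2 with (N_1, N_2) = 1, using the input map n ≡ N_2 n_1 + N_1 n_2 (mod N) and the output map of (5.60) with the multiplicative inverses t_i; the small DFTs are taken from numpy rather than from the short algorithms of Sect. 5.5.

import numpy as np

def dft_prime_factor(x, N1, N2):
    """DFT of length N = N1*N2, (N1, N2) = 1, by the prime factor mapping:
    the permuted data form an N1 x N2 array whose two-dimensional DFT,
    computed row-column, gives the one-dimensional DFT after reindexing."""
    N = N1 * N2
    t1, t2 = pow(N2, -1, N1), pow(N1, -1, N2)     # N2*t1 = 1 (mod N1), N1*t2 = 1 (mod N2)
    # input permutation: n = N2*n1 + N1*n2 (mod N)
    xs = np.array([[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)])
    # N2 DFTs of length N1, then N1 DFTs of length N2
    Xs = np.fft.fft(np.fft.fft(xs, axis=0), axis=1)
    # output permutation: k = N2*t1*k1 + N1*t2*k2 (mod N)
    X = np.empty(N, dtype=complex)
    for k1 in range(N1):
        for k2 in range(N2):
            X[(N2 * t1 * k1 + N1 * t2 * k2) % N] = Xs[k1, k2]
    return X

x = np.random.default_rng(6).standard_normal(12) + 0j     # 12 = 3 * 4
assert np.allclose(dft_prime_factor(x, 3, 4), np.fft.fft(x))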
In order to evaluate the number of multiplications, M, and additions, A,
which are necessary to compute a DFT by the prime factor algorithm, we assume
that M_1, M_2 and A_1, A_2 are the numbers of multiplications and additions required
to calculate the DFTs of lengths N_1 and N_2, respectively. Then, we have
obviously
(5.65)
Fig. 5.5. Flow graph of a 12-point DFT computed by the prime factor algorithm
(5.66)
The same method can be extended recursively to cover the case of more than
two factors. Thus, for a d-dimensional DFT, we have
(5.67)
and
(5.68)
A = Σ_{i=1}^{d} N A_i / N_i.   (5.69)
Table 5.2. Number of nontrivial real operations for DFTs computed by the prime factor and
Rader algorithms
Note that the data given in Table 5.2 apply to multidimensional DFTs as well
as to one-dimensional DFTs. For example, it may be seen that 100 nontrivial
multiplications are required to compute a DFT of size 30. The number of
multiplications would be the same for DFTs of sizes 2 X 3 X 5, 6 X 5, 10 X 3,
or 2 X 15, since the only difference between the members of this group is the
index mapping. We shall see, however, in Chap. 7, that it is possible to devise
even more efficient computation techniques for multidimensional DFTs. There-
fore, the main utility of the prime factor algorithm resides in the calculation of
one-dimensional DFTs.
We shall now show that the efficiency of the prime factor algorithm can be
improved by splitting the calculations [5.10]. This can be seen by considering
again a two-dimensional DFT of size N_1 × N_2.
k_1 = 0, ..., N_1 − 1
k_2 = 0, ..., N_2 − 1.   (5.70)
In order to simplify the discussion, we shall assume that N_1 and N_2 are both odd
primes. In this case, Rader's algorithm reduces each of the DFTs of size N_1 or N_2
to one multiplication plus one correlation of size N_1 − 1 or N_2 − 1. Therefore,
X_{k_1,k_2} is evaluated via the prime factor algorithm as one DFT of N_1 points, one
correlation of N_2 − 1 points and one correlation of (N_2 − 1) × (N_1 − 1)
points, with
(5.71)
(5.72)
(5.73)
k_1 ≡ h^{v_1} modulo N_1,   n_1 ≡ h^{u_1} modulo N_1,
k_2 ≡ g^{v_2} modulo N_2,   n_2 ≡ g^{u_2} modulo N_2,
u_1, v_1 = 0, ..., N_1 − 2,   u_2, v_2 = 0, ..., N_2 − 2.   (5.74)
(5.75)
Since the conventional prime factor algorithm would have required N_1M_2 +
N_2M_1 multiplications, splitting the computation eliminates N_1 + N_2 − 1 complex
multiplications. When the two-dimensional convolution is reduced modulo
cyclotomic polynomials, the various terms remain half separable and additional
savings can be realized. This can be seen more precisely by representing the
DFT X_{k_1,k_2} defined by (5.70) in polynomial notation, and employing an approach
similar to that described in Sect. 5.2.2
X(z_1, z_2) ≡ Σ_{n_1=0}^{N_1−1} Σ_{n_2=0}^{N_2−1} x_{n_1,n_2} z_1^{n_1} z_2^{n_2} modulo (z_1^{N_1} − 1), (z_2^{N_2} − 1)   (5.76)
(5.77)
Ignoring the permutations and the multiplications by z_1, z_1^{N_1−1}, z_2, and z_2^{N_2−1}, we
can use this polynomial formulation to represent the split prime factor algorithm
very simply, as indicated by the diagram in Fig. 5.6, which corresponds to a
DFT of size 5 × 7. With this method, the main part of the computation corresponds
to the evaluation of a correlation of dimension 4 × 6 which can be
regarded as a polynomial product modulo (z_1^4 − 1), (z_2^6 − 1). Since both z_1^4 − 1
and z_2^6 − 1 are composite, the computation of the correlation of dimension
4 × 6 can be split into that of the cyclotomic polynomials which are factors of
z_1^4 − 1 and z_2^6 − 1 and given by z_1^4 − 1 = (z_1 − 1)(z_1 + 1)(z_1^2 + 1) and
z_2^6 − 1 = (z_2 − 1)(z_2 + 1)(z_2^2 + z_2 + 1)(z_2^2 − z_2 + 1).
[Fig. 5.6: diagram — split prime factor evaluation of a DFT of size 5 × 7, with a 7-point DFT and reductions modulo (z_2^7 − 1)/(z_2 − 1) and modulo (z_2 − 1)]
Fig. 5.7. Calculation of the correlation of 4 × 6 points in the complete split prime factor evaluation of a DFT of size 5 × 7
The complete method is illustrated in Fig. 5.7 with the various reductions
modulo cyclotomic polynomials. Since all the expressions remain half separable
throughout the decomposition, the two-dimensional polynomial products are
computed by the row-column method. Thus, for instance, the two-dimensional
polynomial product modulo (z_1^2 + 1), (z_2^2 + z_2 + 1) is calculated as 2 polynomial
products modulo (z_2^2 + z_2 + 1) plus 2 polynomial products modulo (z_1^2 + 1).
With split-prime factorization, a DFT of size 5 X 7 is evaluated with 76
complex multiplications and 381 additions if the correlation of size 4 X 6 is
computed directly by the row-column method. If the computation of the cor-
relation of size 4 X 6 is reduced to that of polynomial products, as shown in
Fig. 5.7, this correlation is calculated with only 46 multiplications and 150 ad-
ditions instead of 62 multiplications and 226 additions with the row-column
method. Thus, the complete split-prime factor computation reduces the total
number of operations to 60 complex multiplications and 305 complex addi-
tions. By comparison, the conventional prime factor algorithm requires 87
complex multiplications and 299 additions. Thus, splitting the computations
saves, in this case, about 30% of the multiplications.
The same split computation technique can also be applied to sequence
lengths with more than two factors as well as those with composite factors. It
should also be noted that the computational savings provided by the method
increase as a function of the DFT size. Thus, for large DFTs, the split-prime
factor method reduces significantly the number of arithmetic operations at the
expense of requiring a more complex implementation.
We have seen in the preceding section that a composite DFT of size N, where N
is the product of d mutually prime factors N_1, N_2, ..., N_d, can be mapped, by simple index
permutations, into a multidimensional DFT of size N_1 × N_2 × ⋯ × N_d. When
this multidimensional DFT is evaluated by the conventional row-column
method, the algorithm becomes the prime factor algorithm. In the following,
we shall discuss another way of evaluating the multidimensional DFT which is
based on a nesting algorithm introduced by Winograd [5.4, 15, 16]. This method
is particularly effective in reducing the number of multiplications when it is
combined with Rader's algorithm.
kl = 0, ... , NI - 1, k z = 0, ... , N z - 1
(5.80)
(5.81)
X_{n_2} = [x_{0,n_2}  x_{1,n_2}  ⋯  x_{N_1−1,n_2}]^T   (5.82)
(5.83)
Thus, the polynomial DFT defined by (5.81) is a DFT of length N_2 where each
multiplication by W_2^{n_2k_2} is replaced with a multiplication of the vector X_{n_2} by W_2^{n_2k_2}. This last
operation is itself equivalent to a DFT of length N_1 in which each multiplication
by W_1^{n_1k_1} is replaced with a multiplication by W_1^{n_1k_1} W_2^{n_2k_2}.
It can be seen that the Winograd algorithm breaks the computation of a DFT
of size N_1N_2 or N_1 × N_2 into the evaluation of small DFTs of length N_1 and N_2
in a manner which is fundamentally different from that corresponding to the
prime factor algorithm. In fact, the method used here is essentially similar to the
nesting method described by Agarwal and Cooley for convolutions (Sect. 3.3.1).
The Winograd method is particularly interesting when the small DFTs are
evaluated via Rader's algorithm. In this case, the small DFTs are calculated with
A_1^1 input additions, M_1 complex multiplications, and A_1^2 output additions. Thus,
the Winograd algorithm for a DFT of size N_1 × N_2 can be represented as shown
in Fig. 5.8. In this case, if M_2, A_2^1, and A_2^2 are the numbers of complex multiplications
and input and output additions for the DFT of length N_2, the total number of
multiplications M and additions A for the DFT of size N_1 × N_2 becomes
[Fig. 5.8: Winograd nesting of a DFT of size N_1 × N_2 — input additions, multiplications, and output additions operating on N_2 polynomials of N_1 terms]
(5.84)
(5.85)
(5.86)
The same method can be applied recursively to cover the case of more than two
factors. Hence a multidimensional DFT of size N_1 × N_2 × ⋯ × N_d or a one-dimensional
DFT of length N_1N_2 ⋯ N_d, where (N_i, N_k) = 1 for i ≠ k, is computed
with M multiplications and A additions, where M and A are given by
(5.87)
(5.88)
where M_i and A_i are the number of complex multiplications and additions for
an N_i-point DFT.
Note that the number of additions depends upon the order in which the
(5.89)
The first nesting method will require fewer additions than the second nesting
method if N_1A_2 + M_2A_1 < N_2A_1 + M_1A_2 or
(5.90)
Thus, the values (M_i − N_i)/A_i characterize the order in which the various short
algorithms must be nested in order to minimize the number of additions.
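To make the ordering criterion concrete, the following sketch (ours) evaluates the complex operation counts M = M_inner·M_outer and A = N_inner·A_outer + M_outer·A_inner for the two possible orderings of a pair of nested short DFTs, using the counts of Table 5.1 as inputs; the addition formula is the two-factor form implied by the comparison in (5.90) and is stated here as an assumption. For the 35-point DFT it gives 54 multiplications and 333 additions, in agreement with the figures quoted later for conventional nesting.

def nested_counts(N_inner, M_inner, A_inner, N_outer, M_outer, A_outer):
    """Counts when the N_outer-point algorithm operates on vectors of N_inner
    terms: M = M_inner*M_outer, A = N_inner*A_outer + M_outer*A_inner."""
    return M_inner * M_outer, N_inner * A_outer + M_outer * A_inner

# Table 5.1: 5-point DFT, 6 multiplications, 17 additions;
#            7-point DFT, 9 multiplications, 36 additions.
five, seven = (5, 6, 17), (7, 9, 36)
print("5-point inner, 7-point outer:", nested_counts(*five, *seven))   # (54, 333)
print("7-point inner, 5-point outer:", nested_counts(*seven, *five))   # (54, 335)
for N, M, A in (five, seven):                                          # ordering values, cf. (5.90)
    print(N, round((M - N) / A, 3))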
It should be noted that, in (5.84, 86–88), M and A are, respectively, the total
numbers of complex multiplications and additions. However, when the small
DFTs are computed by Rader's algorithm, all complex multiplications reduce
to multiplications of a complex number by a pure real or a pure imaginary
number and are implemented with only two real multiplications. Moreover, some of
the multiplications in the small DFT algorithms are trivial multiplications by
±1, ±j. Thus, if the number of such complex trivial multiplications is L_i for an
N_i-point DFT, then the number of nontrivial real multiplications becomes
(5.91)
We illustrate the Winograd method in Fig. 5.9 by giving the flow diagram cor-
responding to a 12-point DFT using the 3-point and 4-point DFT algorithms
of Sects. 5.5.2 and 5.5.3.
Table 5.3. Number of nontrivial real operations for one-dimensional DFTs computed by the
Winograd Fourier transform algorithm
Fig. 5.9. Flow graph of a 12-point DFT computed by the Winograd Fourier transform algorithm
Table 5.3 lists the number of nontrivial real operations for various DFTs
computed by the Winograd Fourier transform algorithm, with the small DFTs
evaluated by the algorithms given in Sect. 5.5 and calculated with the number of
operations summarized in Table 5.1. It can be seen, by comparison with the
prime factor technique (Table 5.2), that the Winograd Fourier transform algo-
rithm reduces the number of multiplications by about a factor of two for DFTs
of length 840 to 2520, while requiring a slightly larger number of additions. If
we now compare with the conventional FFT method, using, for instance, the
Rader-Brenner algorithm (Table 4.3), we see that the Winograd Fourier trans-
form algorithms reduce the number of multiplications by a factor of 2 to 3, with
a number of additions which is only slightly larger. These results show that the
principal contribution of the Winograd Fourier transform algorithm concerns a
reduction in number of multiplications. It should be noted, however, that the
short DFT algorithms can also be redesigned in order to minimize the number of
additions at the expense of a larger number of multiplications. Thus, the
Winograd Fourier transform approach is very flexible and allows one to adjust
(5.92)
technique, and exactly the same number of additions. For larger DFTs, however, the
prime factor method is better than the Winograd method from the standpoint of
the number of additions. Thus, for large DFTs, it may be advantageous to combine
the two methods when the relative cost of multiplications and additions is
about the same [5.13]. For example, a 1008-point DFT could be computed by
calculating a DFT of size 16 × 63 via the prime factor technique and calculating
the DFTs of 63 terms by the Winograd algorithm. In this case, the DFT would
be computed with 4396 real multiplications and 31852 real additions, as opposed
to 3548 real multiplications and 34668 additions for the Winograd method and
5804 multiplications and 29548 additions for the prime factor technique. Thus,
a combination of the two methods allows one to achieve a better balance be-
tween the number of additions and the number of multiplications.
We have seen in Sect. 5.3.3 that a multidimensional DFT can be converted into
a set of one-dimensional and multidimensional convolutions by a sequence of
reductions if the small DFTs are computed by Rader's algorithm. In particular,
if N_1 and N_2 are odd primes, a DFT of size N_1 × N_2 can be partitioned into a
DFT of length N_1 plus one convolution of length N_2 − 1 and another of size
(N_1 − 1) × (N_2 − 1). This is shown in Fig. 5.6 for a DFT of size 7 × 5. Thus,
the Winograd algorithm can be regarded as equivalent to converting a DFT
into a set of one-dimensional and multidimensional convolutions, and computing
the multidimensional convolutions by a nesting algorithm (Sect. 3.3). Consequently,
it can be inferred from Sect. 3.3.2 that a further reduction in the
number of additions could be obtained by replacing the conventional nesting of
convolutions by a split nesting technique. With such an approach, the short
DFTs are reduced to convolutions by the Rader algorithm discussed in Sect.
5.2 and the convolutions are in turn reduced into polynomial products, defined
modulo cyclotomic polynomials.
In practice, however, this method cannot be directly applied without minor
modifications. Alert readers will notice that the number of additions corresponding
to some of the short DFT algorithms in Sect. 5.5 does not tally with the
number of operations derived directly from the reduction into convolutions
discussed in Sect. 5.2. A 7-point DFT, for instance, is computed by the algorithm
described in Sect. 5.5.5 with 9 multiplications and 36 additions. The same DFT,
evaluated by Rader's algorithm according to Fig. 5.3, however, requires 12
additions for the reductions modulo (z − 1) and (z^7 − 1)/(z − 1) and 34 additions
for the 6-point convolution, a total of 46 additions. This difference is due
to the fact that the reductions can be partly embedded in the calculation of the
convolutions. In the case of an N-point DFT, with N an odd prime, this procedure
reduces the number of operations to one convolution of length N − 1 plus 2
additions instead of one (N − 1)-point convolution plus 2(N − 1) additions.
Thus, direct application of the split nesting algorithm is not attractive because
it reduces the number of additions in algorithms which already have an inflated
number of additions.
In order to overcome this difficulty, one approach consists in expressing the
polynomial products modulo irreducible cyclotomic polynomials of degree
higher than 1 in the optimum short DFT algorithms of Sect. 5.5. This is done
in Sects. 5.5.4, 5, 7, and 8, respectively, for DFTs of lengths 5, 7, 9, and 16.
With this procedure, a 5-point DFT breaks down (Fig. 5.10) into 14 input and
output additions, 3 multiplications, and one polynomial product modulo (z^2 +
1), while the 7-point DFT reduces into 30 input and output additions, 3 multiplications,
and the polynomial products modulo (z^2 + z + 1) and modulo
(z^2 − z + 1). Therefore, nesting these two DFTs to compute a 35-point DFT
requires 9 multiplications, 3 polynomial products modulo (z_1^2 + 1), 3 polynomial
products modulo (z_2^2 + z_2 + 1) and modulo (z_2^2 − z_2 + 1), and the polynomial
products modulo (z_1^2 + 1), (z_2^2 + z_2 + 1) and modulo (z_1^2 + 1), (z_2^2 − z_2 + 1).
This defines an algorithm with 54 complex multiplications and 305 additions,
as opposed to 54 multiplications and 333 additions for conventional nesting.
[Fig. 5.10: 5-point DFT with embedded polynomial product — 3 multiplications and one polynomial product modulo (z^2 + 1) (3 multiplications, 3 additions)]
In Table 5.4, we give the number of real additions for various DFTs computed
by split nesting. The nesting and split nesting techniques require the same
number of multiplications, and it can be seen, by comparing Tables 5.3 and 5.4,
Table 5.4. Number of real additions for one-dimensional DFTs computed by the Winograd
Fourier transform algorithm and split nesting
that, for large DFTs, the split nesting method eliminates 10 to 15% of the
additions required by the conventional nesting approach. Additional reduction
is obtained when the split nesting method is used to compute large multidimensional
DFTs. The implementation of the split nesting technique can be greatly
simplified by storing composite split nested DFT algorithms, thus avoiding the
complex data manipulations required by split nesting. With this approach, a
504-point DFT, for instance, can be computed by conventionally nesting an
8-point DFT with a 63-point DFT algorithm that has been optimized by split
nesting.
Until now, we have considered the use of the Winograd nesting method only
for the computation of DFTs of size N_1N_2 ⋯ N_d or N_1 × N_2 × ⋯ × N_d,
where the various factors N_i are mutually prime in pairs. Note, however, that the
condition (N_i, N_u) = 1 for i ≠ u is necessary only to convert one-dimensional
DFTs into multidimensional DFTs by Good's algorithm. Thus, the Winograd
nesting algorithm can also be employed to compute any multidimensional DFT
of size N_1 × N_2 × ⋯ × N_d where the factors N_i are not necessarily mutually
prime in pairs. If each dimension N_i is composite, with N_i = N_{i,1} N_{i,2} ⋯ N_{i,e_i}
and (N_{i,u}, N_{i,v}) = 1 for u ≠ v, the index change in Good's algorithm maps the
d-dimensional DFT into a multidimensional DFT with e_1 + e_2 + ⋯ + e_d dimensions.
In order to illustrate the impact of this approach, we give in Table 5.5 the
number of real operations for various multidimensional DFTs computed by the
Winograd algorithm. It can be seen that the Winograd method is particularly
effective for this application, since a DFT of size 1008 × 1008 is calculated with
only 6.25 real multiplications per point, or about 2 complex multiplications per
point. Moreover, for large multidimensional DFTs, the split nesting technique
gives a significant additional reduction in the number of additions. For a DFT of size
1008 × 1008, for instance, split nesting reduces the number of real additions
Table 5.5. Number of nontrivial real operations for multidimensional DFTs computed by the
Winograd Fourier transform algorithm
are significantly longer than for addition. Another important factor concerns the
relative size of FFT and WFTA programs. When the FFT programs are built
around a single radix FFT algorithm (usually radix-2 or radix-4 FFT algorithm),
the computation proceeds by repetitive use of a subroutine implementing the
FFT butterfly operation. Thus, the FFT programs can be very compact and es-
sentially independent of the DFT size, provided that the DFT size is a power of
the radix. By contrast, the WFTA uses different computation kernels for each
DFT size and each of these is an explicit description of a particular small DFT
algorithm, as opposed to the recursive, algorithmic structure used in the FFT.
Thus, WFTA programs usually require more instructions than FFT programs
for DFTs of comparable size and they must incorporate a subroutine which
selects the proper computation kernels as a function of the DFT size. This
feature prompts one to organize the program structure in two steps; a genera-
tion step and an execution step. The program can then be designed in such a way
that most bookkeeping operations such as data routing and kernel selection, as
well as precomputation of the multipliers, are done within the generation step
and therefore do not significantly impact the execution time.
The WFTA program is divided into five main parts: input data reordering,
input additions, multiplications, output additions, and output data reordering.
The input and output data reordering requires a number of modular multiplica-
tions and additions which can be eliminated by precomputing reordering vectors
during the generation step. These stored reordering vectors may, then, be used
to rearrange the input and output data during the execution step. The input
additions, except for the innermost factor, correspond to a set of additions that
is executed for each factor N_i, and operates on N_i input arrays to produce M_i
output arrays. Since M_i is generally larger than N_i, the calculations cannot
generally be done "in-place". Thus, the generated result of each stage cannot be
stored over the input data sequence to the stage. However, it is always possible
to assign N_i input storage locations from the M_i output storage locations and,
since M_i is not much larger than N_i, this results in an algorithm that is not
significantly less efficient than an in-place algorithm, as far as memory utilization
is concerned. The calculations corresponding to the innermost factor N_d
are executed on scalar data and include all the multiplications required by the
algorithm to compute the N-point DFT. If M is the total number of multiplications
corresponding to the N-point DFT and M_d is the number of multiplications
corresponding to the N_d-point small DFT algorithm, the calculations for the innermost
factor N_d reduce to M/M_d DFTs of N_d points. The M_d coefficients here
are those of the N_d-point DFT, multiplied by the coefficients of the other small
DFT algorithms. In order to avoid recalculating this set of M_d coefficients for
each of the M/M_d DFTs of N_d points, one is generally led to precompute, at
generation time, a vector of M coefficients divided into M/M_d sets of M_d coefficients.
Since these coefficients are simply real or imaginary, a total of M real memory
locations are required, which is significantly less than for an FFT algorithm in
which the coefficients are precomputed.
From this, we can conclude that, although the WFTA is not an in-place
algorithm, the total memory requirement for storing data and coefficients can
be about the same as that of the FFT algorithm. The program sizes will generally
be larger for the WFTA than for the FFT, but remain reasonably small, because
the number of instructions grows approximately as the sum of the number of
additions corresponding to each small algorithm. Thus, if N = N_1N_2 ⋯ N_i
⋯ N_d and if A_i is the number of additions corresponding to an N_i-point DFT,
Σ_i A_i is a rough measure of program size. Σ_i A_i grows very slowly with N, as can
be verified by noting that Σ_i A_i = 25 for N = 30 and Σ_i A_i = 154 for N = 1008.
Thus WFTA program size and memory requirements can remain reasonably
small, even for large DFTs, provided that the programs are properly designed to
work on array organized data. Hence, the WFTA seems to be particularly well
suited for systems, such as APL, which have been designed to process array data
efficiently.
Another important issue concerns the computational noise of the Winograd
algorithms, and only scant information is currently available on this topic. The
preliminary results given in [5.18] tend to indicate that proper scaling at each
stage is more difficult than for the FFT because of the fact that all moduli are
different and not powers of 2. In the case of fixed point data, this significantly
impacts the signal-to-noise ratio of the WFTA, and thus, the WFTA generally
requires about one or two more bits for representing the data to give an error
similar to the FFT.
This section lists the short DFT algorithms that are most frequently used with
the prime factor method or the WFTA. These algorithms compute short N-point
one-dimensional DFTs of a complex input sequence x_n
X_k = Σ_{n=0}^{N−1} x_n W^{nk},   k = 0, ..., N − 1,   j = √−1   (5.93)
The operations are executed in the order t_i, m_i, s_i, X_i, with indices in natural
order. For DFTs of lengths 5, 7, 9, 16, the operations can also be executed using
the form shown in Sects. 5.5.4, 5, 7, and 8, which embeds the various polynomial
products.
The figures between parentheses indicate trivial multiplications by ±1, ±j.
At the end of each algorithm description for N = 3, 5, 7, 9, we give the number
of operations for the corresponding algorithm in which the number of nontrivial
multiplications is minimized and the output is scaled by a constant factor.
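As an illustration of how such a listing translates into code, here is a sketch of the standard 3-point algorithm in the conventions just described (u = 2π/3; m0 is the trivial multiplication by 1). It is our own rendering, checked against a direct DFT, not a transcription of the printed listing.

import numpy as np

def dft3(x0, x1, x2):
    """3-point DFT with 2 nontrivial multiplications and 6 additions (u = 2*pi/3)."""
    u = 2 * np.pi / 3
    t1 = x1 + x2
    m0 = 1 * (x0 + t1)                    # trivial multiplication by 1
    m1 = (np.cos(u) - 1) * t1
    m2 = 1j * np.sin(u) * (x2 - x1)
    s1 = m0 + m1
    return m0, s1 + m2, s1 - m2           # X0, X1, X2

x = np.random.default_rng(7).standard_normal(3) + 0j
assert np.allclose(dft3(*x), np.fft.fft(x))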
X2 = m1
X3 = m2 − m3
u = 2π/5
t1 = x1 + x4    t2 = x2 + x3    t3 = x1 − x4    t4 = x3 − x2
t5 = t1 + t2
m0 = 1·(x0 + t5)
m1 = [(cos u + cos 2u)/2 − 1]·t5
m2 = [(cos u − cos 2u)/2]·(t1 − t2)
Polynomial product modulo (z^2 + 1):
m3 = −j(sin u)·(t3 + t4)
m4 = −j(sin u + sin 2u)·t4
m5 = j(sin u − sin 2u)·t3
s1 = m0 + m1    s2 = s1 + m2    s4 = s1 − m2
s3 = m3 − m4
s5 = m3 + m5
X0 = m0    X3 = s4 − s5
X1 = s2 + s3    X4 = s2 − s3
X2 = s4 + s5
Corresponding algorithm with scaled output:
6 multiplications (2), 21 additions, scaling factor: 4
t8 = t1 − t3    t9 = t3 − t2
t11 = t7 − t5    t12 = t6 − t7
t13 = −t8 − t9
m2 = [(2cos u − cos 2u − cos 3u)/3]·t8
m3 = [(cos u − 2cos 2u + cos 3u)/3]·t9
m4 = [(cos u + cos 2u − 2cos 3u)/3]·t13
Polynomial product modulo (z^2 + z + 1)
m5 = −j[(sin u + sin 2u − sin 3u)/3]·t10
X0 = m0    X1 = s5 + s8    X2 = s6 + s9
X4 = s7 + s10    X5 = s6 − s9    X6 = s5 − s8
t7 = t1 + t2    t8 = t3 + t5
X0 = m0    X1 = s1 + s3    X2 = m2 + m5
X3 = s2 − s4    X4 = m1    X5 = s2 + s4
X6 = m2 − m5    X7 = s1 − s3
t7 = x7 − x2    t8 = x3 − x6    t9 = x4 − x5
t10 = t6 + t7 + t9    t11 = t1 − t2    t12 = t2 − t4
t13 = t7 − t6    t14 = t7 − t9
m0 = 1·(x0 + t3 + t5)
m1 = (3/2)·t3
m2 = −t5/2
t16 = −t13 + t14
m8 = j sin u·t13
m9 = j sin 4u·t14
m10 = j sin 2u·t16
Polynomial product modulo (z^2 − z + 1)
s4 = m0 + m2 + m2
s6 = s4 + m2    s7 = s5 − s0
s8 = s1 + s7    s9 = s0 − s1 + s7
X3 = s6 + m6    X4 = s9 + s12    X5 = s9 − s12
X6 = s6 − m6    X7 = s8 + s11    X8 = s7 − s10
m6 = cos 2u·(t4 − t6)
s1 = m3 + m5    s2 = m3 − m5    s3 = m11 + m13
s4 = m13 − m11    s5 = m4 + m6    s6 = m4 − m6
s9 = s5 + s7    s10 = s5 − s7    s11 = s6 + s8
s12 = s6 − s8
X0 = m0    X1 = s9 + s17    X2 = s1 + s3
X3 = s12 − s20    X4 = m2 + m10    X5 = s11 + s19
X15 = s9 − s17
6. Polynomial Transforms
(6.2)
H_m(z) = Σ_{n=0}^{N−1} h_{n,m} z^n,   m = 0, ..., N − 1   (6.3)
X_r(z) = Σ_{s=0}^{N−1} x_{s,r} z^s,   r = 0, ..., N − 1,   (6.4)
l = 0, ..., N − 1   (6.5)
z^p − 1 = (z − 1)P(z)   (6.6)
Since Y_l(z) is defined modulo (z^p − 1), it can be computed by reducing H_m(z)
and X_r(z) modulo (z − 1) and P(z), computing the polynomial convolutions
Y_{1,l}(z) ≡ Y_l(z) modulo P(z) and Y_{2,l} ≡ Y_l(z) modulo (z − 1) on the reduced
polynomials, and reconstructing Y_l(z) by the Chinese remainder theorem (Sect.
2.2.3) with
(6.8)
S_1(z) ≡ 1,   S_2(z) ≡ 0   modulo P(z)
S_1(z) ≡ 0,   S_2(z) ≡ 1   modulo (z − 1)   (6.9)
with
and
H_{2,m} = Σ_{n=0}^{p−1} h_{n,m}   (6.13)
X_{2,r} = Σ_{s=0}^{p−1} x_{s,r}.   (6.14)
X_{1,r}(z) ≡ (1/p) Σ_{k=0}^{p−1} X̄_k(z) z^{−rk} modulo P(z),
We shall now establish that the polynomial transforms support circular convolution,
and that (6.17) is the inverse of (6.15). This can be demonstrated by calculating
the transforms H̄_k(z) and X̄_k(z) of H_{1,m}(z) and X_{1,r}(z) via (6.15), multiplying
H̄_k(z) by X̄_k(z) modulo P(z), and computing the inverse transform Q_l(z) of H̄_k(z)
X̄_k(z). This can be denoted as
Q_l(z) ≡ (1/p) Σ_{m=0}^{p−1} Σ_{r=0}^{p−1} H_{1,m}(z) X_{1,r}(z) Σ_{k=0}^{p−1} z^{qk} modulo P(z)
with q = m + r − l. Let S = Σ_{k=0}^{p−1} z^{qk}. Since z^p ≡ 1, the exponents of z are defined
modulo p. For q ≡ 0 modulo p, S = p. For q ≢ 0 modulo p, the set of exponents
qk defined modulo p is a simple permutation of the integers 0, 1, ..., p − 1. Thus,
S ≡ Σ_{k=0}^{p−1} z^k ≡ P(z) ≡ 0 modulo P(z). This means that the only nonzero case corresponds
to q ≡ 0 or r ≡ l − m modulo p and that Q_l(z) reduces to the circular
convolution
(6.19)
The demonstration that the polynomial transform (6.15) and the inverse polynomial
transform (6.17) form a transform pair follows immediately by setting
H_{1,m}(z) = 1 in (6.19).
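The transform pair and the convolution property can be checked numerically. The sketch below (our own illustration for a small odd prime p, with polynomials held as coefficient arrays reduced modulo P(z) = 1 + z + ⋯ + z^{p−1}) computes the polynomial transform, the pointwise products modulo P(z), and the inverse transform, and compares the result with the directly evaluated circular polynomial convolution; the Chinese remainder reconstruction of the full p × p convolution is omitted.

import numpy as np

p = 5   # odd prime; z^p == 1 modulo P(z) = 1 + z + ... + z^(p-1)

def reduce_mod_P(a):
    """Reduce a coefficient array (ascending powers of z) modulo P(z)."""
    out = np.zeros(p - 1)
    for m, c in enumerate(np.asarray(a, dtype=float)):
        r = m % p                       # z^p == 1 (mod P(z))
        if r == p - 1:                  # z^(p-1) == -(1 + z + ... + z^(p-2))
            out -= c
        else:
            out[r] += c
    return out

def mulz(a, e):
    """Multiply a reduced polynomial by z^e modulo P(z)."""
    return reduce_mod_P(np.concatenate([np.zeros(e % p), a]))

def polymul(a, b):
    return reduce_mod_P(np.convolve(a, b))

def ptransform(polys, sign=1):
    """Length-p polynomial transform with root z (sign=-1 gives the inverse,
    up to the 1/p factor)."""
    return [sum(mulz(polys[r], sign * r * k) for r in range(p)) for k in range(p)]

rng = np.random.default_rng(0)
H = [reduce_mod_P(rng.integers(-3, 4, p)) for _ in range(p)]    # H_{1,m}(z)
X = [reduce_mod_P(rng.integers(-3, 4, p)) for _ in range(p)]    # X_{1,r}(z)

Hbar, Xbar = ptransform(H), ptransform(X)
Y = [y / p for y in ptransform([polymul(h, x) for h, x in zip(Hbar, Xbar)], sign=-1)]

# direct circular polynomial convolution modulo P(z) for comparison
Ydir = [sum(polymul(H[m], X[(l - m) % p]) for m in range(p)) for l in range(p)]
assert all(np.allclose(a, b) for a, b in zip(Y, Ydir))
print("polynomial-transform convolution property verified for p =", p)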
Using the foregoing method, Y_{1,l}(z) is computed with three polynomial
transforms and p polynomial multiplications H̄_k(z)X̄_k(z) defined modulo P(z).
In many digital filtering applications, the input sequence h_{n,m} is fixed and its
transform H̄_k(z) can be precomputed. In this case, only two polynomial transforms
are required, and the Chinese remainder reconstruction can also be
simplified by noting, with (2.87), that
[Fig. 6.1: computation of a two-dimensional convolution of size p × p by polynomial transforms — polynomial transform modulo P(z) (size p, root z), p polynomial multiplications modulo P(z), and the inverse polynomial transform modulo P(z)]
reconstruction are calculated, respectively, by (6.14) and (6.22) without the use
of multiplications. The reductions modulo P(z) also require no multiplications
because z^{p−1} ≡ −z^{p−2} − z^{p−3} − ⋯ − 1 modulo P(z), which gives for X_{1,r}(z)
(6.23)
x_{1,p−1,r} = 0   (6.25)
X_{1,r}(z) z^q modulo (z^p − 1) = Σ_{s=0}^{p−q−1} x_{1,s,r} z^{s+q} + Σ_{s=p−q}^{p−1} x_{1,s,r} z^{s+q−p}   (6.26)
X_{1,r}(z) z^q modulo P(z) = Σ_{s=0}^{p−2} (x_{1,⟨s−q⟩,r} − x_{1,⟨p−q−1⟩,r}) z^s,   x_{1,p−1,r} = 0,   (6.27)
where the symbols ⟨ ⟩ denote s − q and p − q − 1 modulo p. Thus, the polynomial
transforms are evaluated with additions and simple rotations of p-word
polynomials, and the only multiplications required to compute the two-dimensional
convolution y_{u,l} correspond to the calculation of one convolution of
length p and to the evaluation of the p one-dimensional polynomial products
T_1(z)H̄_k(z)X̄_k(z) defined modulo P(z). This means that, if the polynomial products
and the convolution are evaluated with the minimum multiplication algorithms
defined by theorems 2.21 and 2.22, the convolution of size p × p, with p prime,
is computed with only 2p^2 − p − 2 multiplications. It can indeed be shown
that this is the theoretical minimum number of multiplications for a convolution
of size p × p, with p prime [6.3].
H_m(z) = Σ_{n=0}^{b−1} h_{n,m} z^n   (6.32)
X_r(z) = Σ_{s=0}^{b−1} x_{s,r} z^s   (6.33)
H̄_k(z) ≡ Σ_{m=0}^{N−1} H_m(z)[G(z)]^{mk} modulo P(z),   (6.34)
where b is the degree of P(z). Y_l(z) is obtained by evaluating the inverse transform
of H̄_k(z)X̄_k(z) by
(6.35)
The polynomial transforms have the same structure as DFTs, but with complex
exponential roots of unity replaced by polynomials G(z) and with all operations
defined modulo P(z). Therefore, these transforms have the same general proper-
ties as the DFTs (Sect. 4.1.1), and in particular, they have the linearity property,
with
The degree of each cyclotomic polynomial P_{e_i}(z) is φ(e_i), where φ(e_i) is Euler's
totient function (Sect. 2.1.3). Since the various polynomials P_{e_i}(z) are irreducible,
the polynomial convolution defined modulo (z^N − 1) can be computed separately
modulo each polynomial P_{e_i}(z), with reconstruction of the final result by
the Chinese remainder theorem. We show first that there is always a polynomial
transform of dimension N and root z which supports circular convolution when
defined modulo P_{e_d}(z), the largest cyclotomic polynomial factor of z^N − 1.
This can be seen by noting that, since z^N ≡ 1 modulo (z^N − 1) and P_{e_d}(z) is a
factor of z^N − 1, z^N ≡ 1 modulo P_{e_d}(z). Thus, condition (6.28) is satisfied. Conditions
(6.29) are also satisfied because N always has an inverse in ordinary
arithmetic (coefficients in the field of rationals) and because z^{−1} ≡ z^{N−1} modulo
P_{e_d}(z). Consider now the conditions (6.30). Since z^N ≡ 1 modulo P_{e_d}(z), we have
S ≡ Σ_{k=0}^{N−1} z^{qk} ≡ N for q ≡ 0 modulo N. For q ≢ 0 modulo N, we have
The complex roots of z^q − 1 are powers of e^{−j2π/q}, while the complex roots of
P_{e_d}(z) are powers of e^{−j2π/N}. Thus, for (q, N) = 1 these complex roots are different
and z^q − 1 is relatively prime to P_{e_d}(z), which implies by (6.39) that S ≡ 0.
For (q, N) ≠ 1, q can always, without loss of generality, be considered as a
factor of N. Then, φ(N) > φ(q), and the largest polynomial factors of z^N − 1
and of z^q − 1 are, respectively, the polynomials P_{e_d}(z) and Q(z) of degree φ(N)
and φ(q). These polynomials are necessarily different because their degrees φ(N)
and φ(q) are different. Moreover, Q(z) cannot be a factor of P_{e_d}(z) because
P_{e_d}(z) is irreducible. Thus, z^q − 1 ≢ 0 modulo P_{e_d}(z) and S ≡ 0 modulo P_{e_d}(z),
which completes the proof that conditions (6.30) are satisfied.
Consequently, the convolution y_{u,l} of dimension N × N is computed by
ordering the input sequence into N polynomials of N terms which are reduced
modulo P_{e_d}(z) and modulo P_1(z) = Π_{i=1}^{d−1} P_{e_i}(z). The output samples y_{u,l} are derived
Then,
and
[Fig. 6.2 diagram: reduction modulo P(z) = (z^{p²} − 1)/(z^p − 1), polynomial transforms and inverse polynomial transforms modulo P(z) (roots z and z^p), and the associated polynomial multiplications modulo P(z)]
Fig. 6.2. First stage of the computation of a convolution of size p² × p², p odd prime
As a second example, Fig. 6.3 gives the first stage of the calculation of a
convolution of size 2^t × 2^t by polynomial transforms. In this case, the computation
is performed in t − 1 stages with polynomial transforms defined modulo
P_{t+1}(z) = z^{2^{t−1}} + 1, P_t(z) = z^{2^{t−2}} + 1, ..., P_2(z) = z^2 + 1. These polynomial
[Fig. 6.3 diagram: ordering of the input into polynomials, reductions modulo P_{t+1}(z) = z^{2^{t−1}} + 1 and modulo (z^{2^{t−1}} − 1), a polynomial transform of size 2^t with root z, polynomial multiplications modulo P_{t+1}(z), inverse polynomial transforms, a convolution of size 2^{t−1} × 2^{t−1}, and reordering with Chinese remainder reconstruction]
Fig. 6.3. First stage of the computation of a convolution of size 2^t × 2^t by polynomial transforms
transforms are particularly interesting because, due to their power-of-two sizes,
they can be computed with a reduced number of additions by a radix-2 FFT-type
algorithm.
(6.43)
Condition (6.29) is also obviously satisfied, since N_1, N_2 and W, G(z) have inverses.
In order to meet condition (6.30), we consider S, with
S ≡ Σ_{k=0}^{N−1} [WG(z)]^{qk} modulo P(z).   (6.44)
Since [WG(z)]^N ≡ 1 modulo P(z), the exponents qk are defined modulo N. Thus,
S ≡ N for q ≡ 0 modulo N. For q ≢ 0 modulo N, we can always map S into a
two-dimensional sum, because N_1 and N_2 are mutually prime. This can be done
with
k = N_1 k_2 + N_2 k_1,   k_1 = 0, ..., N_1 − 1,   k_2 = 0, ..., N_2 − 1   (6.45)
S ≡ Σ_{k_2=0}^{N_2−1} W^{qN_1k_2} Σ_{k_1=0}^{N_1−1} G(z)^{qN_2k_1} modulo P(z).   (6.46)
The existence of the two transforms of lengths N_1 and N_2 with roots G(z) and W
implies that S ≡ 0 for q ≢ 0 modulo N_1 or q ≢ 0 modulo N_2, and therefore
that S ≡ 0 for q ≢ 0 modulo N, which verifies that (6.30) is satisfied.
When N_1 is odd, the condition (N_1, N_2) = 1 implies that it is always possible
to increase the length of the polynomial transforms to N_1N_2, with N_2 = 2^t. The
new transforms will usually require some multiplications since the roots WG(z)
are no longer simple. We note, however, that this method is particularly useful
to compute convolutions of sizes 2N_1 × 2N_1 and 4N_1 × 4N_1, because in these
(6.47)
k = 0, ... ,p - 2 (6.49)
(6.50)
Table 6.2. Number of additions for the computation of reductions, Chinese remainder operations, and polynomial transforms. p odd prime
Since R_0(z) = Σ_{r=1}^{p-1} X_{1,r}(z), R_0(z) is computed with (p - 2)(p - 1) additions. For k ≠ 0,

≡ - Σ_{k=0}^{p-2} R_k(z) modulo P(z),   (6.54)
X̄_k(z) ≡ Σ_{r=0}^{N-1} X_{1,r}(z) z^{rk} modulo P_{t+1}(z)

and

P_{t+1}(z) = z^{2^{t-1}} + 1.   (6.56)
We have seen in the preceding sections that polynomial transforms map ef-
ficiently two-dimensional circular convolutions into one-dimensional polynomial
products and convolutions. When the polynomial transforms are properly
selected, this mapping is achieved without multiplications and requires only a
limited number of additions. Thus, when a two-dimensional convolution is
evaluated by polynomial transforms, the processing load is strongly dependent
upon the efficiency of the algorithms used for the calculation of polynomial
products and one-dimensional convolutions.
One approach that can be employed for evaluating the one-dimensional
convolutions involves the use of one-dimensional transforms that support circu-
lar convolution, such as DFTs and NTTs. These transforms can also be used
to compute polynomial products modulo cyclotomic polynomials P_N(z) by noticing that, since P_N(z), defined by (6.38), is a factor of z^N - 1, all computations can be carried out modulo (z^N - 1), with a final reduction modulo P_N(z). With this method, the calculation of a polynomial product modulo P_N(z) is replaced by that of a polynomial product modulo (z^N - 1), which is a convolution of length N. Hence, the two-dimensional convolution is completely mapped
[Figure: polynomial transform modulo (z^p - 1) of size p with root z, polynomial multiplications, inverse polynomial transform modulo (z^p - 1) of size p with root z, and reduction modulo (z^p - 1)/(z - 1)]
The problems associated with the use of these transforms, such as roundoff errors for DFTs or modular arithmetic for NTTs, are then limited to only a part of the total computation process.
We have seen, however, in Chap. 3, that the methods based on interpolation and
on the Chinese remainder theorem yield more efficient algorithms than the DFTs
or NTTs for some convolutions and polynomial products. It is therefore often
advantageous to consider the use of such algorithms in combination with poly-
nomial transforms.
With this method, each convolution or polynomial product algorithm used
in a given application must be specifically programmed, and it is desirable to
use only a limited number of different algorithms in order to restrict total
program size. This can be done by computing the one-dimensional convolutions
[Figure: p reductions modulo P(z) = (z^p - 1)/(z - 1), a polynomial transform modulo P(z) of size p with root z, p polynomial multiplications modulo P(z) plus one further multiplication, and an inverse polynomial transform modulo P(z) of size p with root z]
case, we have
(6.59)
Since N_1 and N_2 are mutually prime, the indices l, m, n, and u can be mapped into two sets of indices l_1, m_1, n_1, u_1 and l_2, m_2, n_2, u_2 by use of an approach based on permutations (Sect. 2.1.2) to obtain
and
x_{N_2(u_1-n_1)+N_1(u_2-n_2), N_2(l_1-m_1)+N_1(l_2-m_2)}.   (6.61)
Y_{u_1,l_1}(z_1, z_2) ≡ Σ_{m_1=0}^{N_1-1} Σ_{n_1=0}^{N_1-1} H_{n_1,m_1}(z_1, z_2) X_{u_1-n_1, l_1-m_1}(z_1, z_2)
(6.66)
(6.67)
This computation process may be extended recursively to more than two factors, provided that all these factors are mutually prime. In practice, the small convolutions of size N_1 × N_1 and N_2 × N_2 are computed by polynomial transforms, and large two-dimensional convolutions can be obtained from a small set of polynomial transform algorithms. A convolution of size 15 × 15 can, for instance, be computed from convolutions of sizes 3 × 3 and 5 × 5. Since these convolutions are calculated by polynomial transforms with 13 multiplications, 70 additions and 55 multiplications, 369 additions, respectively (Table 6.3), nesting the two algorithms yields a total of 715 multiplications and 6547 additions for the convolution of size 15 × 15.
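As a check of the operation counts quoted above, the nesting rule can be written out explicitly: the multiplications of the two algorithms multiply together, each addition of the first algorithm acts on an N_2 × N_2 block, and each of its multiplications becomes a full N_2 × N_2 convolution. The short Python sketch below (our own illustration of this standard nesting argument) reproduces the 715 multiplications and 6547 additions of the 15 × 15 example.

    def nested_counts(M1, A1, N2, M2, A2):
        # M1, A1: multiplications and additions of the N1 x N1 algorithm
        # M2, A2: multiplications and additions of the N2 x N2 algorithm
        mults = M1 * M2                    # each multiplication becomes an N2 x N2 convolution
        adds = A1 * N2 * N2 + M1 * A2      # each addition acts on an N2 x N2 block
        return mults, adds

    print(nested_counts(13, 70, 5, 55, 369))   # (715, 6547), as in the text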
Table 6.4 itemizes the arithmetic operation counts for two-dimensional convolutions computed by polynomial transforms and nesting. It can be seen that these algorithms require fewer multiplications and more additions per point than the approach using composite polynomial transforms corresponding to Table 6.3. The number of additions here can be further reduced by replacing the
conventional nesting by a split nesting technique (Sect. 3.3.2). In this case, the
number of arithmetic operations becomes as shown in Table 6.5. It can be seen
that the number of additions per point in this table is comparable to that
obtained with large composite polynomial transforms.
Polynomial transforms are particularly suitable for the evaluation of real con-
volutions because they then require only real arithmetic as opposed to complex
arithmetic with a DFT approach. Furthermore, when the polynomial products
are evaluated by polynomial product algorithms, the polynomial transform
approach does not require the use of trigonometric functions. Thus, the compu-
tation of two-dimensional convolutions by polynomial transforms can be com-
pared to the nesting techniques [6.4] described in Sect. 3.3.1 which have similar
characteristics. It can be seen, by comparing Table 3.4 with Tables 6.3 and 6.5,
that the polynomial transform method always requires fewer arithmetic opera-
tions than the nesting method used alone, and provides increased efficiency with
increasing convolution size. For large convolutions of sizes greater than 100 X
100, the use of polynomial transforms drastically decreases the number of
arithmetic operations. In the case of a convolution of 120 X 120, for example,
the polynomial transform approach requires about 5 times fewer multiplications
and 2.5 times fewer additions than the simple nesting method.
When a convolution is calculated via FFT methods, the computation requires
the use of trigonometric functions and complex arithmetic. Thus, a comparison
with the polynomial transform method is somewhat difficult, especially when
issues such as roundoff error and the relative cost of ancillary operations are
considered. A simple comparative evaluation can be made between the two
methods by assuming that two real convolutions are evaluated simultaneously
with the Rader-Brenner FFT algorithm (Sect. 4.6) and the row-column method.
In this case, the number of arithmetic operations corresponding to convolutions
with one fixed sequence and precomputed trigonometric functions is listed in
Table 4.7. Under these conditions, which are rather favorable to the FFT ap-
proach, the number of additions is slightly larger than that of the polynomial
transform method while the number of multiplications is about twice that of the
polynomial transform approach. Conventional radix-4 FFT algorithms or the
Winograd Fourier transform method would also require a significantly larger
number of arithmetic operations than the polynomial transform method.
Of all possible polynomial transforms, the most interesting are those defined
modulo (z^N + 1), with N = 2^t, because these transforms are computed without
multiplications and with a reduced number of additions by using a radix-2 FFT-
type algorithm. We have seen, in Sect. 6.2.1, that large two-dimensional con-
volutions are computed with these transforms by using a succession of stages,
where each stage is implemented with a set of four polynomial transforms. This
approach is very efficient from the standpoint of the number of arithmetic
operations, but has the disadvantage of requiring a number of reductions and
Chinese remainder reconstructions. In the following, we shall present an inter-
esting variation [6.5] in which a simplification of the original structure is obtained
at the expense of increasing the number of operations.
In order to introduce this method, we first establish that a one-dimensional
convolution y_l of length N, with N = 2^t, can be viewed as a polynomial product
modulo (ZN + 1), provided that the input and output sequences are multiplied
by powers of W, where W is a root of unity of order 2N. We consider the
circular convolution YI defined by
l = 0, ..., N - 1,   (6.68)
X(z) = Σ_{m=0}^{N-1} x_m W^m z^m.   (6.70)
(6.72)
where each coefficient a_l of A(z) corresponds to the products h_n x_m such that n + m = l or n + m = l + N. Since z^N ≡ -1, we have
(6.73)
(6.74)
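The weighting described here is easy to verify numerically. The sketch below (Python with numpy assumed; a direct double loop stands in for whatever fast method is actually used for the product modulo z^N + 1) multiplies the inputs by powers of W = e^{-jπ/N}, forms the negacyclic product, and removes the weighting from the result, recovering the ordinary circular convolution.

    import numpy as np

    def circular_via_negacyclic(h, x):
        # Circular convolution of length N obtained from a product modulo z**N + 1,
        # with input weighting by W**m and output weighting by W**(-l), W = exp(-j*pi/N).
        N = len(h)
        W = np.exp(-1j * np.pi / N)
        w = W ** np.arange(N)
        hp, xp = np.asarray(h) * w, np.asarray(x) * w
        a = np.zeros(N, dtype=complex)
        for n in range(N):
            for m in range(N):
                if n + m < N:
                    a[n + m] += hp[n] * xp[m]
                else:                       # z**N == -1: wrapped terms change sign
                    a[n + m - N] -= hp[n] * xp[m]
        return np.real_if_close(a * W ** (-np.arange(N)))

    h, x = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
    ref = [sum(h[n] * x[(l - n) % 4] for n in range(4)) for l in range(4)]
    print(np.allclose(circular_via_negacyclic(h, x), ref))   # True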
y_{u,l} = Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} h_{n,m} x_{u-n,l-m},   u, l = 0, ..., N - 1.   (6.75)
H_m(z) = Σ_{n=0}^{N-1} h_{n,m} W^n z^n,   W = e^{-jπ/N}   (6.77)

X_r(z) = Σ_{s=0}^{N-1} x_{s,r} W^s z^s,   m, r = 0, ..., N - 1   (6.78)

A_l(z) = Σ_{n=0}^{N-1} a_{n,l} z^n,   l = 0, ..., N - 1   (6.79)
The most important part of the calculations corresponds to the evaluation of the polynomial convolution A_l(z) defined modulo (z^N + 1) corresponding to (6.76).
[Fig. 6.6: polynomial transforms modulo (z^N + 1) of size N with root z², N polynomial multiplications modulo (z^N + 1), inverse polynomial transform modulo (z^N + 1) of size N with root z², and output weighting by W^{-u}]
We note, however, that we can always define a polynomial transform of length N modulo (z^N + 1). Hence, the two-dimensional convolution y_{u,l} can be computed with only three polynomial transforms of length N and roots z², as shown in Fig. 6.6. When one of the input sequences, h_{n,m}, is fixed, the corresponding polynomial transform needs to be computed only once and the evaluation of y_{u,l} requires only two polynomial transforms.
[Figure: reductions modulo (z^{N/2} + 1) and (z^{N/2} - 1), polynomial transforms modulo (z^{N/2} + 1) of size N with root z, N polynomial products modulo (z^{N/2} + 1), and inverse polynomial transforms modulo (z^{N/2} + 1) of size N with root z]
H_{m_1,m_2}(z) = Σ_{n=0}^{p-1} h_{n,m_1,m_2} z^n,   m_1, m_2 = 0, ..., p - 1   (6.81)

X_{r_1,r_2}(z) = Σ_{s=0}^{p-1} x_{s,r_1,r_2} z^s,   r_1, r_2 = 0, ..., p - 1   (6.82)
u = 0, ... ,p - 1. (6.85)
where
(6.87)
with a similar definition for the inverse transform. The two-dimensional polynomial transform in (6.86) supports circular convolution because z is a root of order N in the field of polynomials modulo P(z). Hence, it may be used to compute the polynomial convolution Y_{l_1,l_2}(z) with
(6.90)
(6.91)
(6.92)
and
with
3 × 3 × 3            40      325
3 × 3 × 3 × 3       121     1324
6 × 6 × 6           320     3896
6 × 6 × 6 × 6      1936    31552
[Figure: polynomial transform modulo P(z) of size p × p with root z, p² polynomial products modulo P(z), and inverse polynomial transform modulo P(z) of size p × p with root z]
(7.2)
(7.3)
It can easily be verified that (7.2-4) are equivalent to (7.1) by noting that the definition of (7.4) modulo (z - W^{k_2}) is equivalent to substituting W^{k_2} for z in (7.2) and (7.3). It should also be noted that although the definition of X_{k_1}(z) modulo (z^N - 1) is superfluous at this stage, it is valid, since z^N ≡ W^{Nk_2} = 1.
In order to simplify the presentation, we assume now that N is an odd prime, with N = p. Thus, z^p - 1 is the product of two cyclotomic polynomials

z^p - 1 = (z - 1)P(z)   (7.5)

(7.7)

For k_2 ≢ 0 modulo p, W^{k_2} is always a root of P(z), since

(7.8)

and X_{k_1,k_2} may be obtained by substituting W^{k_2} for z in (7.2). Since z - W^{k_2} is a factor of P(z) and P(z) is a factor of z^p - 1, (7.4) becomes
X_{k_1,k_2} ≡ {[X_{k_1}(z) modulo (z^p - 1)] modulo P(z)} modulo (z - W^{k_2}).   (7.9)
k_1 = 0, ..., p - 1   (7.10)

(7.11)

k_2 = 1, ..., p - 1.   (7.12)
Since p is an odd prime and k_2 ≠ 0, the permutation k_2k_1 modulo p maps all values of k_1, and we obtain, by replacing k_1 with k_2k_1,
X̄_{k_2k_1}(z) ≡ Σ_{n_1=0}^{p-1} X_{n_1}(z) W^{k_2n_1k_1} modulo P(z)   (7.13)

X_{k_2k_1,k_2} ≡ X̄_{k_2k_1}(z) modulo (z - W^{k_2}).   (7.14)

X̄_{k_2k_1}(z) ≡ Σ_{n_1=0}^{p-1} X_{n_1}(z) z^{n_1k_1} modulo P(z),   (7.15)

X̄_{k_2k_1}(z) = Σ_{l=0}^{p-2} y_{k_1,l} z^l.   (7.16)
k_2 = 1, ..., p - 1,   (7.17)

M = (p + 1)M_1   (7.18)
[Figure: reduction modulo P(z) = (z^p - 1)/(z - 1), polynomial transform modulo P(z) of length p with root z, and p reduced DFTs (p correlations of p - 1 points)]
term is not computed. The simplification of (7.17) is based upon the fact that, for k_2 ≠ 0 and p prime,

Σ_{l=1}^{p-1} W^{k_2l} = -1.   (7.19)
k_2 = 1, ..., p - 1,   (7.20)

where

l = 1, ..., p - 2.   (7.21)
In the DFTs defined by (7.20), the first input and output terms are missing. Thus, these DFTs are usually called reduced DFTs. They can be computed as correlations of length p - 1 by using Rader's algorithm [7.3] (Sect. 5.2). In this case, if g is a primitive root modulo p, l and k_2 are redefined by

l ≡ g^u modulo p
k_2 ≡ g^v modulo p,   u, v = 0, ..., p - 2.   (7.22)
Under these conditions, the reduced DFT (7.20) is converted into a correlation
(7.23)
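A small program makes the reindexing explicit. The sketch below (Python; the names are ours, and W is taken as the ordinary p-th root of unity e^{-j2π/p}) applies the substitutions (7.22) to the reduced DFT and evaluates it as a length-(p - 1) correlation; g must be a primitive root modulo p.

    import numpy as np

    def reduced_dft_by_rader(x, p, g):
        # x[1..p-1] are the nonzero input samples (index 0 unused).
        # Returns X[k2] for k2 = 1..p-1 as a correlation of length p - 1.
        W = np.exp(-2j * np.pi / p)
        perm = [pow(g, u, p) for u in range(p - 1)]      # l = g**u modulo p
        X = {}
        for v in range(p - 1):                           # k2 = g**v modulo p
            k2 = perm[v]
            X[k2] = sum(x[perm[u]] * W ** perm[(u + v) % (p - 1)] for u in range(p - 1))
        return X

    # Check against the direct reduced DFT for p = 7 (3 is a primitive root modulo 7).
    x = [0, 1, 2, 3, 4, 5, 6]
    W = np.exp(-2j * np.pi / 7)
    direct = {k2: sum(x[l] * W ** (l * k2) for l in range(1, 7)) for k2 in range(1, 7)}
    print(all(abs(reduced_dft_by_rader(x, 7, 3)[k] - direct[k]) < 1e-9 for k in range(1, 7)))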
The sequence y'_{k_1,l} can be constructed from the sequence y_{k_1,l} without additions by noting that it is equivalent to a multiplication of X̄_{k_2k_1}(z) by z^{-1} modulo P(z).

M = (p + 1)M_1 - p.   (7.24)
column method and always less (except for p = M 1) than the number of multi-
plications required by the Winograd algorithm. When the DFTs and reduced
DFTs of size p are evaluated by Rader's algorithm, all complex multiplications
reduce to multiplications by pure real or pure imaginary numbers and can be
implemented with only two real multiplications. In this case, the number of real
multiplications required to evaluate the DFT of size p x p by polynomial
transforms becomes 2(p + 1)M_1 - 2p.
(7.25)
(7.26)
(7.27)
X¹_{k_1}(z) ≡ Σ_{n_1=0}^{N-1} X¹_{n_1}(z) W^{n_1k_1} modulo (z^{N/2} + 1)   (7.29)

X¹_{n_1}(z) = Σ_{n_2=0}^{N/2-1} (x_{n_1,n_2} - x_{n_1,n_2+N/2}) z^{n_2} ≡ X_{n_1}(z) modulo (z^{N/2} + 1)   (7.30)
(7.31)
Since k_2 is odd and N is a power of two, the permutation k_2k_1 modulo N maps all values of k_1 and we obtain, by replacing k_1 with k_2k_1,
X¹_{k_2k_1}(z) ≡ Σ_{n_1=0}^{N-1} X¹_{n_1}(z) W^{k_2n_1k_1} modulo (z^{N/2} + 1)   (7.32)

X¹_{k_2k_1}(z) ≡ Σ_{n_1=0}^{N-1} X¹_{n_1}(z) z^{n_1k_1} modulo (z^{N/2} + 1),   (7.34)
(7.35)
(7.36)
(7.37)
By reversing the roles of k_1 and 2u, this DFT can be represented in polynomial notation as

X̄_{2u}(z) ≡ Σ_{n_2=0}^{N/2-1} X_{n_2}(z) W^{2un_2} modulo (z^N - 1)   (7.39)
(7.40)
(7.41)
We may then use the same polynomial transform method as above to compute X_{k_1,2u}. The polynomial z^N - 1 factors into the two polynomials z^{N/2} - 1 and z^{N/2} + 1, and the roots of z^{N/2} - 1 correspond to W^{k_1}, k_1 even. Therefore, for k_1 even, with k_1 = 2v, X_{k_1,2u} reduces to a simple DFT of size (N/2) × (N/2). For k_1 odd, the W^{k_1} are the roots of z^{N/2} + 1 and (7.39, 40) can be defined modulo (z^{N/2} + 1) instead of modulo (z^N - 1). In this case, X_{k_1,2u} can be computed using a polynomial transform of length N/2 in a way similar to that discussed above for X_{k_1,k_2}, k_2 odd. This is accomplished with

X¹_{2uk_1}(z) ≡ Σ_{n_2=0}^{N/2-1} X¹_{n_2}(z) z^{2un_2} modulo (z^{N/2} + 1),   u = 0, ..., N/2 - 1   (7.43)

k_1 odd.   (7.45)
[Figure: N polynomials of N terms; reduction modulo z^{N/2} + 1; polynomial transform modulo z^{N/2} + 1 of length N with root z; polynomial transform modulo z^{N/2} + 1 of length N/2 with root z²; a DFT of size (N/2) × (N/2); outputs X_{k_1,k_2} for k_1 odd and k_2 even]
Table 7.1. Number of real operations for complex DFTs of size N × N computed by polynomial transforms with the reduced Rader-Brenner DFT algorithm. N = 2^t. Trivial multiplications by ±1, ±j are not counted

DFT size         Multiplications   Additions    Mult. per point   Add. per point
2 × 2                      0             16          0                 4.00
4 × 4                      0            128          0                 8.00
8 × 8                     48            816          0.75             12.75
16 × 16                  432           4528          1.69             17.69
32 × 32                 2736          24944          2.67             24.36
64 × 64                15024         125040          3.67             30.53
128 × 128              76464         599152          4.67             36.57
256 × 256             371376        2790512          5.67             42.58
512 × 512            1747632       12735600          6.67             48.58
1024 × 1024          8039088       57234544          7.67             54.58
The same general approach can also be employed to compute DFTs of size N × N, with N = p^c, p an odd prime. If, for instance, N = p², z^{p²} - 1 factors into the three cyclotomic polynomials P_1(z) = z - 1, P_2(z) = z^{p-1} + z^{p-2} + ... + 1, and P_3(z) = z^{p(p-1)} + z^{p(p-2)} + ... + 1. In this case, a DFT of size p² × p² is computed as shown in Fig. 7.3 with one polynomial transform of p² terms modulo P_3(z), one polynomial transform of p terms modulo P_3(z), p² + p reduced DFTs of length p², and one DFT of size p × p. This last DFT can in turn be evaluated by polynomial transforms. In this approach, each of the reduced DFTs of size p² is such that only the first p(p - 1) input samples are nonzero and that the output samples with indices multiple of p are not computed. These reduced DFTs are equivalent to one correlation of p(p - 1) terms plus one reduced DFT of length p.
[Fig. 7.3. Computation of a DFT of size p² × p² by polynomial transforms, p odd prime: reduction modulo P_3(z) = (z^{p²} - 1)/(z^p - 1), polynomial transforms of p² and of p terms modulo P_3(z), p² + p reduced DFTs of length p², and branches for k_2 ≡ 0 modulo p, k_1 ≢ 0 modulo p, and k_2 ≢ 0 modulo p]
Table 7.2. Main parameters for DFTs of size N × N computed by polynomial transforms

Table 7.3. Number of complex operations for simple two-dimensional DFTs evaluated by polynomial transforms. Trivial multiplications are given between parentheses. Each complex multiplication is implemented with two real multiplications

DFT size    Number of multiplications    Number of additions
2 × 2                4 (4)                        8
3 × 3                9 (1)                       36
4 × 4               16 (16)                      64
5 × 5               31 (1)                      221
7 × 7               65 (1)                      635
8 × 8               64 (40)                     408
9 × 9              105 (1)                      785
16 × 16            304 (88)                    2264
NxNxN
(7.46)
(7.48)
(7.50)
(7.53)
The same DFT is computed with d p^{d-1} M_1 complex multiplications by the row-column method. Therefore, the number of multiplications is approximately reduced by a factor of d when the row-column method is replaced by the polynomial transform approach. Thus, the efficiency of the polynomial transform method, relative to the row-column algorithm, is proportional to d. A similar result is also obtained when the polynomial transform method is compared to a nesting technique, since the number of multiplications for nesting is M_1^d, with M_1 > p for p ≠ 3. This point is illustrated more clearly by considering the case of a DFT of size 7 × 7 × 7 which is computed with 457 complex multiplications by polynomial transforms, as opposed to 1323 and 729 multiplications when the calculations are done by the row-column method and the nesting algorithm, respectively.
A similar polynomial transform approach applies to any d-dimensional DFT
with common factors in several dimensions and we give the number of complex
arithmetic operations in Table 7.4 for some complex three-dimensional DFTs
computed by polynomial transforms.
Table 7.4. Number of complex operations for simple three-dimensional DFTs evaluated by polynomial transforms. Trivial multiplications are given between parentheses. Each complex multiplication is implemented with two real multiplications

DFT size         Number of multiplications    Number of additions
2 × 2 × 2                 8 (8)                        24
3 × 3 × 3                27 (1)                       162
4 × 4 × 4                64 (64)                      384
5 × 5 × 5               156 (1)                      1686
7 × 7 × 7               457 (1)                      6767
8 × 8 × 8               512 (288)                    4832
9 × 9 × 9               963 (1)                     10383
16 × 16 × 16           4992 (1184)                  52960
DFT of size (N_1 × N_1) × (N_2 × N_2) by using Good's mapping algorithm [7.6]. With this approach, the four-dimensional DFT is, in turn, computed using Winograd nesting [7.7] by calculating, by polynomial transforms, a DFT of size N_1 × N_1 in which each scalar is replaced by an array of N_2 × N_2 terms and each multiplication is replaced by a DFT of size N_2 × N_2 computed by polynomial transforms. Thus, if M_1, M_2, M and A_1, A_2, A are, respectively, the number of complex multiplications and additions required to evaluate the DFTs of sizes N_1 × N_1, N_2 × N_2, and N_1N_2 × N_1N_2, we have

(7.55)

(7.56)

The four-dimensional DFT of size (N_1 × N_1) × (N_2 × N_2) can also be computed by the row-column method as N_1² DFTs of dimension N_2 × N_2 plus N_2² DFTs of dimension N_1 × N_1. In this case, we have

(7.57)

(7.58)

Since M_1 ≥ N_1² and M_2 ≥ N_2², the nesting method generally requires more additions than the row-column method, except when M_1 = N_1².
Table 7.5. Number of real operations for complex multidimensional DFTs evaluated by poly-
nomial transforms and nesting. Trivial multiplications by ± 1, ±j are not counted
However, for short DFTs, M_1 and M_2 are not much larger than N_1² and N_2², and the nesting
method requires fewer multiplications than the row-column algorithm, and a
number of additions which is about the same. Thus, the nesting algorithm is
generally better suited for DFTs of moderate sizes whereas the prime factor
technique is best for large DFTs.
With both methods, large DFTs can be evaluated using a small set of short
length DFTs computed by polynomial transforms. Moreover, additional com-
putational savings can be obtained by splitting the calculations with the tech-
niques discussed in Sects. 5.3.3 and 5.4.3.
Table 7.5 gives the number of real operations for complex multidimensional
DFTs computed by nesting the small multidimensional DFTs evaluated by
polynomial transforms for which data are tabulated in Tables 7.3 and 7.4. It can
be seen by comparison with Table 7.1 that this method requires only about half
the number of multiplications of the large polynomial transform approach with
size N = 2^t, but uses more additions. We shall see, however, that when this
method is combined with split nesting and another polynomial transform
method, significant additional reduction in the number of operations is made
possible.
where the symbol W represents e^{-jπ/N} instead of e^{-j2π/N} for reasons that will be
apparent later.
We first rewrite (7.59) as
(7.61)
(7.62)
(7.63)
(7.65)
(7.67)
[Figure: ordering of polynomials, polynomial transform modulo (z^N + 1) of size N with root z², and N reduced DFTs of N terms]
(7.71)
(7.72)
which demonstrates that the polynomial transform approach is better than the
row-column method for N > 16 and reduces the number of multiplications by
half for large transforms. It should also be noted that the polynomial transform
approach reduces the number of additions by about 15% for large transforms.
Therefore, the foregoing polynomial transform method reduces significantly
the number of arithmetic operations while retaining the structural simplicity of
the row-column radix-2 FFT algorithm. In practice, the reduced DFTs will
usually be calculated via the Rader-Brenner algorithm [7.5] because all complex
multiplications are then implemented with only two real multiplications.
The number of arithmetic operations can be further reduced by modifying the
ring structure for only part of the procedure. This may be realized by computing
one or several stages with the method described in Sect. 7.1.2 and by completing
the calculations with the modified ring technique. In the case of a one-stage
process, the DFT X_{k_1,k_2} of size N × N is redefined by
(7.75)
(7.76)
X¹_{k_2k_1}(z) ≡ Σ_{n_1=0}^{N-1} X¹_{n_1}(z) z^{n_1k_1} modulo (z^{N/2} + 1)   (7.77)

X¹_{n_1}(z) ≡ X_{n_1}(z) modulo (z^{N/2} + 1)   (7.78)
k_2 odd.   (7.79)

For k_2 even, X_{k_1,k_2} reduces to a DFT of size N × (N/2) which is computed using a ring translation technique
(7.80)
X¹_{(k_2+1)k_1}(z) ≡ Σ_{n_1=0}^{N-1} X¹_{n_1}(z) z^{n_1k_1} modulo (z^{N/2} + 1)   (7.81)
which indicates that the DFT X_{k_1,k_2} of size N × N is computed as shown in Fig. 7.5 with only N²/2 premultiplications by W^{-n_2}, plus two polynomial transforms
[Fig. 7.5: two polynomial transforms modulo (z^{N/2} + 1) of size N with root z]
defined modulo (z^{N/2} + 1) and 2N reduced DFTs of size N/2. When the reduced DFTs are computed by a simple radix-2 FFT algorithm, the number of real multiplications M_3 and real additions A_3 become

M_3 = 2N²(2 + log2 N)   (7.83)
We have seen in the preceding sections that multidimensional DFTs can be efficiently partitioned by polynomial transforms into one-dimensional DFTs and reduced DFTs. This method is mainly applicable to DFTs having common factors in two or more dimensions and therefore does not apply readily to one-dimensional DFTs. In this section, we shall present a second way of computing DFTs by polynomial transforms [7.1, 2, 9]. This method is based on the decomposition of a composite DFT into multidimensional correlations via the Winograd [7.7] algorithm and on the computation of these multidimensional correlations by polynomial transforms when they have common factors in several dimensions. This method is applicable in general to multidimensional DFTs and also to some one-dimensional DFTs.
In order to simplify the presentation, we shall assume that N_1 and N_2 are prime. For k_2 = 0, X_{k_1,k_2} becomes a DFT of length N_1

k_1 = 0, ..., N_1 - 1.   (7.86)

k_2 = 1, ..., N_2 - 1   (7.87)

(7.88)

Since N_2 is a prime, and n_2, k_2 ≠ 0, we can map X_{0,k_2} into a correlation of length N_2 - 1 by using Rader's algorithm [7.3] with

n_2 ≡ g^{u_2} modulo N_2
k_2 ≡ g^{v_2} modulo N_2,   u_2, v_2 = 0, ..., N_2 - 2   (7.89)

(7.90)

(7.91)
[Fig. 7.6. Computation of a DFT of size 7 × 7 by the Winograd algorithm and polynomial transforms]
forms. The same technique can also be applied recursively to accommodate the
case of more than two factors or factors that are composite (Sect. 3.3.1).
When N_1 = N_2, all factors in both dimensions are common and a polynomial
Table 7.6. Number of real operations for complex DFTs computed by multidimensional
correlations and polynomial transforms. Trivial multiplications by ± 1, ± j are not counted
For large multidimensional DFTs, the two polynomial transform methods can
be combined by converting the multidimensional DFT into a set of one-dimen-
sional DFTs by use of a polynomial transform mapping and, then, by computing
these one-dimensional DFTs via a multidimensional correlation polynomial
transform mapping. With this technique, a DFT of size 63 X 63, for instance, is
calculated by nesting DFTs of size 7 X 7 and 9 X 9 evaluated by the first poly-
nomial transform method. Hence, the DFT of size 7 X 7 is partitioned into 1
multiplication plus 8 correlations of 6 terms, and the DFT of size 9 X 9 is
mapped into 33 multiplications plus 12 correlations of 6 terms. Thus, the DFT
of size 63 X 63 is computed with 33 multiplications, 276 correlations of 6 terms,
and 96 correlations of size 6 X 6. When the (6 X 6)-point correlations are
computed by polynomial transforms, the DFT of size 63 X 63 is calculated
with only 11344 real multiplications as opposed to 13648 multiplications when
the first polynomial transform method is used alone and 19600 multiplications
for the conventional Winograd nesting algorithm. It should be noted that com-
bining the two polynomial transform methods also reduces the number of
additions.
Table 7.7 lists the number of real operations for complex DFTs computed
by combining the two polynomial transform methods with the split nesting
technique. It can be seen by comparison with Table 7.1 that the combined poly-
nomial transform method requires about half the number of multiplications of
the first polynomial transform method for large transforms. In practice, the
number of multiplications required by this method is always very small, as
exemplified by a DFT of size 1008 X 1008 which is calculated with only 3.39
real multiplications per point or about one complex multiplication per point.
It should be noted however that this low computation requirement is ob-
Table 7.7. Number of real operations for complex DFTs computed by combining the two
polynomial transform methods. Trivial multiplications by ± 1, ±j are not counted
plex multiplications, or 2N² log2 N real multiplications and N² log2 N real additions, while retaining the simple structure of the FFT implementation. The method given in Sect. 7.1.2 is essentially a generalization of this technique, which is derived by using a complete decomposition to eliminate the multiplications by W^{-n_2}.
When a large multidimensional DFT is evaluated by combining polynomial
transforms and nesting, as in Sects. 7.1.4 and 7.2.1, this method can be con-
sidered as a generalization of the Winograd algorithm in which small multidi-
mensional DFTs and correlations having common factors in several dimensions
are systematically partitioned into one-dimensional DFTs and correlations by
polynomial transform mappings.
In practice, significant computational savings are obtained by computing
DFTs by polynomial transforms. This can be seen by comparing the data given
in Tables 7.1 and 7.7 with those in Table 4.5 which corresponds to two-dimen-
sional DFTs calculated by the Rader-Brenner FFT algorithm and the row-
column method. It can be seen that the number of multiplications is reduced by
a factor of about 2 for large DFTs computed by the first polynomial transform
method used alone and by a factor of about 4 when the two polynomial trans-
form methods are combined. In both cases the number of additions is compara-
ble to and sometimes smaller than the number corresponding to the FFT ap-
proach.
A comparison with the Winograd-Fourier transform algorithm also demon-
strates a significant advantage in favor of polynomial transform methods. For
example, a DFT of size 1008 × 1008 is computed by the WFTA algorithm with
6.25 real multiplications and 91.61 additions per point. This contrasts with the
first polynomial transform technique which requires 7.67 multiplications and
54.58 additions per point for a DFT of size 1024 X 1024 and the combination of
the two polynomial transform methods which requires 3.39 multiplications and
70.33 additions per point for a DFT of size 1008 X 1008.
16. For N = 2^t, with N > 16, the one-dimensional DFTs can be computed by the Rader-Brenner algorithm (Sect. 4.3) and the corresponding number of operations is given in Table 4.3.
For the reduced DFT algorithms, we have already seen in Sect. 7.1.1 that, when N is a prime, the reduced DFTs become correlations of N - 1 terms. Thus, these reduced DFTs may be computed by the algorithms of Sect. 3.7.1 with the corresponding number of complex operations given in Table 3.1. Large odd DFTs corresponding to N = 2^t can be computed by the Rader-Brenner algorithm as shown in Sect. 4.3 with an operation count given in Table 4.4.
We define in Sects. 7.4.1-4 reduced DFT algorithms for N = 4, 8, 9, 16. These algorithms are derived from the short DFT algorithms of Sect. 5.5 and compute q^{t-1}(q - 1) output samples of a DFT of length N = q^t. The reduced DFT is defined by
Table 7.8. Number of real operations for complex DFTs and reduced DFTs. Trivial multiplications by ±1, ±j are given between parentheses

N        Multiplications    Additions
DFTs:
2             4 (4)              4
3             6 (2)             12
4             8 (8)             16
5            12 (2)             34
7            18 (2)             72
8            16 (12)            52
9            22 (2)             88
16           36 (16)           148
32          104 (36)           424
64          272 (76)          1104
128         672 (156)         2720
256        1600 (316)         6464
512        3712 (636)        14976
1024       8448 (1276)       34048

Reduced DFTs:
3             4 (0)              8
4             4 (4)              4
5            10 (0)             30
7            16 (0)             68
8             8 (4)             20
9            16 (0)             56
16           20 (4)             64
32           68 (20)           212
64          168 (40)           552
128         400 (80)          1360
256         928 (160)         3232
512        2112 (320)         7488
1024       4736 (640)        17024
X̄_k = Σ_{n=0}^{q^{t-1}(q-1)-1} x_n W^{nk},   1 ≤ k ≤ N - 1,  k ≢ 0 modulo q,   j = √-1,   (7.92)
where the input sequence is labelled x_n, the output sequence is labelled X̄_k, and the last q^{t-1} input samples are zero. Input and output additions must be executed in the specified index numerical order. Table 7.8 summarizes the number of real operations for various complex DFTs and reduced DFTs used as building blocks in the polynomial transform algorithms. Trivial multiplications by ±1, ±j are given in parentheses.
X̄_1 = m_0 + m_1    X̄_3 = m_0 - m_1.

m_1 = [(2 cos u - cos 2u - cos 4u)/3] (x_1 - x_2)
m_2 = [(cos u + cos 2u - 2 cos 4u)/3] (x_2 - t_1)
m_3 = [(cos u - 2 cos 2u + cos 4u)/3] (t_1 - x_1)
m_4 = -jX_3 sin 3u
m_7 = j(x_1 - t_2) sin 2u

s_2 = m_1 + m_2 + m_0    s_3 = -m_2 + m_3 + m_0
s_4 = -m_1 - m_3 + m_0   s_5 = m_4 + m_5 + m_6
s_6 = -m_6 + m_7 + m_4   s_7 = -m_5 - m_7 + m_4

X̄_1 = s_2 + s_5    X̄_2 = s_3 - s_6    X̄_4 = s_4 + s_7
X̄_5 = s_4 - s_7    X̄_7 = s_3 + s_6    X̄_8 = s_2 - s_5.
s_1 = m_0 + m_1     s_2 = m_0 - m_1
s_4 = m_4 - m_2     s_5 = s_1 + s_3     s_6 = s_1 - s_3
s_7 = s_2 + s_4     s_8 = s_2 - s_4     s_9 = m_5 + m_6
s_10 = m_5 - m_6    s_11 = m_7 + m_8    s_12 = m_7 - m_9
s_13 = s_9 + s_11   s_14 = s_9 - s_11   s_15 = s_10 + s_12
s_16 = s_10 - s_12
Most of the fast convolution techniques discussed so far are essentially algebraic
methods which can be implemented with any type of arithmetic. In this chapter,
we shall show that the computation of convolutions can be greatly simplified
when special arithmetic is used. In this case, it is possible to define number theoretic transforms (NTT) which have a structure similar to the DFT, but with complex exponential roots of unity replaced by integer roots and all operations defined modulo an integer. These transforms have the circular convolution property and can, in some instances, be computed using only additions and multiplications by a power of two. Hence, significant computational savings can be realized if NTTs are executed in computer structures which efficiently implement modular arithmetic.
We begin by presenting a general definition of NTTs and by introducing the two most important NTTs, the Mersenne transform and the Fermat number transform (FNT). Then, we generalize our definition of the NTT to include complex transforms and pseudo transforms. Finally, we conclude the chapter by discussing several implementation issues and establishing a theoretical relationship between NTTs and polynomial transforms.
Let x_m and h_n be two N-point integer sequences. Our objective is to compute the circular convolution y_l of dimension N
(8.1)
In most practical cases, hn and Xm are not sequences of integers, but it is always
possible, by proper scaling, to reduce these sequences to a set of integers. We
shall first assume that all arithmetic operations are performed modulo a prime
number q, in the field GF(q). If h_n and x_m are so scaled that |y_l| never exceeds q/2, y_l has the same numerical value modulo q that would be obtained in normal
arithmetic.
Under these conditions, the calculation of YI can be simplified by introducing
a number theoretic transform [8.1-3] having the same structure as a DFT, but
with the complex exponentials replaced by an integer g and with all operations
performed modulo q. The direct NTT of h_n is, thus,
H̄_k ≡ Σ_{n=0}^{N-1} h_n g^{nk} modulo q,   (8.2)
with a similar relation for the NTT X̄_k of x_m. Since q is a prime, N has an inverse N^{-1} modulo q, and we define an inverse transform as
a_l ≡ N^{-1} Σ_{k=0}^{N-1} ā_k g^{-lk} modulo q,   (8.3)
where

N N^{-1} ≡ 1 modulo q.   (8.4)
Note that, since q is a prime, g also has an inverse g^{-1} modulo q. Thus, the notation g^{-lk} is valid.
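As an illustration, (8.2-8.4) translate directly into a few lines of Python (our own sketch; pow(a, -1, q) requires Python 3.8 or later):

    def ntt(x, g, q):
        # Direct NTT (8.2): all arithmetic modulo q, complex exponentials replaced by g.
        N = len(x)
        return [sum(x[n] * pow(g, n * k, q) for n in range(N)) % q for k in range(N)]

    def intt(X, g, q):
        # Inverse NTT (8.3), using N**-1 and g**-1 modulo q.
        N = len(X)
        n_inv, g_inv = pow(N, -1, q), pow(g, -1, q)
        return [n_inv * sum(X[k] * pow(g_inv, l * k, q) for k in range(N)) % q for l in range(N)]

    # q = 17 is prime and g = 4 has order 4 modulo 17, so a length-4 transform exists.
    print(intt(ntt([1, 2, 3, 4], 4, 17), 4, 17))   # [1, 2, 3, 4]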
We would now like to establish the conditions which must be met for the transform (8.2) to support circular convolution. Computing the NTTs H̄_k and X̄_k of h_n and x_m, multiplying H̄_k by X̄_k, and evaluating the inverse transform a_l of H̄_k X̄_k yields

a_l ≡ N^{-1} Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} h_n x_m Σ_{k=0}^{N-1} g^{(n+m-l)k} modulo q.   (8.5)
Let

S ≡ Σ_{k=0}^{N-1} g^{(n+m-l)k} modulo q.   (8.6)
If the NTTs support convolution, then (8.5) must reduce to (8.1) and we must have S ≡ N for t = n + m - l ≡ 0 modulo N and S ≡ 0 for t ≢ 0 modulo N. The first condition means that the exponents of g must be defined modulo N, and this implies that

g^N ≡ 1 modulo q.   (8.7)
g^N ≡ 1 modulo q.   (8.9)
The last condition which must be satisfied to support the circular convolution property corresponds to (8.8) and implies not only that g is a root of order N modulo q, but also, since q is composite, that [(g^t - 1), q] = 1. Hence, the following existence theorem may be stated.
Theorem 8.3: An NTT of length N and root g, defined modulo a composite integer q, supports circular convolution if and only if the following conditions are met:

g^N ≡ 1 modulo q
N N^{-1} ≡ 1 modulo q
Note that the condition [(g^t - 1), q] = 1 is stronger than just stating that g must be a root of order N modulo q. This can be seen, for instance, in the case corresponding to q = 15. In this case, 2 is a root of order 4 modulo 15, since the 4 powers of two 2^0, 2^1, 2^2, 2^3 are all distinct and 2^4 ≡ 1 modulo 15. However, we have 2^2 - 1 = 3, and 3 is not relatively prime to 15. In practice, the condition [(g^t - 1), q] = 1 can be replaced by a more restrictive condition by noting that it corresponds to the need to ensure that

S ≡ Σ_{k=0}^{N-1} g^{tk} ≡ 0 modulo q,   for t = 1, ..., N - 1.   (8.11)
The following theorem, due to Erdelsky [8.4], specifies the conditions required for the existence of NTTs which support circular convolution.
Theorem 8.4: An NTT of length N and root g, defined modulo a composite integer q, supports circular convolution if and only if the following conditions are met:

g^N ≡ 1 modulo q
N N^{-1} ≡ 1 modulo q
[(g^d - 1), q] = 1 for every integer d such that N/d is a prime.

(Or equivalently, Σ_{k=0}^{N-1} g^{dk} ≡ 0 modulo q for every d such that N/d is a prime.)
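Theorem 8.4 can be checked mechanically for any candidate (N, g, q). The sketch below (Python; sympy is assumed only for the prime factorization of N) tests the three conditions and, applied to the q = 15 example above, rejects it:

    from math import gcd
    from sympy import primefactors

    def supports_circular_convolution(N, g, q):
        if pow(g, N, q) != 1:                    # g**N == 1 modulo q
            return False
        if gcd(N, q) != 1:                       # N must have an inverse modulo q
            return False
        for p in primefactors(N):                # every d = N/p with N/d prime
            if gcd(pow(g, N // p, q) - 1, q) != 1:
                return False
        return True

    print(supports_circular_convolution(4, 2, 15))   # False: 2**2 - 1 = 3 divides 15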
ql prime. (8.12)
In this case, the condition N N^{-1} ≡ 1 implies that N must be relatively prime to q_1. Moreover, g is of necessity relatively prime to q_1, because the condition g^N ≡ 1 modulo q implies g^N ≡ 1 modulo q_1. Therefore, for each g relatively prime to q, we have, by Euler's theorem (theorem 2.3),
We can now demonstrate the following theorem, which establishes the existence of an NTT defined modulo q_1^r and of length q_1 - 1.
Theorem 8.5: Given an NTT which supports circular convolution when defined
modulo q_1, q_1 prime, with the root g_1 and the length q_1 - 1, there is always an NTT of length q_1 - 1 when defined modulo q_1^r. This NTT supports circular convolution and its root is g = g_1^{q_1^{r-1}}.
In order to demonstrate this theorem, we note first that the existence of the NTT defined modulo q_1 implies that (g_1, q_1) = 1. Then, Euler's theorem implies that g^{q_1-1} = g_1^{q_1^{r-1}(q_1-1)} ≡ 1 modulo q_1^r. Moreover, since q_1 - 1 has no common factors with q_1, (q_1 - 1) is mutually prime with q_1 and q_1^r and has therefore an inverse modulo q_1^r. We also note that the existence of the NTT defined modulo q_1 implies that [(g_1^s - 1), q_1] = 1 for s = 1, ..., q_1 - 2. Thus, g_1^s - 1 is not a multiple of q_1, and since, by Fermat's theorem (theorem 2.4), g_1^{q_1} ≡ g_1 modulo q_1, we have, by systematically replacing g_1 by g_1^{q_1}, g^s - 1 ≡ g_1^{q_1^{r-1}s} - 1 ≡ g_1^s - 1 modulo q_1. This means that g^s - 1 has no common factors with q_1 for s = 1, ..., q_1 - 2. Hence the three conditions of theorem 8.3 are met and this completes the proof of theorem 8.5.
We can now consider any composite integer q given by its unique prime power factorization.
By the Chinese remainder theorem (theorem 2.1), the N-length circular convolution modulo q can be calculated by evaluating separately the N-length convolutions modulo each q_i^{r_i} and performing a Chinese remainder reconstruction to recover the convolution modulo q from the convolutions modulo q_i^{r_i}. Therefore, an N-length NTT which supports circular convolution will exist if and only if N-length NTTs exist modulo each factor q_i^{r_i}. Theorem 8.5 shows that this is the case if N | (q_i - 1). Thus, N must divide the greatest common divisor (GCD) of the (q_i - 1) and we have the following existence theorem.
Theorem 8.6: A length-N NTT defined modulo q, with q = q_1^{r_1} ... q_i^{r_i} ... q_e^{r_e}, supports circular convolution if and only if

N | GCD[(q_1 - 1), (q_2 - 1), ..., (q_i - 1), ..., (q_e - 1)].   (8.16)
g and are, therefore, not significantly simpler than DFTs, except that the complex exponentials in the DFTs are replaced by real integers. Thus, real convolutions are computed via NTTs with real arithmetic, instead of complex arithmetic as with the DFT approach. We shall see, however, in the following sections that when the modulus q is properly selected, the multiplications by powers of g are replaced by simple shifts, thereby simplifying the NTT computations considerably. Another advantage of computing convolutions by NTTs instead of FFTs is that the convolutions are computed exactly, without round-off errors. Thus, the NTT approach is well adapted to high accuracy computations.
Since the NTTs have the same structure as DFTs, they have the same general properties as the DFTs and the reader can refer to Sect. 4.1 for a description of them. We simply note here that the NTT definition implies that the linearity property
Theorem 8.6 implies the existence of a length-N NTT which supports circular convolution modulo a Mersenne number q = 2^p - 1, p prime, provided that N divides all the q_i - 1, where the q_i^{r_i} are the factors of q. Some of the Mersenne numbers are primes. For these numbers, the possible transform lengths are given by N | (q - 1).
(8.20)
For the last condition, we note that, since p is a prime, the set of exponents tk modulo p in S = Σ_{k=0}^{p-1} 2^{tk} is a simple permutation of k. Thus, for t ≢ 0 modulo p,

Σ_{k=0}^{p-1} 2^{tk} ≡ Σ_{k=0}^{p-1} 2^k ≡ 2^p - 1 ≡ 0 modulo (2^p - 1).   (8.22)
Hence we can define a p-point Mersenne transform having the circular convolution property by

X̄_k ≡ Σ_{m=0}^{p-1} x_m 2^{mk} modulo (2^p - 1),   p prime,  k = 0, ..., p - 1   (8.23)

(8.24)

with

2^{-mk} ≡ 2^{(p-1)mk} modulo (2^p - 1).   (8.25)
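Because multiplication by 2^{mk} modulo 2^p - 1 is a circular shift of a p-bit word, the Mersenne transform needs only shifts and additions. A minimal Python sketch (the function names are ours):

    def shift_mod_mersenne(x, s, p):
        # x * 2**s modulo 2**p - 1, realized as a circular shift of p bits (0 <= x < 2**p - 1).
        q = (1 << p) - 1
        s %= p
        return ((x << s) | (x >> (p - s))) & q if s else x % q

    def mersenne_transform(x, p):
        # p-point Mersenne transform (8.23) of the integer sequence x, modulo 2**p - 1.
        q = (1 << p) - 1
        return [sum(shift_mod_mersenne(x[m] % q, m * k, p) for m in range(p)) % q for k in range(p)]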
[Figure: computation of a circular convolution by two Mersenne transforms and an inverse Mersenne transform]
For t = p, S = 1 - 1 + 1 - 1 + ... = 0.
Hence, for any Mersenne number, we can define a Mersenne transform of length 2p by

X̄_k ≡ Σ_{m=0}^{2p-1} x_m (-2)^{mk} modulo (2^p - 1)   (8.28)

with

(-2)^{-mk} ≡ (-2)^{(2p-1)mk} modulo (2^p - 1).   (8.30)
(8.32)
Hence
Consider now two integers x_m and h_n, with x_m defined by (8.31) and h_n defined by

(8.35)

(8.36)

we have

c_n = Σ_{i=0}^{p-1-d} x_{m,i} 2^{i+d} + Σ_{i=p-d}^{p-1} x_{m,i} 2^{i+d-p},   (8.37)
Given the two length-5 data sequences h_n and x_m, use the 5-point Mersenne transform defined modulo (2^5 - 1) to compute the circular convolution y_l of h_n and x_m. h_n and x_m are defined by

h_0 = 1   (8.39)

x_0 = 3   (8.40)

(8.41)
h̄_0    1
h̄_1    2   4   8   16      3
h̄_2    4   16  2   8       2
h̄_3    8   2   16  4       2
h̄_4    16  8   4   2       0        (8.42)
h̄_2 ≡ 18   (8.43)

x̄_2 ≡ 22   (8.44)
(8.47)
y_0    1               19
y_1    16  8   4   2    0
y_2    8   2   16  4   11
y_3    4   16  2   8    0
y_4    2   4   8   16  16        (8.48)

y_0 = 15,  y_1 = 15,  y_2 = 12,  y_3 = 13,  y_4 = 9.   (8.49)
The direct computation in ordinary arithmetic would produce the same result. Note, however, that if we had chosen larger input samples, some of the output samples of the convolution would have been greater than 30. In this case, the Mersenne transform approach would produce erroneous samples because of the reduction modulo 31.
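The whole procedure is easy to check numerically. The sketch below reuses mersenne_transform and shift_mod_mersenne from the previous section and runs the 5-point transform modulo 31 on small made-up sequences (not the sequences of the example above, whose full values are not repeated here); since every output sample stays below 31, the result agrees with the direct convolution.

    def inverse_mersenne_transform(X, p):
        # Inverse transform (8.24, 8.25): 2**(-mk) is replaced by 2**((p-1)mk) modulo 2**p - 1.
        q = (1 << p) - 1
        p_inv = pow(p, -1, q)
        return [p_inv * sum(shift_mod_mersenne(X[k] % q, (p - 1) * m * k, p) for k in range(p)) % q
                for m in range(p)]

    p, q = 5, 31
    h, x = [1, 1, 0, 1, 0], [2, 0, 1, 1, 0]                      # made-up test data
    H, X = mersenne_transform(h, p), mersenne_transform(x, p)
    y = inverse_mersenne_transform([H[k] * X[k] % q for k in range(p)], p)
    print(y == [sum(h[n] * x[(l - n) % p] for n in range(p)) for l in range(p)])   # True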
We have seen in Sect. 2.15 that the first five Fermat numbers, F_0 to F_4, are prime while all other known Fermat numbers are composite. When F_t is a prime, the maximum transform length is F_t - 1 = 2^{2^t}, and therefore, all possible transform lengths N correspond to N | 2^{2^t}.
When F_t is composite, every prime factor q_i of F_t is of the form

(8.50)

Hence, theorem 8.6 implies that we can always define an N-length transform modulo a composite Fermat number, provided that

(8.51)
We must now find the roots of these transforms. It is obvious that 2 is a root of order 2^{t+1} modulo F_t, since 2^{2^t} ≡ -1 and since 2^i takes the 2^{t+1} distinct values 1, 2, 2², ..., -1, -2, ..., -2^{2^t - 1} for i = 0, 1, 2, ..., 2^{t+1} - 1. This means that when F_t is a prime, we can define an FNT of length N = 2^{t+1} with root 2. For F_t composite, 2 is also a root of order 2^{t+1}, but we must also prove that 2^{t+1} has an inverse and that 2^{2^t} - 1 is mutually prime with 2^{2^t} + 1 (theorem 8.4). Since 2^{2^t} ≡ -1, the inverse of 2^{t+1} modulo F_t is obviously -2^{2^t - t - 1}. Moreover, we have 2^{2^t} - 1 = (2^{2^t} + 1) - 2. This means that any divisor of 2^{2^t} - 1 and 2^{2^t} + 1 should also divide 2. Thus, this divisor could only be 2, but this is impossible since 2^{2^t} - 1 and 2^{2^t} + 1 are odd.
Under these conditions, we can define a length-2^{t+1} FNT which supports circular convolution by

X̄_k ≡ Σ_{m=0}^{2^{t+1}-1} x_m 2^{mk} modulo (2^{2^t} + 1)   (8.52)

x_m ≡ -2^{2^t - t - 1} Σ_{k=0}^{2^{t+1}-1} X̄_k 2^{-mk} modulo (2^{2^t} + 1)   (8.53)

with
It follows immediately that, since 2 is a root of order 2^{t+1} modulo F_t, 2^{2^i} is a root of order 2^{t+1-i}. We can, therefore, always define FNTs of length 2^{t+1-i} with root 2^{2^i}.
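In code, the only operations needed are shifts and a sign change on wrap-around, since 2^{2^t} ≡ -1 modulo F_t. A minimal sketch of the length-2^{t+1} FNT (8.52) (Python, our own function names):

    def mul_pow2_mod_fermat(x, s, t):
        # x * 2**s modulo F_t = 2**(2**t) + 1; 2 has order 2**(t+1), and 2**(2**t) == -1.
        b = 1 << t
        F = (1 << b) + 1
        s %= 2 * b
        return (-(x << (s - b))) % F if s >= b else (x << s) % F

    def fermat_number_transform(x, t):
        # Length 2**(t+1) FNT (8.52) with root 2; only shifts and additions are used.
        b = 1 << t
        F = (1 << b) + 1
        N = 2 * b
        return [sum(mul_pow2_mod_fermat(x[m], m * k, t) for m in range(N)) % F for k in range(N)]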
We have shown above that it is possible to define FNTs of length 2^{t+2} when F_t is composite. Since the maximum number of distinct powers of ±2 modulo (2^{2^t} + 1) is equal to 2^{t+1}, the roots of the length-2^{t+2} FNTs can no longer be simple powers of two. We note that 2 is a root of order 2^{t+1} and therefore that √2 is a root of order 2^{t+2}. However, √2 has a very simple expression in a ring of Fermat numbers, since √2 ≡ 2^{2^{t-2}}(2^{2^{t-1}} - 1) modulo F_t.
Thus, FNTs have lengths which are powers of two, and can be computed using only additions and multiplications by powers of 2 for sizes up to N = 2^{t+2}. Larger FNTs can be defined modulo prime Fermat numbers, but in this case, the roots are no longer simple and the computation of these transforms requires general multiplications. Therefore, most practical applications are restricted to a maximum length equal to 2^{t+2}.
FNTs are superior to Mersenne transforms in several respects. As a first point of difference, it can be noted that FNTs permit much more flexibility in selecting the transform length as a function of the word length than Mersenne transforms. A second advantage of using FNTs relates to the highly composite length of such transforms. This makes it possible to evaluate an FNT with a reduced number of additions by use of a radix-2 FFT-type algorithm. To illustrate this point with a decimation-in-time algorithm, the length-2^{t+1} FNT defined by (8.52) can be calculated, in the first stage, by

X̄_k ≡ Σ_{m=0}^{N/2-1} x_{2m} 2^{2mk} + 2^k Σ_{m=0}^{N/2-1} x_{2m+1} 2^{2mk} modulo F_t   (8.56)

X̄_{k+N/2} ≡ Σ_{m=0}^{N/2-1} x_{2m} 2^{2mk} - 2^k Σ_{m=0}^{N/2-1} x_{2m+1} 2^{2mk} modulo F_t.   (8.57)
x_m = Σ_{i=0}^{v-1} x_{m,i} 2^i + x_{m,v} 2^v,   x_{m,i} ∈ (0, 1)

Since x_m ≤ 2^v, x_{m,v} is equal to 1 only if all the x_{m,i} are equal to zero for i < v. Negation can be realized by complementing all bits x_{m,i} of x_m, except x_{m,v}. Thus, if we treat x_{m,v} separately, we have
x̄_m + x_m = Σ_{i=0}^{v-1} 2^i = 2^v - 1

and

-x_m ≡ x̄_m + 2,   x_{m,v} = 0   (8.59)

-x_m ≡ 1,   x_{m,v} = 1.   (8.60)
(8.61)
cannot be greater than 2^{v+1} and, since 2^v ≡ -1 and c_{n,v+1} = 1 only for c_{n,i} = 0, i = 0, ..., v,

c_n ≡ Σ_{i=0}^{v-1} c_{n,i} 2^i - c_{n,v},   (8.62)
(8.63)

and, since 2^v ≡ -1,

c_n ≡ Σ_{i=0}^{v-1-d} x_{m,i} 2^{i+d} - Σ_{i=v-d}^{v-1} x_{m,i} 2^{i+d-v},   x_{m,v} = 0.   (8.65)

For x_{m,v} = 1, c_n reduces to

x_{m,v} = 1.   (8.66)

(8.67)
(8.68)
which indicates that the coded samples a_m are obtained by simply complementing the v least significant bits of x_m and adding 1 to x_m + x_{m,v} 2^v.
With this technique, the input data stream is encoded only once prior to transform computation and all operations are performed on the coded sequences, with a single decoding operation on the final result, this decoding operation being also defined by (8.68). We now demonstrate that arithmetic modulo Fermat numbers on the coded sequences a_m is much simpler than on the original sequences.
Consider first negation. If a_{m,v} = 1, then x_m = 0 and no modification is required. If a_{m,v} = 0, coding the complement ā_m of a_m yields

(8.69)

(8.70)

which shows, by comparison with (8.67), that ā_m is the coded representation of -x_m. Thus, negation is performed on the coded samples by a simple complementation except when a_{m,v} = 1. If a_m and b_m are the coded values of two integers x_m and h_n, the sum of a_m and b_m is given by

(8.71)

with

(8.72)
Thus,

d_m = e_m - 2^v ≡ Σ_{i=0}^{v-1} e_{m,i} 2^i + 1 - e_{m,v},   (8.74)

which indicates that addition in the transposed system is executed with ordinary adders, but with the high-order carry fed back, after complementation, to the least significant carry input of the adder. If one or both bits a_{m,v} and b_{m,v} are zero, one or both of the operands x_m and h_n are zero. In this case, the operation must be inhibited.
It can be verified easily, from the rules of addition, that multiplication by 2^d corresponds in the transposed system to a simple d-bit rotation around the v-bit
where h_n, x_m and y_l are the real parts of the input and output sequences and h̃_n, x̃_m, and ỹ_l are the imaginary parts of the input and output sequences. With the conventional approach, this complex convolution is calculated by evaluating four real convolutions:
y_l ≡ Σ_{n=0}^{N-1} h_n x_{l-n} - Σ_{n=0}^{N-1} h̃_n x̃_{l-n} modulo (2^v + 1)   (8.76)

ỹ_l ≡ Σ_{n=0}^{N-1} h̃_n x_{l-n} + Σ_{n=0}^{N-1} h_n x̃_{l-n} modulo (2^v + 1).   (8.77)
We shall now show that the evaluation of y_l + jỹ_l can be done with only two real convolutions by taking advantage of the special properties of j = √-1 in certain rings of integers [8.11, 12]. This is done by noting that in the ring of integers modulo (2^v + 1), with v even, we have 2^v ≡ -1, which means that j = √-1 is congruent to 2^{v/2}. y_l + jỹ_l is evaluated by first computing the two real auxiliary convolutions a_l and b_l defined by

Since 2^v ≡ -1, we have

(8.81)
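The auxiliary convolutions themselves are not reproduced in this copy, so the sketch below shows one standard way of exploiting j ≡ 2^{v/2}: form the two real sequences h ± 2^{v/2}·h̃ and x ± 2^{v/2}·x̃, convolve them, and separate y_l and ỹ_l from the two results. This is our illustration of the idea, not necessarily the exact equations (8.78-8.80) of the text; the answers are representatives modulo 2^v + 1.

    def complex_convolution_mod(h, ht, x, xt, v):
        # h, ht: real and imaginary input parts; x, xt likewise.  All arithmetic is done
        # modulo q = 2**v + 1, v even, with j replaced by J = 2**(v//2) since J*J == -1 modulo q.
        q, J, N = (1 << v) + 1, 1 << (v // 2), len(h)
        conv = lambda a, b: [sum(a[n] * b[(l - n) % N] for n in range(N)) % q for l in range(N)]
        A = conv([(h[n] + J * ht[n]) % q for n in range(N)],
                 [(x[n] + J * xt[n]) % q for n in range(N)])      # congruent to y + J*ytilde
        B = conv([(h[n] - J * ht[n]) % q for n in range(N)],
                 [(x[n] - J * xt[n]) % q for n in range(N)])      # congruent to y - J*ytilde
        inv2, invJ = pow(2, -1, q), pow(J, -1, q)
        y = [(A[l] + B[l]) * inv2 % q for l in range(N)]
        yt = [(A[l] - B[l]) * inv2 * invJ % q for l in range(N)]
        return y, yt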
When a convolution is computed via NTTs, all the calculations are executed on
integer data sequences and the convolution product is obtained modulo q without
roundoff errors. This feature provides significant advantage over other methods
when high accuracy is needed, but can also impose a requirement for relatively
long words to ensure that the result remains within the modulo range. In order
to analyze these limitations for arithmetic operations modulo q, we assume that
the two input sequences Xn and h n are integer sequences. The output Yn of the
ordinary convolution is given by
(8.82)
which means that the word length of the original input sequences must be slightly less than half of the word length corresponding to the modulus q. Tighter bounds on input signal amplitudes can be found using the L_r norms [8.13] defined by
||x||_r = ((1/N) Σ_{n=0}^{N-1} |x_n|^r)^{1/r},   r ≥ 1.   (8.85)

y_l is bounded by

|y_l| ≤ N ||x||_r ||h||_s   (8.86)
with
These bounds are better than (8.84), especially when the circular convolutions
are used to compute aperiodic convolutions by the overlap-add method, with
both input sequences padded out with zeros.
Thus, when convolutions are computed by NTTs, the only source of quan-
tization noise is the input quantization noise resulting from scaling and rounding
of the input sequences required to avoid overflow [8.14]. This implies that the
output signal-to-noise ratio (SNR) is relatively independent of the convolution
length N and increases by 3 dB for each increase of word length by one bit. By
contrast, when the convolution is evaluated by FFTs with fixed word length, one
must account for the roundoff noise incurred in FFT computations and the SNR
increases by about 6 dB for each additional bit of word length and decreases by
about 2 dB for every doubling of the convolution length. This shows that for
fixed word lengths, the computation by NTTs is less noisy than the computation by FFTs for long convolutions. For words of 12 bits, for instance, NTT filtering
gives a better SNR than FFT filtering for N greater than about 32.
This motivates one to use NTTs for computing long convolutions. However, for a given modulus q, the maximum number of distinct powers of ±2 is equal to 2⌈log2 q⌉, with a = ⌈log2 q⌉, where a is the smallest integer such that a ≥ log2 q. Thus, for NTTs computed without multiplications, there is a rigid relationship between word length and maximum convolution length, and long convolutions imply long word lengths, even if these long word lengths far exceed the desired accuracy.
One solution to this problem consists of simply computing the convolutions y_{1,n} and y_{2,n} of two consecutive blocks simultaneously. Assuming, for instance,
that h_n is a fixed input sequence of positive integers and that x_{1,n} and x_{2,n} are two consecutive positive integer sequences, the two length-N convolutions y_{1,n} and y_{2,n} can be computed in a single step by

(8.91)

with

With this method, the transform length is doubled for a given accuracy and there is no overflow, provided |y_{1,n}|, |y_{2,n}| < (√q)/2.
Another solution to computing long convolutions with NTTs using moderate word lengths is possible by mapping the one-dimensional convolution of length N into multidimensional convolutions using one of the methods discussed in Chap. 3. For instance, if N is the product of d distinct Mersenne numbers N_1, N_2, ..., N_d with N = N_1N_2 ... N_d, we can map the length-N convolution into a d-dimensional convolution of size N_1 × N_2 × ... × N_d by using the Agarwal-Cooley algorithm (Sect. 3.3.1). This is always possible because all Mersenne numbers are mutually prime (theorem 2.15). The nested convolutions are then calculated with Mersenne transforms defined modulo N_1, N_2, ..., N_d and the convolution product y_l is obtained without overflow provided that |y_l| < N_1/2, where N_1 is the smallest Mersenne number.
We have seen that Mersenne transforms defined modulo (2^p - 1), with p prime, and Fermat number transforms defined modulo (2^{2^t} + 1) can be used to compute circular convolutions. Both transforms are computed without multiplications but suffer serious limitations which relate mainly to the lack of an FFT-type algorithm for Mersenne transforms and to the problems associated with arithmetic modulo (2^{2^t} + 1) for FNTs. It would seem difficult to consider the use of any modulus other than a Mersenne or a Fermat number because of the problems associated with the corresponding arithmetic. In the following, however, we shall show that these difficulties can be circumvented by defining NTTs modulo integers q_1 which are factors of pseudo Mersenne numbers q, with q = 2^p - 1, p composite, or of pseudo Fermat numbers q, with q = 2^v + 1, v ≠ 2^t. In both cases, q is composite and can always be defined as the product of two factors
y_l ≡ Σ_{n=0}^{N-1} h_n x_{l-n} modulo q_2   (8.94)
With this method [8.15, 16], if q is a pseudo Mersenne number, all operations but the last reduction are done in one's complement arithmetic. The price to be paid for use of this approach is that all operations modulo (2^p - 1) must be executed on word lengths longer than that of the final result. However, the increase in the number of operations is very limited when q_1 is small and the penalty is more than offset by the fact that p need no longer be a prime.
We shall first consider the use of pseudo Mersenne transforms defined modulo q = 2^p - 1, with p composite. For p even, q = (2^{p/2} - 1)(2^{p/2} + 1), and the transform length cannot be longer than that which is possible for 2^{p/2} - 1. Thus, we need be concerned only with the cases corresponding to p odd. In order to specify the pseudo Mersenne transforms, we shall use the following theorem introduced by Erdelsky [8.4].
Theorem 8.7: Given a prime number p_1 and two integers u and g such that u ≥ 1, |g| ≥ 2, g ≢ 1 modulo p_1, the NTT of length N = p_1^u and of root g supports circular convolution. This NTT is defined modulo q_2 = (g^{p_1^u} - 1)/(g^{p_1^{u-1}} - 1).
In order to demonstrate this theorem, we must establish that the three conditions of theorem 8.4 are satisfied.

g^N ≡ 1 modulo q_2   (8.96)

(8.98)

The condition (8.96) follows immediately from the fact that g^N ≡ 1 modulo
(g^{p_1^u} - 1), with q_2 a factor of (g^{p_1^u} - 1). For the condition (8.97), we note that Fermat's theorem implies that

(8.100)

or

(8.101)

which implies that g^{p_1^{u-1}} - 1 is mutually prime with q_2 if [(g^{p_1^{u-1}} - 1), p_1] = 1. This condition is immediately established, because g^{p_1^{u-1}} ≡ g modulo p_1 and g ≢ 1 modulo p_1. Therefore (8.98) is proved and this completes the demonstration of the theorem.
We can now derive two classes of pseudo Mersenne transforms from the-
[Fig. 8.3. Computation of a circular convolution modulo (2^{p_1^u} - 1)/(2^{p_1^{u-1}} - 1) by pseudo Mersenne transforms defined modulo (2^{p_1^u} - 1), p_1 prime: pseudo Mersenne transforms of root 2 and length p_1, an inverse pseudo Mersenne transform of root 2 and length p_1, and a final reduction modulo (2^{p_1^u} - 1)/(2^{p_1^{u-1}} - 1)]
with similar definitions for the transform H̄_k of h_n and for the inverse transform.
Another class of pseudo Mersenne transforms is derived from theorem 8.7 by setting u = 1, g = 2^{p_2}, and p = p_1p_2. This gives pseudo Mersenne transforms of length N = p_1, p_1 prime, and defined modulo q_2, where q_2 = (2^{p_1p_2} - 1)/(2^{p_2} - 1).
With this pseudo transform, computation is executed modulo (2^{p_1p_2} - 1) on data word lengths of p_1p_2 bits and the final result is obtained modulo (2^{p_1p_2} - 1)/(2^{p_2} - 1) on words of approximately p_2(p_1 - 1) bits.
Table 8.1 lists the parameters for various pseudo Mersenne transforms
defined modulo (21' - 1), with P odd. The most interesting transforms are those
which have a composite number of terms and a useful word length as close as
Table 8.1. Parameters for various pseudo Mersenne transforms defined modulo (2^p - 1) and convolutions defined modulo q2, with q2 a factor of 2^p - 1

p    Factorization of 2^p - 1       q2                Length N   Root   Word length (bits)
15   7·31·151                       (2^15 - 1)/7      5          2^3    12
21   7^2·127·337                    (2^21 - 1)/49     7          2^3    15
25   31·601·1801                    (2^25 - 1)/31     25         2      20
27   7·73·262657                    (2^27 - 1)/511    27         2      18
35   31·71·127·122921               (2^35 - 1)/3937   35         2      23
35   31·71·127·122921               (2^35 - 1)/127    5          2^7    28
35   31·71·127·122921               (2^35 - 1)/31     7          2^5    30
45   7·31·73·151·631·23311          (2^45 - 1)/511    5          2^9    36
49   127·4432676798593              (2^49 - 1)/127    7          2^7    42
49   127·4432676798593              (2^49 - 1)/127    49         2      42
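The following sketch (ours, not code from the book) computes a length-5 circular convolution with the parameters of the first row of Table 8.1: all transform arithmetic is carried modulo q = 2^15 - 1 and only the final result is reduced modulo q2 = (2^15 - 1)/7, where it agrees with the convolution defined modulo q2. The helper name `transform` is ours; Python 3.8+ is assumed for the modular inverse.

```python
# Length-5 circular convolution via the pseudo Mersenne transform of root 2^3,
# with arithmetic modulo q = 2^15 - 1 and a final reduction modulo q2 = q/7.
N, g = 5, 2**3                 # length p1 = 5, root 2^p2 = 8
q = 2**15 - 1                  # arithmetic modulus (one's complement friendly)
q2 = q // 7                    # modulus of the final result
g_inv = pow(g, N - 1, q)       # 8^4, since 8^5 = 1 modulo q
N_inv = pow(N, -1, q)          # inverse of N modulo q

def transform(a, root):
    return [sum(a[m] * pow(root, m * k, q) for m in range(N)) % q
            for k in range(N)]

x, h = [3, 1, 4, 1, 5], [2, 7, 1, 8, 2]
Y = [(a * b) % q for a, b in zip(transform(x, g), transform(h, g))]
y = [(N_inv * s) % q % q2 for s in transform(Y, g_inv)]   # final reduction mod q2

y_ref = [sum(h[n] * x[(l - n) % N] for n in range(N)) % q2 for l in range(N)]
assert y == y_ref
```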
$g^N \equiv 1$ modulo $q_2$   (8.105)

$N N^{-1} \equiv 1$ modulo $q_2$   (8.106)

Since $g^{v_1^u} \equiv -1$ modulo $(g^{v_1^u} + 1)$, we have $g^N \equiv 1$ modulo $(g^{v_1^u} + 1)$ and therefore $g^N \equiv 1$ modulo $q_2$, because $q_2$ is a factor of $g^{v_1^u} + 1$. To establish the condition (8.106), we note that Fermat's theorem implies that

(8.108)

which implies that $q_2$ is odd, since $v_1$ is odd. Thus we have $(N, q_2) = 1$ and N has an inverse modulo $q_2$.

In order to establish that the condition $[(g^{v_1^u} - 1), q_2] = 1$ corresponding to (8.107) is met, we note that

$g^{v_1^u} - 1 \equiv -2$ modulo $q_2$   (8.110)

which implies that $[(g^{v_1^u} - 1), q_2] = (q_2, 2)$ and, since $q_2$ is odd, this condition is satisfied.

In order to establish that the condition $[(g^{2v_1^{u-1}} - 1), q_2] = 1$ is satisfied, we note that, since $g^{2v_1^{u-1}} - 1 = (g^{v_1^{u-1}} + 1)(g^{v_1^{u-1}} - 1)$, this condition corresponds to $[(g^{v_1^{u-1}} + 1), q_2] = 1$ and $[(g^{v_1^{u-1}} - 1), q_2] = 1$. The condition (8.110) implies that $[(g^{v_1^{u-1}} - 1), q_2] = 1$, since $(g^{v_1^{u-1}} - 1)$ is a factor of $(g^{v_1^u} - 1)$. We note also that (8.109) implies

$q_2 \equiv v_1$ modulo $(g^{v_1^{u-1}} + 1)$   (8.112)

and, since $g \not\equiv -1$ modulo $v_1$ and $g^{v_1^{u-1}} \equiv g$ modulo $v_1$, we have $[(g^{v_1^{u-1}} + 1), q_2] = 1$, which completes the proof of the theorem.
An immediate application of theorem 8.8 is that, if g = 2 and u = 1, we can define for v1 ≠ 3 an NTT of length N = 2 v1 which supports circular convolution. This NTT is defined modulo q2, with q2 = (2^(v1) + 1)/3.
Similarly, for u = 2 and g = 2, we have an NTT of length N = 2 v1^2. This NTT has the circular convolution property and is defined modulo q2, with $q_2 = (2^{v_1^2} + 1)/(2^{v_1} + 1)$.
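These parameters can be verified numerically; a small sketch (ours, not from the book) for g = 2, u = 1, v1 = 5, i.e., N = 10 and q2 = (2^5 + 1)/3 = 11:

```python
# Numerical check (a sketch, not from the book) of the pseudo Fermat number
# transform obtained for g = 2, u = 1, v1 = 5.
from math import gcd

v1, g = 5, 2
N = 2 * v1
q2 = (g**v1 + 1) // (g + 1)        # (2^5 + 1)/3 = 11

assert pow(g, N, q2) == 1          # (8.105): g^N = 1 modulo q2
assert gcd(N, q2) == 1             # (8.106): N has an inverse modulo q2
assert gcd(g**v1 - 1, q2) == 1     # (8.107): (g^(N/2) - 1, q2) = 1
assert gcd(g**2 - 1, q2) == 1      # (g^(N/v1) - 1, q2) = 1
```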
A systematic application of theorems 8.7 and 8.8 yields a large number of
pseudo Fermat number transforms. We summarize the main characteristics of
some of these transforms for V even and V odd, respectively, in Tables 8.2 and 8.3.
It can be seen that there is a large choice of pseudo Fermat number transforms
having a composite number of terms. This allows one to select word lengths that
are more closely tailored to meet the needs of particular applications than when
word lengths are limited to powers of two, as with FNTs.
The same pseudo transform technique can also be applied to moduli other than 2^p - 1 and 2^v + 1, and the cases corresponding to moduli 2^(2q) - 2^q + 1 are discussed in [8.17]. These moduli are factors of 2^(6q) - 1, and NTTs of dimension 6q and root 2 can be defined modulo some factors of 2^(2q) - 2^q + 1.

Table 8.2. Parameters for various pseudo Fermat number transforms defined modulo (2^v + 1) and convolutions defined modulo q2, with q2 a factor of 2^v + 1, v even

Table 8.3. Parameters for various pseudo Fermat number transforms defined modulo (2^v + 1) and convolutions defined modulo q2, with q2 a factor of 2^v + 1, v odd
When a convolution is calculated by pseudo transforms, the computation is
performed modulo q and the final result is obtained modulo q2, with q = q1 q2.
Since q2 < q, it is possible to detect overflow conditions by simply comparing
the result of the calculations modulo q with the convolution product defined
modulo q2 [8.18].
8.6 Complex NTTs

We consider a complex integer $x + j\hat{x}$, where $x$ and $\hat{x}$ are defined in the field GF(q) of the integers defined modulo a prime q. Thus, $x$ and $\hat{x}$ can take any integer value between 0 and q - 1. We also assume that $j = \sqrt{-1}$ is not a member of GF(q), which means that -1 is a quadratic nonresidue modulo q. Then, the Gaussian integers $x + j\hat{x}$ are similar to ordinary complex numbers, with real and imaginary parts treated separately and addition and multiplication defined by

$(x_1 + j\hat{x}_1) + (x_2 + j\hat{x}_2) \equiv (x_1 + x_2) + j(\hat{x}_1 + \hat{x}_2)$ modulo $q$   (8.113)

$(x_1 + j\hat{x}_1)(x_2 + j\hat{x}_2) \equiv (x_1 x_2 - \hat{x}_1 \hat{x}_2) + j(x_1 \hat{x}_2 + \hat{x}_1 x_2)$ modulo $q$.   (8.114)
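A small sketch (ours, not from the book) of this Gaussian integer arithmetic, with each element stored as a pair of residues modulo q; the helper names cadd and cmul are ours:

```python
# Gaussian integer arithmetic modulo q, real and imaginary parts kept
# separately, following (8.113) and (8.114).
q = 2**7 - 1    # q = 127, a Mersenne prime with q = 3 modulo 4

def cadd(a, b):
    """(a0 + j*a1) + (b0 + j*b1) modulo q."""
    return ((a[0] + b[0]) % q, (a[1] + b[1]) % q)

def cmul(a, b):
    """(a0 + j*a1) * (b0 + j*b1) modulo q."""
    return ((a[0] * b[0] - a[1] * b[1]) % q, (a[0] * b[1] + a[1] * b[0]) % q)

# j*j = -1 modulo q; -1 is a quadratic nonresidue in GF(127) since 127 = 3 mod 4
assert cmul((0, 1), (0, 1)) == (q - 1, 0)
```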
Since each integer $x$ and $\hat{x}$ can take only q distinct values, the Gaussian integers $x + j\hat{x}$ can take only $q^2$ distinct values and are said to pertain to the extension field GF($q^2$). Since $x + j\hat{x}$ pertains to a finite field, the successive powers of $x + j\hat{x}$ given by $(x + j\hat{x})^n$, n = 0, 1, 2, ..., yield a sequence which is reproduced with
$(x + j\hat{x})^q = \sum_{i=0}^{q} C_i^q x^{q-i} (j\hat{x})^i$   (8.115)

with

$C_i^q = \frac{q!}{i!\,(q-i)!}$.   (8.116)

Since these coefficients are integers, $i!\,(q-i)!$ must divide $q!$. However, $i!\,(q-i)!$ is relatively prime to $q$ because $q$ is a prime. Therefore $i!\,(q-i)!$ divides $(q-1)!$ for $i \neq 0, q$, so that all coefficients $C_i^q$ with $i \neq 0, q$ are multiples of $q$, and (8.115) reduces to

$(x + j\hat{x})^q \equiv x^q + j^q \hat{x}^q$ modulo $q$.   (8.117)
This implies that, when $q^2 \equiv 1$ modulo 4, any root of order N in GF($q^2$) must satisfy the condition

$q \equiv 3$ modulo 4.   (8.123)
This condition is established easily for Mersenne and pseudo Mersenne transforms defined modulo (2^p - 1), because, in this case,

$q = 2^p - 1 \equiv -1 \equiv 3$ modulo 4.   (8.124)
We can, for instance, choose the roots

$g_1 = 2j, \quad g_2 = 1 + j$.   (8.125)

Since p is an odd prime, we have $g_1^{4p} \equiv 1$ modulo q and $g_2^{8p} \equiv 1$ modulo q, with $g_1^n$ and $g_2^n$ taking, respectively, 4p and 8p distinct values for n = 0, ..., 4p - 1 and n = 0, ..., 8p - 1. Thus, we can define complex Mersenne transforms [8.12, 15] of length 4p and 8p which support circular convolution by

$\bar{X}_k \equiv \sum_{m=0}^{4p-1} x_m (2j)^{mk}$ modulo $(2^p - 1)$   (8.127)

$\bar{X}_k \equiv \sum_{m=0}^{8p-1} x_m (1 + j)^{mk}$ modulo $(2^p - 1)$.   (8.128)
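The orders of these roots can be checked directly in GF($q^2$); a numerical sketch (ours, not from the book) for p = 7, q = 2^7 - 1 = 127, with the helper names cmul and order being ours:

```python
# Order check for the complex Mersenne transform roots 2j and 1 + j in GF(q^2):
# the root 2j of (8.127) has order 4p and the root 1 + j of (8.128) has order 8p.
q, p = 2**7 - 1, 7

def cmul(a, b):   # Gaussian integer product modulo q, as in (8.114)
    return ((a[0] * b[0] - a[1] * b[1]) % q, (a[0] * b[1] + a[1] * b[0]) % q)

def order(g):
    z, k = g, 1
    while z != (1, 0):
        z, k = cmul(z, g), k + 1
    return k

assert order((0, 2)) == 4 * p     # 2j has order 28
assert order((1, 1)) == 8 * p     # 1 + j has order 56
```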
(8.129)
g = a + jb (8.130)
with
370/155) were about 2 to 4 times shorter than with an efficient FFT program. In
this comparison, the convolutions of lengths above 128 are computed by two-
dimensional FNTs.
An interesting aspect of number theoretic transforms is their analogy with discrete Fourier transforms. NTTs are defined with roots of unity g of order N modulo an integer q, while DFTs are defined with complex roots of unity W of order N in the field of complex numbers. Hence NTTs can be viewed as DFTs defined in the ring of numbers modulo q. In fact, NTTs can also be considered as particular cases of polynomial transforms in which the N-bit words are viewed as polynomials. This is particularly apparent for polynomial transforms of length $2^{t+1}$ defined modulo $(z^{2^t} + 1)$. Such transforms compute a circular convolution of length $2^{t+1}$ on polynomials of length $2^t$. If the $2^{t+1}$ input polynomials are defined as words of $2^t$ bits, the polynomial transform reduces to an FNT of length $2^{t+1}$, of root 2 and defined modulo $(2^{2^t} + 1)$. Thus, polynomial transforms and NTTs are DFTs defined in finite fields and rings of polynomials or integers. Their main advantage over DFTs is that systematic advantage is taken of the operation in finite fields or rings to define simple roots of unity which allow one to eliminate the multiplications for transform computation and to replace complex arithmetic by real arithmetic.
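As a final illustration of this point (our sketch, not an example from the book), a length-8 FNT modulo the Fermat number 2^4 + 1 = 17 with root 2 has twiddle factors that are all plus or minus powers of two modulo 17, so a fixed-point realization needs only shifts and additions; the convolution property is verified here in ordinary integer arithmetic, and the helper name fnt is ours.

```python
# Length-8 Fermat number transform modulo F2 = 2^4 + 1 = 17 with root 2.
q, N, g = 2**4 + 1, 8, 2
g_inv, N_inv = pow(g, -1, q), pow(N, -1, q)

def fnt(a, root):
    return [sum(a[m] * pow(root, m * k, q) for m in range(N)) % q for k in range(N)]

x, h = [1, 2, 0, 1, 3, 0, 0, 2], [1, 1, 1, 0, 0, 0, 0, 0]
Y = [(a * b) % q for a, b in zip(fnt(x, g), fnt(h, g))]
y = [(N_inv * s) % q for s in fnt(Y, g_inv)]
assert y == [sum(h[n] * x[(l - n) % N] for n in range(N)) % q for l in range(N)]

# All twiddle factors are +-(powers of two) modulo 17:
assert [pow(g, k, q) for k in range(N)] == [1, 2, 4, 8, 16, 15, 13, 9]
```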
References
Chapter 2
2.1 T. Nagell: Introduction to Number Theory, 2nd ed. (Chelsea, New York 1964)
2.2 G. H. Hardy, E. M. Wright: An Introduction to the Theory of Numbers, 4th ed. (Oxford
University Press, Ely House, London 1960)
2.3 N. H. McCoy: The Theory of Numbers (MacMillan, New York 1965)
2.4 J. H. McClellan, C. M. Rader: Number Theory in Digital Signal Processing (Prentice-
Hall, Englewood Cliffs, N. J. 1979)
2.5 M. Abramowitz, I. Stegun: Handbook of Mathematical Functions, 7th ed. (Dover, New
York 1970) pp. 864-869
2.6 W. Sierpinski: Elementary Theory of Numbers (Polska Akademia Nauk Monographie
Matematyczne, Warszawa 1964)
2.7 I. M. Vinogradov: Elements of Number Theory, (Dover, New York 1954)
2.8 D. J. Winter: The Structure of Fields, Graduate Texts in Mathematics, Vol. 16 (Springer,
Berlin, New York, Heidelberg 1974)
2.9 R. C. Agarwal, J. W. Cooley: New algorithms for digital convolution. IEEE Trans.
ASSP-25, 392-410 (1977)
2.10 J. H. Griesmer, R. D. Jenks: "SCRATCHPAD I. An Interactive Facility for Symbolic
Mathematics", in Proc. Second Symposium on Symbolic and Algebraic Manipulation,
ACM, New York, 42-58 (1971)
2.11 S. Winograd: On computing the discrete Fourier transform. Math. Comput. 32, 175-199
(1978)
2.12 S. Winograd: Some bilinear forms whose multiplicative complexity depends on the field
of constants. Math. Syst. Th., 10, 169-180 (1977)
Chapter 3
3.1 T. G. Stockham: "Highspeed Convolution and Correlation", in 1966 Spring Joint Com-
puter Conf., AFIPS Proc. 28, 229-233
3.2 B. Gold, C. M. Rader, A. V. Oppenheim, T. G. Stockham: Digital Processing of Signals,
(McGraw-Hill, New York 1969) pp. 203-213
3.3 R. C. Agarwal, J. W. Cooley: "New Algorithms for Digital Convolution", in 1977 Intern.
Conf., Acoust., Speech, Signal Processing Proc., p. 360
3.4 I. J. Good: The relationship between two fast Fourier transforms. IEEE Trans. C-20,
310-317 (1971)
3.5 R. C. Agarwal, J. W. Cooley: New algorithms for digital convolution. IEEE Trans. ASSP-25,
392-410 (1977)
3.6 H. J. Nussbaumer: "New Algorithms for Convolution and DFT Based on Polynomial
Transforms", in IEEE 1978 Intern. Conf. Acoust., Speech, Signal Processing Proc., pp.
638-641
3.7 H. J. Nussbaumer, P. Quandalle: Computation of convolutions and discrete Fourier
transforms by polynomial transforms. IBM J. Res. Dev., 22, 134-144 (1978)
3.8 R. C. Agarwal, C. S. Burrus: Fast one-dimensional digital convolution by multidimen-
sional techniques. IEEE Trans. ASSP-22, 1-10 (1974)
3.9 H. J. Nussbaumer: Fast polynomial transform algorithms for digital convolution. IEEE
Trans. ASSP-28, 205-215 (1980)
3.10 A. Croisier, D. J. Esteban, M. E. Levilion, V. Riso: Digital Filter for PCM Encoded
Signals, US Patent 3777130, Dec. 4, 1973
3.11 C. S. Burrus: Digital filter structures described by distributed arithmetic. IEEE Trans.
CAS-24, 674-680 (1977)
3.12 D. E. Knuth: The Art of Computer Programming, Vol. 2, Semi-Numerical Algorithms
(Addison-Wesley, New York 1969)
Chapter 4
4.1 B. Gold, C. M. Rader: Digital Processing of Signals (McGraw-Hill, New York 1969)
4.2 E. O. Brigham: The Fast Fourier Transform (Prentice-Hall, Englewood Cliffs, N. J. 1974)
4.3 L. R. Rabiner, B. Gold: Theory and Application of Digital Signal Processing (Prentice-
Hall, Englewood Cliffs, N. J. 1975)
4.4 A. V. Oppenheim, R. W. Schafer: Digital Signal Processing (Prentice-Hall, Englewood
Cliffs, N. J. 1975)
4.5 A. E. Siegman: How to compute two complex even Fourier transforms with one trans-
form step. Proc. IEEE 63, 544 (1975)
4.6 J. W. Cooley, J. W. Tukey: An algorithm for machine computation of complex Fourier
series. Math. Comput. 19,297-301 (1965)
4.7 G. D. Bergland: A fast Fourier transform algorithm using base 8 iterations. Math. Com-
put. 22, 275-279 (1968)
4.8 R. C. Singleton: An algorithm for computing the mixed radix fast Fourier transform.
IEEE Trans. AU-17, 93-103 (1969)
4.9 R. P. Polivka, S. Pakin: APL: the Language and Its Usage (Prentice-Hall, Englewood
Cliffs, N. J. 1975)
4.10 P. D. Welch: A fixed-point fast Fourier transform error analysis. IEEE Trans. AU-17,
151-157 (1969)
4.11 T. K. Kaneko, B. Liu: Accumulation of round-off errors in fast Fourier transforms. J.
Assoc. Comput. Mach. 17, 637-654 (1970)
4.12 C. J. Weinstein: Roundoff noise in floating point fast Fourier transform computation.
IEEE Trans. AU-17, 209-215 (1969)
4.13 C. M. Rader, N. M. Brenner: A new principle for fast Fourier transformation. IEEE
Trans. ASSP-24, 264-265 (1976)
4.14 S. Winograd: On computing the discrete Fourier transform. Math. Comput. 32, 175-199
(1978)
4.15 K. M. Cho, G. C. Temes: "Real-factor FFT algorithms", in IEEE 1978 Intern. Conf.
Acoust., Speech, Signal Processing, pp. 634-637
4.16 H. J. Nussbaumer, P. Quandalle: Fast computation of discrete Fourier transforms using
polynomial transforms. IEEE Trans. ASSP-27, 169-181 (1979)
4.17 G. Bonnerot, M. Bellanger: Odd-time odd-frequency discrete Fourier transform for sym-
metric real-valued series. Proc. IEEE 64,392-393 (1976)
4.18 G. Bruun: z-transform DFT filters and FFTs. IEEE Trans. ASSP-26, 56-63 (1978)
4.19 G. K. McAuliffe: "Fourier Digital Filter or Equalizer and Method of Operation There-
fore", US Patent No.3 679 882, July 25, 1972
Chapter 5
5.1 L. I. Bluestein: "A Linear Filtering Approach to the Computation of the Discrete Fourier
Transform", in 1968 Northeast Electronics Research and Engineering Meeting Rec., pp.
218-219
5.2 L. I. Bluestein: A linear filtering approach to the computation of the discrete Fourier
transform. IEEE Trans. AU-18, 451-455 (1970)
5.3 C. M. Rader: Discrete Fourier transforms when the number of data samples is prime.
Proc. IEEE 56, 1107-1108 (1968)
5.4 S. Winograd: On computing the discrete Fourier transform. Proc. Nat. Acad. Sci. USA
73, 1005-1006 (1976)
5.5 L. R. Rabiner, R. W. Schafer, C. M. Rader: The Chirp z-transform algorithm and its
application. Bell Syst. Tech. J. 48, 1249-1292 (1969)
5.6 G. R. Nudd, O. W. Otto: Real-time Fourier analysis of spread spectrum signals
using surface-wave-implemented chirp-z transformation. IEEE Trans. MTT-24, 54-56
(1975)
5.7 M. J. Narasimha, K. Shenoi, A. M. Peterson: "Quadratic Residues: Application to Chirp
Filters and Discrete Fourier Transforms", in IEEE 1976 Acoust., Speech, Signal Pro-
cessing Proc., pp. 376-378
5.8 M. J. Narasimha: "Techniques in Digital Signal Processing", Tech. Rpt. 3208-3, Stanford
Electronics Laboratory, Stanford University (1975)
5.9 J. H. McClellan, C. M. Rader: Number Theory in Digital Signal Processing (Prentice-Hall,
Englewood Cliffs, N. J. 1979)
5.10 H. J. Nussbaumer, P. Quandalle: Fast computation of discrete Fourier transforms using
polynomial transforms. IEEE Trans. ASSP-27, 169-181 (1979)
5.11 I. J. Good: The interaction algorithm and practical Fourier analysis. J. Roy. Stat. Soc.
B-20, 361-372 (1958); 22,372-375 (1960)
5.12 I. J. Good: The relationship between two fast Fourier transforms. IEEE Trans. C-20,
310-317 (1971)
5.13 D. P. Kolba, T. W. Parks: A prime factor FFT algorithm using high-speed convolution.
IEEE Trans. ASSP-25, 90-103 (1977)
5.14 C. S. Burrus: "Index Mappings for Multidimensional Formulation of the DFT and Con-
volution", in 1977 IEEE Intern. Symp. on Circuits and Systems Proc., pp. 662-664
5.15 S. Winograd: "A New Method for Computing DFT", in 1977 IEEE Intern. Conf.
Acoust., Speech and Signal Processing Proc., pp. 366-368
5.16 S. Winograd: On computing the discrete Fourier transform. Math. Comput. 32, 175-
199 (1978)
5.17 H. F. Silverman: An introduction to programming the Winograd Fourier transform
algorithm (WFTA). IEEE Trans. ASSP-25, 152-165 (1977)
5.18 R. W. Patterson, J. H. McClellan: Fixed-point error analysis of Winograd Fourier trans-
form algorithms. IEEE Trans. ASSP-26, 447-455 (1978)
5.19 L. R. Morris: A comparative study of time efficient FFT and WFTA programs for general
purpose computers. IEEE Trans. ASSP-26, 141-150 (1978)
Chapter 6
6.1 H. J. Nussbaumer: Digital filtering using polynomial transforms. Electron. Lett. 13, 386-
387 (1977)
6.2 H. J. Nussbaumer, P. Quandalle: Computation of convolutions and discrete Fourier
transforms by polynomial transforms. IBM J. Res. Dev. 22, 134-144 (1978)
6.3 P. Quandalle: "Filtrage numérique rapide par transformées de Fourier et transformées polynomiales - Étude de l'implantation sur microprocesseurs", Thèse de Doctorat de Spécialité, University of Nice, France (18 May 1979)
6.4 R. C. Agarwal, J. W. Cooley: New algorithms for digital convolution. IEEE Trans. ASSP-
25, 392-410 (1977)
6.5 B. Arambepola, P. J. W. Rayner: Efficient transforms for multidimensional convolutions.
Electron. Lett. 15, 189-190 (1979)
Chapter 7
7.1 H. J. Nussbaumer, P. Quandalle: Fast computation of discrete Fourier transforms using
polynomial transforms. IEEE Trans. ASSP-27, 169-181 (1979)
7.2 H. J. Nussbaumer, P. Quandalle: "New Polynomial Transform Algorithms for Fast DFT
Computation", in IEEE 1979 Intern. Acoustics, Speech and Signal Processing Conf. Proc.,
pp.510-513
7.3 C. M. Rader: Discrete Fourier transforms when the number of data samples is prime.
Proc. IEEE 56, 1107-1108 (1968)
7.4 G. Bonnerot, M. Bellanger: Odd-time odd-frequency discrete Fourier transform for sym-
metric real-valued series. Proc. IEEE 64, 392-393 (1976)
7.5 C. M. Rader, N. M. Brenner: A new principle for fast Fourier transformation. IEEE
Trans. ASSP-24, 264-266 (1976)
7.6 I. J. Good: The relationship between two fast Fourier transforms. IEEE Trans. C-20,
310-317 (1971)
7.7 S. Winograd: On computing the discrete Fourier transform. Math. Comput. 32, 175-199
(1978)
7.8 H. J. Nussbaumer: DFT computation by fast polynomial transform algorithms. Electron.
Lett. 15,701-702 (1979)
7.9 H. J. Nussbaumer, P. Quandalle: Computation of convolutions and discrete Fourier
transforms by polynomial transforms. IBM J. Res. Dev. 22, 134-144 (1978)
7.10 R. C. Agarwal, J. W. Cooley: New algorithms for digital convolution. IEEE Trans. ASSP-
25, 392-410 (1977)
Chapter 8
8.1 I. J. Good: The relationship between two fast Fourier transforms. IEEE Trans. C-20,
310-317 (1971)
8.2 J. M. Pollard: The fast Fourier transform in a finite field. Math. Comput. 25, 365-374
(1971)
8.3 P. J. Nicholson: Algebraic theory of finite Fourier transforms. J. Comput. Syst. Sci. 5,
524-547 (1971)
8.4 P. J. Erdelsky: "Exact convolutions by number-theoretic transforms"; Rept. No. AD-
AOl3 395, San Diego, Calif. Naval Undersea Center (1975)
8.5 C. M. Rader: Discrete convolutions via Mersenne transforms. IEEE Trans. C-21, 1269-
1273 (1972)
8.6 R. C. Agarwal, C. S. Burrus: Fast convolution using Fermat number transforms with
applications to digital filtering. IEEE Trans. ASSP-22, 87-97 (1974)
8.7 R. C. Agarwal, C. S. Burrus: Number theoretic transforms to implement fast digital con-
volution. Proc. IEEE 63, 550-560 (1975)
8.8 L. M. Leibowitz: A simplified binary arithmetic for the Fermat number transform. IEEE
Trans. ASSP-24, 356-359 (1976)
8.9 J. H. McClellan: Hardware realization of a Fermat number transform. IEEE Trans.
ASSP-24, 216-225 (1976)
8.10 H. J. Nussbaumer: Linear filtering technique for computing Mersenne and Fermat num-
ber transforms. IBM J. Res. Dev. 21, 334-339 (1977)
8.11 H. J. Nussbaumer: Complex convolutions via Fermat number transforms. IBM J. Res.
Dev. 20, 282-284 (1976)
8.12 E. Vegh, L. M. Leibowitz: Fast complex convolutions in finite rings. IEEE Trans. ASSP-
24, 343-344 (1976)
8.13 L. B. Jackson: On the interaction of round-off noise and dynamic range in digital filters.
Bell Syst. Tech. J. 49, 159-184 (1970)
8.14 P. R. Chevillat, F. H. Closs: "Signal processing with number theoretic transforms and
limited word lengths", in IEEE 1978 Intern. Acoustics, Speech and Signal Processing
Conf. Proc., pp. 619-623
8.15 H. J. Nussbaumer: Digital filtering using complex Mersenne transforms. IBM J. Res.
Dev. 20, 498-504 (1976)
8.16 H. J. Nussbaumer: Digital filtering using pseudo Fermat number transforms. IEEE
Trans. ASSP-26, 79-83 (1977)
8.17 E. Dubois, A. N. Venetsanopoulos: "Number theoretic transforms with modulus 2^(2q) - 2^q + 1", in IEEE 1978 Intern. Acoustics, Speech and Signal Processing Conf. Proc., pp. 624-627
Subject Index

Decimation
  in frequency 89, 187
  in time 87
Diophantine equations 6, 8, 237
Discrete Fourier transform (DFT) 80, 112, 181
Distributed arithmetic 64
Division
  integers 4
  polynomials 25
Mersenne
  number 19, 216, 230
  prime 20
  transform 216
Modulus 7
Multidimensional
  convolution 45, 108, 151
  DFT 102, 141, 193
  polynomial transform 178
Mutually prime 5, 20, 21, 26, 43, 170, 201