Springer Series in Information Sciences 2
Editor: T. S. Huang

Springer Series in Information Sciences
Editors: King Sun Fu, Thomas S. Huang, Manfred R. Schroeder

Volume 1  Content-Addressable Memories
          By T. Kohonen
Volume 2  Fast Fourier Transform and Convolution Algorithms
          By H. J. Nussbaumer
Volume 3  Algorithms and Devices for Pitch Determination of Speech Signals
          By W. Hess
Volume 4  Pattern Analysis
          By H. Niemann

Henri J. Nussbaumer

Fast Fourier Transform and Convolution Algorithms

With 34 Figures

Springer-Verlag Berlin Heidelberg New York 1981


Dr. Henri J. Nussbaumer
IBM Centre d'Etudes et Recherches
F-06610 La Gaude, Alpes-Maritimes, France

Series Editors:

Professor King Sun Fu


School of Electrical Engineering, Purdue University
West Lafayette, IN 47907, USA

Professor Thomas S. Huang


Department of Electrical Engineering and Coordinated Science Laboratory,
University of Illinois, Urbana IL 61801, USA

Professor Dr. Manfred R. Schroeder


Drittes Physikalisches Institut, Universität Göttingen, Bürgerstraße 42-44,
D-3400 Göttingen, Fed. Rep. of Germany

Library of Congress Cataloging in Publication Data. Nussbaumer, Henri J. 1931- . Fast Fourier transform and
convolution algorithms. (Springer series in information sciences; v. 2). Bibliography: p. Includes index. 1. Fourier
transformations-Data processing. 2. Convolutions (Mathematics)-Data processing. 3. Digital filters (Mathema-
tics) I. Title. II. Series. QA403.5.N87 515.7'23 80-18096
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned,
specifically those of translation, reprinting, reuse of illustrations, broadcasting, reproduction by photocopying
machine or similar means, and storage in data banks. Under § 54 of the German Copyright Law where copies are
made for other than private use, a fee is payable to "Verwertungsgesellschaft Wort", Munich.

ISBN 978-3-662-00553-8 ISBN 978-3-662-00551-4 (eBook)


DOI 10.1007/978-3-662-00551-4
© by Springer-Verlag Berlin Heidelberg 1981
Softcover reprint of the hardcover 1st edition 1981
The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific
statement, that such names are exempt from the relevant protective laws and regulations and therefore free for
general use.
Offset printing and bookbinding: Brühlsche Universitätsdruckerei, Giessen
2153/3130-543210
Preface

This book presents in a unified way the various fast algorithms that are used
for the implementation of digital filters and the evaluation of discrete Fourier
transforms.
The book consists of eight chapters. The first two chapters are devoted to
background information and to introductory material on number theory and
polynomial algebra. This section is limited to the basic concepts as they apply
to other parts of the book. Thus, we have restricted our discussion of number
theory to congruences, primitive roots, quadratic residues, and to the
properties of Mersenne and Fermat numbers. The section on polynomial
algebra deals primarily with the divisibility and congruence properties of
polynomials and with algebraic computational complexity.
The rest of the book is focused directly on fast digital filtering and
discrete Fourier transform algorithms. We have attempted to present these
techniques in a unified way by using polynomial algebra as extensively as
possible. This objective has led us to reformulate many of the algorithms which
are discussed in the book. It has been our experience that such a presentation
serves to clarify the relationship between the algorithms and often provides
clues to improved computation techniques.
Chapter 3 reviews the fast digital filtering algorithms, with emphasis on
algebraic methods and on the evaluation of one-dimensional circular
convolutions.
Chapters 4 and 5 present the fast Fourier transform and the Winograd
Fourier transform algorithm.
We introduce in Chaps. 6 and 7 the concept of polynomial transforms and
we show that these transforms are an important tool for the understanding of
the structure of multidimensional convolutions and discrete Fourier trans-
forms and for the design of improved algorithms. In Chap. 8, we extend these
concepts to the computation of one-dimensional convolutions by replacing
finite fields of polynomials by finite fields of numbers. This facilitates intro-
duction of number theoretic transforms which are useful for the fast com-
putation of convolutions via modular arithmetic.
Convolutions and discrete Fourier transforms have many uses in physics
and it is our hope that this book will prompt some additional research in
these areas and will help potential users to evaluate and apply these techniques.
We also feel that some of the methods presented here are quite general and
might someday find new unexpected applications.

Part of the material presented here has evolved from a graduate-level
course taught at the University of Nice, France. I would like to express my
thanks to Dr. T.A. Kriz from IBM FSD for kindly reviewing the manuscript
and for making many useful suggestions. I am grateful to Mr. P. Bellot, IBM,
C.E.R., La Gaude, France, for his advice concerning the introductory chapter
on number theory and polynomial algebra, and to Dr. J. W. Cooley, from IBM
Research, Yorktown Heights, for his comments on some of the work which
led to this book. Thanks are also due to Dr. P. Quandalle who worked with
me on polynomial transforms while preparing his doctorate degree and with
whom I had many fruitful discussions. I am indebted to Mrs. C. De Backer for
her aid in improving the English and to Mrs. C. Chevalier who prepared the
manuscript.

La Gaude HENRI J. NUSSBAUMER


November 1980
Contents

Chapter 1 Introduction
1.1 Introductory Remarks. 1
1.2 Notations. . . . . . 2
1.3 The Structure of the Book. 3

Chapter 2 Elements of Number Theory and Polynomial Algebra


2.1 Elementary Number Theory. . . 4
2.1.1 Divisibility of Integers. . . 4
2.1.2 Congruences and Residues. 7
2.1.3 Primitive Roots. . . . . . 11
2.1.4 Quadratic Residues. . . . 17
2.1.5 Mersenne and Fermat Numbers 19
2.2 Polynomial Algebra. . . 22
2.2.1 Groups. . . . . . 23
2.2.2 Rings and Fields. . 24
2.2.3 Residue Polynomials 25
2.2.4 Convolution and Polynomial Product Algorithms
in Polynomial Algebra. . . . . . . . . . . . . 27

Chapter 3 Fast Convolution Algorithms


3.1 Digital Filtering Using Cyclic Convolutions 32
3.1.1 Overlap-Add Algorithm. . . . . . 33
3.1.2 Overlap-Save Algorithm. . . . . . 34
3.2 Computation of Short Convolutions and Polynomial Products 34
3.2.1 Computation of Short Convolutions by the
Chinese Remainder Theorem. . . . . . . . . 35
3.2.2 Multiplications Modulo Cyclotomic Polynomials 37
3.2.3 Matrix Exchange Algorithm . . . . . . . . . 40
3.3 Computation of Large Convolutions by Nesting of Small
Convolutions. . . . . . . . . . . . 43
3.3.1 The Agarwal-Cooley Algorithm. 43
3.3.2 The Split Nesting Algorithm. . 47
3.3.3 Complex Convolutions . . . . 52
3.3.4 Optimum Block Length for Digital Filters 55
3.4 Digital Filtering by Multidimensional Techniques. 56
3.5 Computation of Convolutions by Recursive Nesting of Polynomials 60
3.6 Distributed Arithmetic . . . . . . . . . . . . . . . . . . . 64

3.7 Short Convolution and Polynomial Product Algorithms 66


3.7.1 Short Circular Convolution Algorithms . 66
3.7.2 Short Polynomial Product Algorithms . . 73
3.7.3 Short Aperiodic Convolution Algorithms 78

Chapter 4 The Fast Fourier Transform


4.1 The Discrete Fourier Transform 80
4.1.1 Properties of the DFT. . . 81
4.1.2 DFTs of Real Sequences. . 83
4.1.3 DFTs of Odd and Even Sequences 84
4.2 The Fast Fourier Transform Algorithm 85
4.2.1 The Radix-2 FFT Algorithm. . . 87
4.2.2 The Radix-4 FFT Algorithm. . . 91
4.2.3 Implementation of FFT Algorithms. 94
4.2.4 Quantization Effects in the FFT 96
4.3 The Rader-Brenner FFT. 99
4.4 Multidimensional FFTs . . . . . . 102
4.5 The Bruun Algorithm. . . . . . . 104
4.6 FFT Computation of Convolutions . 107

Chapter 5 Linear Filtering Computation of Discrete Fourier Transforms


5.1 The Chirp z-Transform Algorithm. . . . . . . . . . . . 112
5.1.1 Real Time Computation of Convolutions and DFTs
Using the Chirp z-Transform. . . . . . . . . . 113
5.1.2 Recursive Computation of the Chirp z-Transform. 114
5.1.3 Factorizations in the Chirp Filter. 115
5.2 Rader's Algorithm . . . . . . . . . . . . . . . 116
5.2.1 Composite Algorithms. . . . . . . . . . . 118
5.2.2 Polynomial Formulation of Rader's Algorithm 120
5.2.3 Short DFT Algorithms . . . . . . . . . . 123
5.3 The Prime Factor FFT . . . . . . . . . . . . . 125
5.3.1 Multidimensional Mapping of One-Dimensional DFTs. 125
5.3.2 The Prime Factor Algorithm. . . . . . . . . 127
5.3.3 The Split Prime Factor Algorithm. . . . . . . 129
5.4 The Winograd Fourier Transform Algorithm (WFTA). 133
5.4.1 Derivation of the Algorithm 133
5.4.2 Hybrid Algorithms . . . 138
5.4.3 Split Nesting Algorithms. . 139
5.4.4 Multidimensional DFTs. . 141
5.4.5 Programming and Quantization Noise Issues. 142
5.5 Short DFT Algorithms 144
5.5.1 2-Point DFT. 145
5.5.2 3-Point DFT. . 145

5.5.3 4-Point DFT . 145


5.5.4 5-Point DFT . 146
5.5.5 7-Point DFT . 146
5.5.6 8-Point DFT . 147
5.5.7 9-Point DFT . 148
5.5.8 16-Point DFT 149

Chapter 6 Polynomial Transforms


6.1 Introduction to Polynomial Transforms . . . . . . . 151
6.2 General Definition of Polynomial Transforms . . . . 155
6.2.1 Polynomial Transforms with Roots in a Field of
Polynomials . . . . . . . . . . . . . . . . 157
6.2.2 Polynomial Transforms with Composite Roots. 161
6.3 Computation of Polynomial Transforms and Reductions. 163
6.4 Two-Dimensional Filtering Using Polynomial Transforms 165
6.4.1 Two-Dimensional Convolutions Evaluated by Polynomial
Transforms and Polynomial Product Algorithms . . . . 166
6.4.2 Example of a Two-Dimensional Convolution Computed
by Polynomial Transforms. . . . . . . . . . . . . . 168
6.4.3 Nesting Algorithms. . . . . . . . . . . . . . . . . 170
6.4.4 Comparison with Conventional Convolution Algorithms. 172
6.5 Polynomial Transforms Defined in Modified Rings 173
6.6 Complex Convolutions . . . . . . . . 177
6.7 Multidimensional Polynomial Transforms . . . . 178

Chapter 7 Computation of Discrete Fourier Transforms by Polynomial


Transforms
7.1 Computation of Multidimensional DFTs by Polynomial Transforms 181
7.1.1 The Reduced DFT Algorithm . . . 182
7.1.2 General Definition of the Algorithm. . 186
7.1.3 Multidimensional DFTs. . . . . . . 193
7.1.4 Nesting and Prime Factor Algorithms . 194
7.1.5 DFT Computation Using Polynomial Transforms Defined
in Modified Rings of Polynomials. . . . . . . . . . . . 196
7.2 DFTs Evaluated by Multidimensional Correlations and Polynomial
Transforms . . . . . . . . . . . . . . . . . . . . . . . 201
7.2.1 Derivation of the Algorithm . . . . . . . . . . . . . 201
7.2.2 Combination of the Two Polynomial Transform Methods 205
7.3 Comparison with the Conventional FFT. 206
7.4 Odd DFT Algorithms. . . . . . . . 207
7.4.1 Reduced DFT Algorithm. N = 4 209
7.4.2 Reduced DFT Algorithm. N = 8 209
7.4.3 Reduced DFT Algorithm. N = 9 209
7.4.4 Reduced DFT Algorithm. N = 16 210

Chapter 8 Number Theoretic Transforms


8.1 Definition of the Number Theoretic Transforms 211
8.1.1 General Properties of NTTs . . . . 213
8.2 Mersenne Transforms. . . . . . . . . . . 216
8.2.1 Definition of Mersenne Transforms. . 216
8.2.2 Arithmetic Modulo Mersenne Numbers 219
8.2.3 Illustrative Example. . . . . . . . . 221
8.3 Fermat Number Transforms. . . . . . . . 222
8.3.1 Definition of Fermat Number Transforms 223
8.3.2 Arithmetic Modulo Fermat Numbers . . 224
8.3.3 Computation of Complex Convolutions by FNTs . 227
8.4 Word Length and Transform Length Limitations. 228
8.5 Pseudo Transforms. . . . . . . . . . . 230
8.5.1 Pseudo Mersenne Transforms . . . 231
8.5.2 Pseudo Fermat Number Transforms. 234
8.6 Complex NTTs. . . . . . 236
8.7 Comparison with the FFT. . . . . . . . 239

References . . 241
Subject Index. 247
1. Introduction

1.1 Introductory Remarks


The practical applications of the digital convolution and of the discrete Fourier
transform (DFT) have gained growing importance over the last few years. This
is a direct consequence of the major role played by digital filtering and DFTs
in digital signal processing and by the increasing use of digital signal processing
techniques made possible by the rapidly declining cost of digital hardware. The
motivation for developing fast convolution and DFT algorithms is strongly
rooted in the fact that the direct computation of length-N convolutions and
DFTs requires a number of operations proportional to N^2, which rapidly
becomes excessive for large N. This, in turn, implies an excessively large
computational load when these methods are implemented on a computer.
Historically, the most important event in fast algorithm development has
been the fast Fourier transform (FFT), introduced by Cooley and Tukey in
1965, which computes DFTs with a number of operations proportional to N
log N and therefore drastically reduces the computational complexity for large
transforms. Since convolutions can be computed by DFTs, the FFT algorithm
can also be used to compute convolutions with a number of operations pro-
portional to N log N and has therefore played a key role in digital signal process-
ing ever since its introduction. More recently, many new fast convolution and
DFT techniques have been proposed to further decrease the computational
complexity corresponding to these operations. The fast DFT algorithm in-
troduced in 1976 by Winograd is perhaps the most important of these methods
because it achieves a theoretical reduction of computational complexity over the
FFT by a method which can be viewed as the converse of the FFT, since it com-
putes a DFT as a convolution. Indeed, as we shall see in this book, the rela-
tionship between convolution and the DFT has many facets and its implications go
far beyond a mere algorithmic procedure.
Another important factor in the development of new algorithms was the
recognition that convolutions and DFTs can be viewed as operations defined in
finite rings and fields of integers and of polynomials. This new point of view has
allowed both derivation of some lower computational complexity bounds and
design of new and improved computation techniques such as those based on
polynomial transforms and number theoretic transforms.
In addition to their practical implications, many convolution and DFT
algorithms are also of theoretical significance because they lead to a better under-
standing of mathematical structures which may have many applications in areas

other than convolution and DFT. It is likely, for instance, that polynomial
transforms will appear as a very general tool for mapping multidimensional
problems into one-dimensional problems.
The matter of comparing different algorithms which perform the same func-
tions is pervasive throughout this book. In many cases, we have used the number
of arithmetic operations required to execute an algorithm as a measure of the
computational complexity. While there is some rough relationship between the
overall complexity of an algorithm and its algebraic complexity, the practical
value of a computation method depends upon a number of factors. Apart from
the number of arithmetic operations, the efficiency of an algorithm is related to
many parameters such as the number of data moves, the cost of ancillary oper-
ations, the overall structural complexity, the performance capabilities of the
computer on which the algorithm is executed, and the skill of the programmer.
Therefore, ranking different algorithms as a function of actual efficiency ex-
pressed in terms of computer execution times is a difficult art so that the com-
parisons based on the number of arithmetic operations must be weighted as a
function of the particular implementation.

1.2 Notations

It is always difficult to avoid the proliferation of different symbols and subscripts
when presenting the various DFT and convolution algorithms. We have
adopted here some conventions in order to simplify the presentation. Discrete
data sequences are usually represented by lower case letters such as x_n. We have
not used the representation {x_n} for data sequences, because this simplifies the
notation and because the context information prevents confusion between the
sequence and the nth element of the sequence. Thus, in our representation, a
discrete-time signal x_n is a sequence of the values of a continuous signal x(t),
sampled at times t = nT and represented by numbers. Polynomials are re-
presented by capital letters such as

X(z) = Σ_{n=0}^{N-1} x_n z^n.    (1.1)

For transforms, we use the notation X_k which, for a DFT, has the form

X_k = Σ_{n=0}^{N-1} x_n W^{nk}.    (1.2)

We have also sometimes adopted Rader's notation (x)_p for the residue of x
modulo p.

1.3 The Structure of the Book

Chapter 2 presents introductory material on number theory and polynomial
algebra. This covers in an intuitive way various topics such as the divisibility of
integers and polynomials, congruences, roots defined in finite fields and rings.
This background in mathematics is required to understand the rest of the book
and may be skipped by the readers who are already familiar with number theory
and modern algebra.
Fast convolution algorithms are discussed in Chap. 3. It is shown that most
of these algorithms can be represented in polynomial algebra and can be con-
sidered as various forms of nesting.
The fourth chapter gives a simple development of the conventional fast
Fourier transform algorithm and presents new versions of this method such as
the Rader-Brenner algorithm.
Chapter 5 is devoted to the computation of discrete Fourier transforms
by convolutions and deals primarily with the Winograd Fourier transform, which is
an extremely efficient algorithm for the computation of the discrete Fourier
transform.
In Chaps. 6 and 7, we introduce the polynomial transforms which are DFTs
defined in finite rings and fields of polynomials. We show that these transforms
are computed without multiplications and provide an efficient tool for com-
puting multidimensional convolutions and DFTs.
In Chap. 8, we turn our attention to algorithms implemented in modular
arithmetic and we present the number theoretic transforms which are DFTs
defined in finite rings and fields of numbers. We show that these transforms may
have important applications when implemented in special purpose hardware.
2. Elements of Number Theory and Polynomial Algebra

Many new digital signal processing algorithms are derived from elementary
number theory or polynomial algebra, and some knowledge of these topics is
necessary to understand these algorithms and to use them in practical applica-
tions.
This chapter introduces the necessary background required to understand
these algorithms in a simple, intuitive way, with the intent of familiarizing
engineers with the mathematical principles that are most frequently used in this
book. We have made here no attempt to give a complete rigorous mathematical
treatment but rather to provide, as concisely as possible, some mathematical tools
with the hope that this will prompt some readers to study further, with some
of the many excellent books that have been published on the subject [2.1-4].
The material covered in this chapter is divided into two main parts: ele-
mentary number theory and polynomial algebra. In elementary number theory,
the most important topics for digital signal processing applications are the
Chinese remainder theorem and primitive roots. The Chinese remainder the-
orem, which yields an unusual number representation, is used for number
theoretic transforms (NTT) and for index manipulations which serve to map
one-dimensional problems into multidimensional problems. The primitive roots
play a key role in the definition of NTTs and are also used to convert discrete
Fourier transforms (DFT) into convolutions, which is an important step in the
development of the Winograd Fourier transform algorithm.
In the polynomial algebra section, we introduce briefly the concepts of rings
and fields that are pervasive throughout this book. We show how polynomial
algebra relates to familiar signal processing operations such as convolution and
correlation. We introduce the Chinese remainder theorem for polynomials and
we present some complexity theory results which apply to convolutions and
correlations.

2.1 Elementary Number Theory


In this section, we shall be essentially concerned with the properties of integers.
We begin with the simple concept of integer division.

2.1.1 Divisibility of Integers


Let a and b be two integers, with b positive. The division of a by b is defined by

a = bq + r,    0 ≤ r < b,    (2.1)

where q is called the quotient and r is called the remainder. When r = 0, b and q
are factors or divisors of a, and b is said to divide a, this operation being denoted
by b | a. When a has no divisors other than 1 and a, a is a prime. In all other
cases, a is composite.
When a is composite, it can always be factorized into a product of powers of
prime numbers p_i^{c_i}, where c_i is a positive integer, with

a = ∏_i p_i^{c_i}.    (2.2)

The fundamental theorem of arithmetic states that this factorization is unique.


The largest positive integer d which divides two integers a and b is called the
greatest common divisor (GCD) and denoted

d = (a, b); (2.3)

when d = (a, b) = 1, a and b have no common factors other than 1 and they are
said to be mutually prime or relatively prime.
The GCD can be found easily by a division algorithm known as the Euclidean
algorithm. In discussing this algorithm, we shall assume that a and b are positive
integers. This is done without loss of generality, since (a, b) = (- a, b) = (a,
-b) = (-a, -b). Dividing a by b yields

a = bq_1 + r_1,    0 ≤ r_1 < b;    (2.4)

by definition, d = (a, b) ≤ a or b. Therefore, if r_1 = 0, b | a and (a, b) = b. If
r_1 ≠ 0, we obtain, by continuation of this procedure, the following system of
equations:

b = r_1 q_2 + r_2,
r_1 = r_2 q_3 + r_3,
. . .
r_{k-2} = r_{k-1} q_k + r_k,
r_{k-1} = r_k q_{k+1}.    (2.5)

Since r_1 > r_2 > r_3 > ..., the last remainder is zero. Thus, by the last equation,
r_k | r_{k-1}. The preceding equation implies that r_k | r_{k-2}, since r_k | r_{k-1}. Finally, we
obtain r_k | b and r_k | a. Hence, r_k is a divisor of a and b. Suppose now that c is any
divisor of a and b. By (2.4), c also divides r_1. Then, (2.5) implies that c divides r_2,
r_3, ..., r_k. Thus, any divisor c of a and b divides r_k and therefore c ≤ r_k. Hence,
r_k is the GCD of a and b.
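
In modern notation, the repeated divisions above can be mechanized in a short loop. The following Python sketch is an added illustration (function names are ours) of the procedure defined by (2.4) and (2.5):

    def gcd(a, b):
        """Greatest common divisor by the Euclidean algorithm of (2.4)-(2.5)."""
        a, b = abs(a), abs(b)          # (a, b) = (-a, b) = (a, -b) = (-a, -b)
        while b != 0:
            a, b = b, a % b            # replace (a, b) by (b, r), r the remainder
        return a

    # Example: gcd(15, 9) returns 3, the last nonzero remainder.
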
An important consequence of Euclid's algorithm is that the GCD of two
integers a and b is a linear combination of a and b. This can be seen by rewriting
(2.4) and (2.5) as

r_1 = a - b q_1
r_2 = b - r_1 q_2
. . .
r_k = r_{k-2} - r_{k-1} q_k.    (2.6)

The first equation shows that r_1 is a linear combination of a and b. The second
equation shows that r_2 is a linear combination of b and r_1 and therefore of both a
and b. Finally, the last equation implies that r_k is a linear combination of a and
b. Since r_k = (a, b), we have

(a, b) = ma + nb, (2.7)

where m and n are integers. When a and b are mutually prime, (2.7) reduces
to Bezout's relation

1 = ma + nb. (2.8)

We now change our point of view by considering a linear equation with integer
coefficients a, b, and c

ax + by = c (2.9)

where x and y are a pair of integers which are the solution of this Diophantine
equation. Such an equation has a solution if and only if (a, b) | c. To demonstrate
this point, we note the following. It is obvious from (2.9) that for a = 0 or b = 0,
we must have b | c or a | c.
For a ≠ 0, b ≠ 0, it is apparent that if (2.9) holds for integers x and y, then
d = (a, b) is such that d | c. Conversely, if d | c, let c_1 = c/d; then (2.7) implies the
existence of two integers m and n such that d = ma + nb. Hence c = c_1 d =
c_1 ma + c_1 nb, and the solutions of the Diophantine equation are given by
x = c_1 m, y = c_1 n. Thus, for (a, b) | c, the solution of the Diophantine equation
is given by the Euclidean algorithm. The solution of the Diophantine equation
is not unique, however. This can be seen by considering a particular solution
c = ax_0 + by_0. Assuming x, y is another solution, we have

a(x - x_0) = b(y_0 - y)    (2.10)

and, by dividing this expression by d, we obtain

(a/d)(x - x_0) = (b/d)(y_0 - y).    (2.11)

Since [(a/d), (b/d)] = 1, this implies that (b/d) | (x - x_0) and x = x_0 + (b/d)s,
where s is an integer. Substituting into (2.11), we obtain

y = y_0 - (a/d)s
x = x_0 + (b/d)s.    (2.12)

This defines a class of linearly related solutions for (2.9) which depend upon the
integer s.
As a numerical example, consider the equation

15x + 9y = 21.
We first use Euclid's algorithm to determine the GCD d with a = 15 and
b = 9,

15 = 9·1 + 6
9 = 6·1 + 3
6 = 3·2.

Hence d = 3. Since 3 | 21, the Diophantine equation has a solution. We now
define 3 as a linear combination of 15 and 9 by recasting the preceding set of
equations as

6 = 15 - 9·1
3 = 9 - 6·1 = -15 + 2·9.

Thus, m = -1 and n = 2. Dividing c = 21 by d = 3 yields c_1 = 7. This gives
a particular solution x_0 = -7, y_0 = 14. If we divide a = 15 and b = 9 by
d = 3, we obtain (a/d) = 5 and (b/d) = 3. Hence, the general solution to the
Diophantine equation becomes

y = 14 - 5s
x = - 7 + 3s,
where s is any integer.
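
The same steps can be carried out mechanically with the extended Euclidean algorithm. The Python sketch below is an added illustration (names are ours): it returns d, m, n with d = ma + nb, as in (2.7), and a particular solution of ax + by = c when d | c.

    def extended_gcd(a, b):
        """Return (d, m, n) with d = (a, b) and d = m*a + n*b, as in (2.7)."""
        if b == 0:
            return a, 1, 0
        d, m1, n1 = extended_gcd(b, a % b)
        # d = m1*b + n1*(a - (a//b)*b) = n1*a + (m1 - (a//b)*n1)*b
        return d, n1, m1 - (a // b) * n1

    def solve_diophantine(a, b, c):
        """Particular solution (x0, y0) of a*x + b*y = c, or None if (a, b) does not divide c."""
        d, m, n = extended_gcd(a, b)
        if c % d != 0:
            return None
        c1 = c // d
        # general solution: x0 + (b//d)*s, y0 - (a//d)*s
        return c1 * m, c1 * n

    # Example from the text: solve_diophantine(15, 9, 21) gives d = 3 and (x0, y0) = (-7, 14).
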

2.1.2 Congruences and Residues

In (2.1), the division of an integer a by an integer b produces a remainder r. All
integers a which give the same remainder when divided by b can be thought of as
pertaining to the same equivalence class relative to the equivalence relation
a = bq + r.
Two integers a_1 and a_2 pertaining to the same class are said to be congruent
modulo b and the equivalence is denoted

a_1 ≡ a_2 modulo b.    (2.13)

Thus, two numbers a_1 and a_2 are congruent modulo b if

b | (a_1 - a_2).    (2.14)

Underlying the concept of congruence is the fact that, in many physical prob-
lems, one is primarily interested in measuring relative values within a given
range. This is apparent, for instance, when measuring angles. In this case, the
angles are defined from 0° to 359° and two angles that differ by a multiple of
360° are considered to be equal. Hence angles are defined modulo 360.
Thus, in congruences, we are interested only in the remainder r of the division
of a by b. This remainder is usually called the residue and is denoted by

r ≡ a modulo b.    (2.15)

This representation is sometimes simplified to a form with the symbol ( ) [2.4],

r = (a)_b,    (2.16)

where the subscript is omitted when there is no ambiguity on the nature of the
modulus b.
It follows directly from the definition of residues given by (2.14) that addi-
tions and multiplications can be performed directly on residues

(a_1 + a_2) = ((a_1) + (a_2))    (2.17)

(a_1 a_2) = ((a_1)(a_2)).

With congruences, division is not defined. We can, however, define something
close to it by considering the linear congruence

ax ≡ c modulo b.    (2.18)

This linear congruence is the Diophantine equation ax + by = c in which all
terms are defined modulo b. Thus, we know by the results of the preceding sec-
tion that we can find values of x satisfying (2.18) if and only if d | c, with d =
(a, b). In this case, the solutions can be derived from (2.12) and are given by

x ≡ x_0 + (b/d)s modulo b,    (2.19)

where x_0 is a particular solution and s can be any integer smaller than b. How-
ever, there are only d distinct solutions since (b/d)s has only d distinct values
modulo b. An important consequence of this point is that the linear congruence
ax ≡ c modulo b always has a unique solution when (a, b) = 1. Thus, when
(a, b) | c, the linear congruence ax ≡ c modulo b can be solved and Euclid's
algorithm provides a method for computing the values of x which satisfy this
relation. We shall see later that Euler's theorem gives a more elegant solution
to the linear congruence (2.18) when (a, b) = 1.
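
A minimal Python sketch of this procedure for solving ax ≡ c modulo b is given below as an added illustration (names are ours; the modular inverse pow(·, -1, ·) requires Python 3.8 or later):

    from math import gcd

    def solve_linear_congruence(a, c, b):
        """All solutions x (mod b) of a*x ≡ c (mod b), following (2.18)-(2.19)."""
        d = gcd(a, b)
        if c % d != 0:
            return []                           # solvable only when d | c
        a1, c1, b1 = a // d, c // d, b // d
        x0 = (pow(a1, -1, b1) * c1) % b1        # unique solution modulo b/d
        return [x0 + s * b1 for s in range(d)]  # the d distinct solutions modulo b

    # When (a, b) = 1 there is a single solution; e.g. solve_linear_congruence(5, 3, 7) -> [2].
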
We consider now the problem of solving a set of simultaneous linear
congruences with different moduli. Changing our notation, we want to find the
integer x which satisfies simultaneously the k linear congruences

x ≡ r_i modulo m_i,    i = 1, ..., k.    (2.20)

The solution to this problem plays a major role in many signal processing algo-
rithms and is given by the Chinese remainder theorem.
Theorem 2.1: Let m_i be k positive integers greater than 1 and relatively prime
in pairs. The set of linear congruences x ≡ r_i modulo m_i has a unique solution
modulo M, with M = ∏_{i=1}^{k} m_i.

The proof of this theorem is established by using the relations

x ≡ Σ_{i=1}^{k} (M/m_i) r_i T_i modulo M    (2.21)

(M/m_i) T_i ≡ 1 modulo m_i.    (2.22)

Equation (2.22) defines k linear congruences. Since the m_i are mutually prime,
[m_i, (M/m_i)] = 1 and each of these congruences has a unique solution T_i which
can be computed by Euclid's algorithm or Euler's theorem (theorem 2.3). Let us
now reduce x in (2.21) modulo m_u, one of the moduli m_i. Except for M/m_u, all
the expressions M/m_i contain m_u as a factor and are therefore equal to zero
modulo m_u. Hence, (2.21) reduces to

x ≡ (M/m_u) r_u T_u modulo m_u    (2.23)

and, since (2.22) implies that (M/m_u) T_u ≡ 1 modulo m_u, (2.23) becomes

x ≡ r_u modulo m_u.    (2.24)

It is seen easily that this operation can be repeated for all moduli m_i and there-
fore that (2.21) is the solution of the k linear congruences x ≡ r_i modulo m_i.
As a simple application of the Chinese remainder theorem, let us find the
solution to the simultaneous congruences

(x)_3 = 2,  (x)_4 = 1,  (x)_5 = 3.

Here, we have m_1 = 3, m_2 = 4, m_3 = 5, M = 60, M/m_1 = 20, M/m_2 = 15, and
M/m_3 = 12. The congruences (20 T_1)_3 = 1, (15 T_2)_4 = 1, and (12 T_3)_5 = 1
are solved, respectively, by T_1 = 2, T_2 = 3, and T_3 = 3. Hence

x ≡ 20·2·2 + 15·1·3 + 12·3·3 modulo 60

x = 53.
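
The reconstruction (2.21)-(2.22) translates directly into a few lines of Python. The sketch below is an added illustration (names are ours; Python 3.8+ for the modular inverse):

    from math import prod

    def crt(residues, moduli):
        """Chinese remainder reconstruction (2.21); moduli must be pairwise prime."""
        M = prod(moduli)
        x = 0
        for r_i, m_i in zip(residues, moduli):
            M_i = M // m_i
            T_i = pow(M_i, -1, m_i)     # inverse of M/m_i modulo m_i, as in (2.22)
            x += M_i * r_i * T_i
        return x % M

    # Example from the text: crt([2, 1, 3], [3, 4, 5]) returns 53.
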
The Chinese remainder theorem can be used to define residue number
systems (RNS) which allow one to perform high-speed arithmetic operations
without carry propagation from digit to digit. In such a system, an integer a is
represented by its residues a_i modulo a set of relatively prime integers m_i,

a → (a_1, a_2, ..., a_k),    a_i ≡ a modulo m_i.    (2.25)

In this system, the addition or the multiplication of two integers a and b is done
by adding or multiplying separately their various residues a_i and b_i, without any
carry from one residue to another one. Thus, if M = ∏_i m_i is chosen to be the
product of many small relatively prime moduli m_i, the computation can accom-
modate large numbers although actual calculations are performed on a large set
of small residues, without carry propagation. Hence, residue number systems
are quite effective for high-speed multiplications and additions. Unfortunately,
this advantage is usually offset by many practical difficulties related to the cost
of translating from conventional number systems to RNS, the lack of a division
operation, and the increased word length required for unambiguous operation
in a modular system. Because of these limitations, the RNS is rarely used. We
shall see, however, in Chap. 8 that modular arithmetic and the Chinese re-
mainder theorem play an important role in the definition of number theoretic
transforms and may have significant applications in these areas.
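
As an illustration of such a system, the following Python sketch (the small moduli are chosen arbitrarily for the example) adds and multiplies residue by residue, with no carries between digits:

    MODULI = (3, 4, 5)                 # pairwise-prime moduli, M = 60

    def to_rns(a):
        """Represent a by its residues modulo each m_i, as in (2.25)."""
        return tuple(a % m for m in MODULI)

    def rns_add(x, y):
        # each residue is processed independently: no carry propagates between digits
        return tuple((xi + yi) % m for xi, yi, m in zip(x, y, MODULI))

    def rns_mul(x, y):
        return tuple((xi * yi) % m for xi, yi, m in zip(x, y, MODULI))

    # Example: 7 * 8 = 56 < 60, so rns_mul(to_rns(7), to_rns(8)) equals to_rns(56).
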
The Chinese remainder theorem is also often used to map an M-point one-
dimensional data sequence x_n into a k-dimensional data array. This is done by
noting that if n is defined modulo M, with n = 0, ..., M - 1, we can redefine n
by the Chinese remainder theorem as

n ≡ Σ_{i=1}^{k} (M/m_i) n_i T_i modulo M,    (2.26)

where the index n_i along dimension i takes the values 0, ..., m_i - 1. This map-
ping, which is possible only when M is the product of relatively prime factors
m_i, is very important for the computation of discrete Fourier transforms and
convolutions, as will be seen in the following chapters.
We now introduce the concept of permutation. Let us consider again the set
of M integers n, with n = 0, ..., M - 1. If we multiply modulo M each element
n_l of n by an integer a, we obtain a set of M numbers b_l defined by

b_l ≡ a·n_l modulo M.    (2.27)

The n_l are all distinct. We would like the b_l to be also all distinct in such a way
that when the n_l span the M values 0, ..., M - 1, the b_l span the same values,
although in a different order.
Each equation (2.27) is a linear congruence and we already know from (2.19)
that the solution of this congruence is unique if (a, M) = 1. Let us assume that
(a, M) = 1 and consider two distinct values n_l and n_u pertaining to the set of the
M integers n. Since (a, M) = 1, the linear congruence (2.27) defines two integers
b_l and b_u corresponding, respectively, to n_l and n_u. Subtracting b_u from b_l yields

b_l - b_u ≡ a(n_l - n_u) modulo M.    (2.28)

If b_l = b_u, this implies that a(n_l - n_u) ≡ 0 modulo M. This is impossible because
a is relatively prime with M and n_l - n_u < M. Thus, for (a, M) = 1, the permu-
tation defined by (2.27) maps all possible values of n.
As an example, consider the permutation defined by b ≡ 5n modulo 6. 5
and 6 are mutually prime. When n takes successively the values 0, 1, 2, 3, 4, 5,
the integers b take the corresponding values 0, 5, 4, 3, 2, 1.
We shall see in the following chapters that permutations are often used in
signal processing to reorder a set of data samples. At this point, we return to
the one-dimensional to multidimensional mapping using the Chinese remainder
theorem to show that this method can be simplified by permutation. When M
is the product of two mutually prime factors m_1 and m_2, (2.26) becomes

n ≡ m_2 n_1 T_1 + m_1 n_2 T_2 modulo M.    (2.29)

Since T_1 and T_2 are mutually prime with m_1 and m_2, respectively, m_2 n_1 T_1 and
m_1 n_2 T_2 can be viewed as the two permutations n_1 T_1 modulo m_1 and n_2 T_2 modulo
m_2 of two sets of m_1 points and m_2 points, respectively. Hence the mapping
defined by (2.29) can be replaced by the simpler mapping

n ≡ m_2 n_1 + m_1 n_2 modulo M    (2.30)

and, in the case of more than two factors,

n ≡ Σ_{i=1}^{k} (M/m_i) n_i modulo M.    (2.31)

The advantage of (2.30) over (2.26) is that the computation of the inverses T_i
is no longer required.
As an example, consider M = 6, with m_1 = 2 and m_2 = 3. The sequence
n is given by {0, 1, 2, 3, 4, 5}. Since T_1 = 1 and T_2 = 2, (2.26) yields n ≡ 3n_1 +
4n_2 while (2.30) gives n ≡ 3n_1 + 2n_2. When the pair n_1, n_2 takes successively the
values {(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)}, the sequence n becomes {0, 3, 4,
1, 2, 5} for the first equation and {0, 3, 2, 5, 4, 1} for the second equation. Thus,
both approaches span the complete set of values of n, although in a different
order.
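
Both index mappings can be tabulated directly. The Python sketch below is an added illustration (names are ours; Python 3.8+ for the modular inverse) and reproduces the orderings of the example:

    def crt_maps(m1, m2):
        """The mappings (2.26)/(2.29) (with inverses T_i) and (2.30) (without), listed over the (n1, n2) grid."""
        M = m1 * m2
        T1 = pow(M // m1, -1, m1)            # (M/m1)*T1 ≡ 1 mod m1, as in (2.22)
        T2 = pow(M // m2, -1, m2)
        grid = [(n1, n2) for n2 in range(m2) for n1 in range(m1)]
        crt = [(m2 * n1 * T1 + m1 * n2 * T2) % M for n1, n2 in grid]   # (2.29)
        simple = [(m2 * n1 + m1 * n2) % M for n1, n2 in grid]          # (2.30)
        return crt, simple

    # For m1 = 2, m2 = 3 this returns ([0, 3, 4, 1, 2, 5], [0, 3, 2, 5, 4, 1]), matching the text.
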

2.1.3 Primitive Roots

We have seen that defining integers modulo an integer m partitions these integers
into m equivalence classes. Among these classes, those corresponding to integers
which are relatively prime to m play a particularly important role and we shall
often need to know how many integers are smaller than m and relatively prime
to m. This quantity is usually denoted by φ(m) and called Euler's totient function.
We may observe that φ(1) = 1 since (1, 1) = 1. When m is a prime, with
m = p, all integers smaller than p are relatively prime to p. Thus,

φ(p) = p - 1.    (2.32)

If m = p^c, the only numbers less than m and not prime with p are the multiples
of p. Therefore,

φ(p^c) = p^{c-1}(p - 1) = p^c(1 - 1/p).    (2.33)

In order to find φ(m) for any integer m, we first establish that Euler's totient
function is multiplicative.
Theorem 2.2: If a and b are two mutually prime integers, φ(a·b) = φ(a) φ(b).

The theorem is proved by considering all integers u smaller than a·b and
defined by u = aq + r, r = 0, 1, ..., a - 1 and q = 0, 1, ..., b - 1. It is seen that
u is relatively prime to a if r is one of the φ(a) integers relatively prime to a.
Thus, the bφ(a) integers u_1 given by (r_1), (a + r_1), (2a + r_1), ..., [(b - 1)a + r_1]
are prime to a. If q is chosen among the φ(b) integers smaller than b and mutually
prime with b, the corresponding integers u_1 are relatively prime to b, since no fac-
tor of b can divide a or q. Thus, there are φ(a)φ(b) integers relatively prime to a
and b and therefore relatively prime to a·b.
An immediate corollary of theorem 2.2 is that, if an integer N is given by its
prime factorization N = p_1^{c_1} p_2^{c_2} ... p_k^{c_k}, then φ(N) becomes

φ(N) = N ∏_{i=1}^{k} (1 - 1/p_i).    (2.34)

This follows from theorem 2.2 and from (2.33).
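
Equation (2.34) gives a direct way of computing φ(N). A minimal Python sketch, added here for illustration (names are ours):

    def euler_phi(n):
        """Euler's totient φ(n) via the prime factorization, as in (2.34)."""
        result, p, m = n, 2, n
        while p * p <= m:
            if m % p == 0:
                while m % p == 0:
                    m //= p
                result -= result // p        # multiply result by (1 - 1/p)
            p += 1
        if m > 1:                            # leftover prime factor
            result -= result // m
        return result

    # Examples: euler_phi(7) = 6, euler_phi(40) = 16, euler_phi(41) = 40.
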


An important property of Euler's totient function is that the sum of φ(d)
taken over all divisors d of N is equal to N,

Σ_{d|N} φ(d) = N.    (2.35)

This property follows from the fact that N/d is a divisor of N when d | N. Thus,

Σ_{d|N} φ(d) = Σ_{d|N} φ(N/d).    (2.36)

We now consider the sets S of integers a such that (a, N) = d, 1 ≤ a ≤ N. Every
integer from 1 to N belongs to one and only one set S. In each set, we have
(a, N) = d or (a/d, N/d) = 1; thus each set contains φ(N/d) integers. Since we
have N integers,

N = Σ_{d|N} φ(N/d)    (2.37)

and (2.35) follows from (2.36) and (2.37).


With Euler's totient function, we have specified the equivalence classes of
integers that are defined modulo m and relatively prime with m. We shall now go
further in the specification of equivalence classes by defining the order of an
integer modulo m. This is done by considering an integer x_n defined by

x_n ≡ a^n modulo m,    (2.38)

where a is an integer. If n takes successively the values 0, 1, 2, ..., then x_n will
successively take the values x_0, x_1, x_2, .... Since x_n can only take the m distinct
values 0 to m - 1, x_n will necessarily repeat a previously computed value x_i
for some integer r, with r > i. Let r be the smallest value of n for which such a
repetition occurs. We have

x_r ≡ x_i modulo m,    r > i.    (2.39)

Then, by definition, x_{r+1} ≡ a x_r and x_{i+1} ≡ a x_i. Thus, x_{r+1} ≡ x_{i+1} and, for any
n ≥ r,

x_n ≡ x_{n-r+i} modulo m.    (2.40)

This means that, for n ≥ i, the sequence x_n repeats itself cyclically with a period
of r - i elements. When i = 0, the sequence repeats itself from the beginning
(a^0 = 1) and the cyclic group defined by (2.40) contains all the possible values of
x_n corresponding to a and m. The conditions for this important case are given by
Euler's theorem.
Theorem 2.3: If (a, m) = 1, then

a^{φ(m)} ≡ 1 modulo m.    (2.41)

This theorem is proved easily by considering the permutations (a·n) modulo m
of the φ(m) integers n which are relatively prime to m. Since (a, m) = 1, all the
permutation products are distinct. Moreover, since (n, m) = 1, (a·n, m) = 1
and the permutation (a·n) spans the complete set of integers mutually prime
with m. Thus,

∏_{(n,m)=1} n ≡ ∏_{(n,m)=1} a·n ≡ a^{φ(m)} ∏_{(n,m)=1} n modulo m.    (2.42)

We can cancel the product of n on both sides of the congruence because the
various integers n are relatively prime to m. Thus, (2.42) yields (2.41) and the
proof is completed. When m is a prime, with m = p, then (a, p) = 1 if p does not
divide a and we have φ(p) = p - 1. In this case, Euler's theorem reduces to
Fermat's theorem.
Theorem 2.4: If p is a prime, then, for every integer a,

a^{p-1} ≡ 1 modulo p    (2.43)

or

a^p ≡ a modulo p.    (2.44)

Theorems 2.3 and 2.4 give a simple alternative to Euclid's algorithm for
solving linear congruences when (a, m) = 1. Consider the congruence

ax ≡ c modulo m.    (2.45)

We know that, if (a, m) = 1, this congruence always has a unique solution.
This solution is given by

x ≡ c a^{φ(m)-1} modulo m    (2.46)

or, when m is a prime with m = p,

x ≡ c a^{p-2} modulo p.    (2.47)
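
A minimal Python sketch of the solution (2.46), reusing the euler_phi sketch given above (both are added illustrations, not part of the original text):

    def solve_congruence_euler(a, c, m):
        """Solve a*x ≡ c (mod m) for (a, m) = 1 using (2.46): x ≡ c * a^(φ(m)-1)."""
        return (c * pow(a, euler_phi(m) - 1, m)) % m

    # Example: solve_congruence_euler(5, 3, 7) returns 2, since 5·2 = 10 ≡ 3 modulo 7.
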

An interesting application of Euler's theorem can be found for the Chinese


remainder reconstruction process discussed in proving theorem 2.1. We have
seen, with theorem 2.1, that if M is an integer which factors into k relatively
prime integers m_i, then an integer x defined modulo M can be reconstructed by
(2.21) from the various residues r_i, with r_i ≡ x modulo m_i. One of the difficulties
in using the Chinese remainder theorem consists in evaluating the k inverses T_i
modulo m_i. However we note that, by using Euler's theorem, the Chinese remain-
der reconstruction defined by (2.21) and (2.22) can be replaced by a much
simpler formulation which does not require the computation of inverses

x ≡ Σ_{i=1}^{k} (M/m_i)^{φ(m_i)} r_i modulo M.    (2.48)

This equation is established by noting that (M/m_u)^{φ(m_u)} ≡ 0 modulo m_i for i ≠ u
and that (M/m_u)^{φ(m_u)} ≡ 1 modulo m_u by Euler's theorem.
Thus, Euler's theorem is sometimes used to solve linear congruences. The
main interest of this theorem, however, lies in the specification of the order of
an integer. We have seen above that the sequence x_n ≡ a^n modulo m repeats itself
with a periodicity r - i, from a value i. If a^n ≡ 1 for some value r of n, the
sequence will repeat itself from its beginning, since x_0 = 1. Hence, if r is the
smallest positive integer such that a^r ≡ 1 modulo m, the complete sequence of
integers a^n modulo m will be periodic with period r.
Let us now determine the maximum value of r for a given m. We know by
Euler's theorem that if (a, m) = 1, r ≤ φ(m). If (a, m) = d ≠ 1, we have (a/d,
m/d) = 1 and r_1 ≤ φ(m/d). Since φ(m/d) < φ(m), the period r is maximum for
(a, m) = 1. We shall call the element g which generates a sequence of length
φ(m) a primitive root. An element g generating a shorter cyclic sequence of
length r < φ(m) will be simply called a root of order r. We shall now consider
the issue of how many roots of a given order can exist and define ways of finding
these roots. The following theorems are used to support these objectives.

Theorem 2.5: If g is a root of order r modulo m, the r integers g^0, g^1, ..., g^{r-1} are
incongruent modulo m.
This theorem is proved by assuming that g^{r_1} ≡ g^{r_2} for two distinct values r_1
and r_2 such that r_2 < r_1 < r. If this were the case, we would have

g^{r_1 - r_2} ≡ 1 modulo m.

This is impossible since r_1 - r_2 < r and r is, by definition, the smallest integer
such that g^r ≡ 1.
Theorem 2.6: If (g, m) = 1 and g^b ≡ 1 modulo m, the order r of the integer
g must divide b.
If g^b ≡ 1 modulo m, m | (g^b - 1). Let us assume that g is of order r. This
means that r is the smallest integer such that g^r ≡ 1 modulo m. Since m | (g^r - 1),
m | (g^d - 1), where d = (r, b). Since d is the GCD of r and b, d ≤ r. However,
d cannot be less than r, which is by definition the smallest integer such that g^r ≡ 1.
Thus, d = r. Since d | b, r | b.
Theorem 2.7: If (g, m) = 1, the order r of the integer g must divide φ(m).
This theorem follows directly from theorem 2.6 and Euler's theorem, since
g^{φ(m)} ≡ 1 modulo m if (g, m) = 1.
It can also be shown that primitive roots exist only for m = p^c or m = 2p^c,
with p an odd prime. When p = 2, primitive roots exist only for m = 2 and
m = 4. When m = p, with p an odd prime, the following theorem, first introduced
by Gauss, specifies the number of roots of a given order.
Theorem 2.8: If r | p - 1, with p an odd prime, there are φ(r) incongruent integers
which have order r modulo p.
Suppose that g has order r modulo p. Then, by theorem 2.5, the r integers
g^0, g^1, ..., g^{r-1} are incongruent modulo p and satisfy the equation x^r ≡ 1 modulo
p. Thus, the sequence g^n modulo p is periodic and n is defined modulo r, with
n = 0, 1, ..., r - 1. Then, for (b, r) = 1, g^{bn} modulo p is a simple permutation of
the sequence g^n modulo p. When (b, r) ≠ 1 the sequence b·n modulo r will con-
tain repetitions and therefore the corresponding integers g^b will be of order less
than r. Thus, we have either zero or φ(r) incongruent roots of order r. Since all
integers in the set 1, ..., p - 1 have some order, the total number of roots, for
all divisors of p - 1, is equal to p - 1. We note, with (2.35), that

Σ_{r|p-1} φ(r) = p - 1.    (2.49)

Thus, there are φ(r) roots of order r for each divisor r of p - 1 and this com-
pletes the proof of the theorem.
The theory of primitive roots is quite complex and a complete treatment can
be found in [2.1-3]. In practice, primitive roots modulo primes less than 10000
are given in [2.5]. When m = p^c or m = 2p^c, for p an odd prime, the primitive

roots are of order r, with r = φ(m). Thus, an integer g will be a primitive root if
g^n ≢ 1 modulo m for n < r. Moreover, if r = q_1^{c_1} q_2^{c_2} ... q_l^{c_l} is the prime fac-
torization of r, any integer which is not a primitive root will be a root of order r_i
smaller than r, where r_i is a factor of r. Thus, a primitive root g satisfies the
condition

g^{r/q_i} ≢ 1 modulo m,    i = 1, ..., l.    (2.50)

The use of (2.50) greatly simplifies the search for primitive roots, as can be
seen with the following example corresponding to m = 41. Since 41 is a prime,
r = φ(41) = 40 = 5·2^3. A straightforward approach to checking whether an
integer x is a primitive root would be to compute x^n modulo 41 for n = 1, 2, ...,
39 and to check that x^n ≢ 1 for all these values of n. When m is large, this method
rapidly becomes impracticable and it is much simpler to check that x^n ≢ 1
modulo m only for the exponents n = r/q_i, which are factors of φ(m). In our case,
we note that 2^20 ≡ 1, 3^8 ≡ 1, 4^20 ≡ 1, and 5^20 ≡ 1. Thus 2, 3, 4, and 5 are ruled
out as primitive roots. We note however that 6^8 ≡ 10 and 6^20 ≡ 40 and, therefore,
6 is a primitive root modulo 41.
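
The search based on (2.50) is easily programmed. The following Python sketch (an added illustration for a prime modulus; names are ours) tests a candidate g by checking the exponents r/q_i:

    def is_primitive_root(g, p):
        """Test g against condition (2.50) for a prime modulus p, where r = φ(p) = p - 1."""
        r, q, factors = p - 1, 2, []
        n = r
        while q * q <= n:                      # collect the prime factors q_i of r
            if n % q == 0:
                factors.append(q)
                while n % q == 0:
                    n //= q
            q += 1
        if n > 1:
            factors.append(n)
        return all(pow(g, r // q, p) != 1 for q in factors)

    # Example from the text: is_primitive_root(6, 41) is True, while 2, 3, 4, and 5 all fail.
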
Once a primitive root g has been found, any root of order r_1, where r_1 | r, can
easily be found by raising g to the power r/r_1. Moreover, when m is composite,
with m = m_1 m_2 ... m_k, we know by the Chinese remainder theorem that
any root of order r_1 modulo m must also be a root of order r_1 modulo m_1, m_2,
..., m_k. Thus, these roots are easily found once the primitive roots modulo p^c
and modulo 2p^c are known.
Primitive roots play a very important role in digital signal processing. We
shall see in Chap. 8 that they may be used to define number theoretic transforms.
Another key application of primitive roots concerns the mapping of DFTs into
circular correlations, which is a crucial step in the development of the Winograd
Fourier transform algorithm. We shall discuss this point in detail in Chap. 5,
but we give here the essence of the technique in the simple case of a p-point
DFT, with p a prime,

X_k = Σ_{n=0}^{p-1} x_n W^{nk},    k = 1, ..., p - 1    (2.51)

and for k = 0,

X_0 = Σ_{n=0}^{p-1} x_n.    (2.52)

The exponents and indices in (2.51) are defined modulo p. Thus, for k ≠ 0 and
n ≠ 0, we can change the variables with

n ≡ g^u modulo p,
k ≡ g^v modulo p,    u, v = 0, ..., p - 2,    (2.53)

where g is a primitive root modulo p.
Then, for k ≠ 0, (2.51) becomes

X_{g^v} = x_0 + Σ_{u=0}^{p-2} x_{g^u} W^{g^{u+v}}.    (2.54)

This demonstrates that the main part of the computation of X_k is a circular
correlation of the sequence x_{g^u} with the sequence W^{g^u}.
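
The change of variables (2.53) can be illustrated directly. The Python sketch below (an added, unoptimized illustration, not an efficient algorithm) evaluates the k ≠ 0 outputs of a p-point DFT as the circular correlation (2.54):

    import cmath

    def dft_outputs_by_correlation(x, p, g):
        """X_k, k != 0, of a p-point sequence x (p prime) via the reindexing (2.53)-(2.54).
        g must be a primitive root modulo p."""
        W = cmath.exp(-2j * cmath.pi / p)
        perm = [pow(g, u, p) for u in range(p - 1)]       # n = g^u modulo p, u = 0..p-2
        X = {}
        for v in range(p - 1):
            k = pow(g, v, p)
            # circular correlation of x_{g^u} with W^{g^u}, shifted by v, plus the x_0 term
            X[k] = x[0] + sum(x[perm[u]] * W ** perm[(u + v) % (p - 1)] for u in range(p - 1))
        return X

    # For p = 5 and g = 2, the values agree with a direct evaluation of (2.51).
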

2.1.4 Quadratic Residues

We have seen in the preceding section that primitive roots were closely analogous
to exponentials. We shall discuss here the concept of quadratic residues. It is
notable that this class of residues can be viewed as the equivalent of square roots
defined in the set of integers modulo an integer m.
If (a, m) = 1, a is said to be a quadratic residue of m if x^2 ≡ a modulo m has
a solution. If this congruence has no solution, a is a quadratic nonresidue of m.
It is obvious, from the Chinese remainder theorem, that if a is a quadratic
residue of m and if m is composite, a must be a quadratic residue of each mutual-
ly prime factor of m. Furthermore, it can be shown that a is a quadratic residue of
p^c, with p an odd prime, if and only if a is a quadratic residue of p. When m = 2^c
and a is an odd integer, a is a quadratic residue of 2. Moreover, a is a quadratic
residue of 4 if and only if a ≡ 1 modulo 4, and a is a quadratic residue of 2^k,
k ≥ 3, if and only if a ≡ 1 modulo 8. Thus, we can restrict our discussion to odd
prime moduli since all other cases are deduced easily from this particular case.
In the following, we shall determine the number of distinct quadratic residues
of p and show how to check integers for quadratic residue properties. We first
establish the two following theorems.
Theorem 2.9: If p is an odd prime, the number Q(p) of distinct quadratic residues
is given by

Q(p) = 1 + (p - 1)/2.    (2.55)

By definition, a is a quadratic residue if x^2 ≡ a modulo p has a solution. a = 0
is a trivial solution. For a ≠ 0, we note that, if x is a solution, p - x is also a
solution, since x^2 ≡ a modulo p implies that (p - x)^2 ≡ a modulo p. Thus,
there are at most (p - 1)/2 nonzero solutions: 1^2, 2^2, ..., [(p - 1)/2]^2. If two sol-
utions were identical, we would have

x_1^2 ≡ x_2^2 modulo p.    (2.56)

However, x_1 + x_2 ≤ p - 1 since x_1, x_2 ≤ (p - 1)/2. Thus, x_1 + x_2 ≢ 0 modulo
p and, since p is prime, (2.56) would imply that x_1 ≡ x_2. Under these conditions,
all solutions are distinct and we have (p - 1)/2 distinct quadratic residues
different from zero.

Theorem 2.10: If p is an odd prime, the product of two quadratic residues or of
two quadratic nonresidues is a quadratic residue. The product of a quadratic
residue by a quadratic nonresidue is a quadratic nonresidue.
To prove this theorem, consider the quadratic residues a_1, a_2, ... and the
quadratic nonresidues b_1, b_2, .... If x_1^2 ≡ a_1, x_2^2 ≡ a_2, then (x_1 x_2)^2 ≡ a_1 a_2. More-
over, since p is a prime, if a_l is a quadratic residue, with a_l ≠ 0, the permutation
(a_l·n) maps all values of n for 0 < n < p. We already know that the (p - 1)/2
terms (a_l·a_k) are quadratic residues. Since there can only be (p - 1)/2 quadratic
residues of p, all the terms corresponding to (a_l·b_k) are nonresidues.
Similarly, if we consider the permutations (b_l·n), the (p - 1)/2 terms
corresponding to (a_l·b_k) are nonresidues and therefore the (p - 1)/2 terms
corresponding to (b_l·b_k) are quadratic residues.
We now define the convenient symbol (a/p) due to Legendre. For (a, p) = 1
and p an odd prime,

(a/p) = 1 if a is a quadratic residue of p,
(a/p) = -1 if a is a quadratic nonresidue of p.    (2.57)

We can note immediately that the definition implies that (a/p) = 1 if p | (x^2 - a).
Hence (1/p) = 1 and (a^2/p) = 1, since p | (a^2 - a^2).
In order to use Legendre's symbol for the determination of quadratic re-
sidues, we shall use a criterion introduced by Euler. We define this criterion here
without proof.
Theorem 2.11: If p is an odd prime and a is an integer, then

p | [a^{(p-1)/2} - (a/p)].    (2.58)

We note that, by theorem 2.10, the Legendre symbol is multiplicative

(ab/p) = (a/p)(b/p).    (2.59)

This symbol is also periodic, with period p, since if x_0^2 ≡ a modulo p, we have
obviously x_0^2 ≡ a + kp modulo p. Thus

[(a + kp)/p] = (a/p).    (2.60)

We also give, without proof, the following two theorems.


Theorem 2.12 (Gauss): If p and q are distinct odd primes, then

(p/q)(q/p) = (-1)^{[(p-1)/2][(q-1)/2]}.    (2.61)

Theorem 2.13: If p is an odd prime, then

(2/p) = (-1)^{(p^2 - 1)/8}.    (2.62)



We are now armed with enough material to determine rapidly whether an


integer a is a quadratic residue. Consider, for instance, the case corresponding to
p = 53. We want to know if a = 33 is a quadratic residue. Hence we want to
compute (33/53). As a first step, we use the multiplicative properties of Leg-
endre's symbol. Thus, (33/53) = (3/53)(11/53), and the computation of (33/53)
is reduced to the simpler problem of evaluating (3/53) and (11/53). Since 3 and
53 are odd primes, we compute (3/53) by theorem 2.12, with p = 3, q = 53. This
implies (p - 1)/2 = 1 and (q - 1)/2 = 26. Thus, (-1)^{[(p-1)/2][(q-1)/2]} = 1 and
(3/53) = (53/3) by (2.61). We now use the periodicity property of Legendre's
symbol to reduce 53 modulo 3 in (53/3). This gives the relation

(53/3) = (2/3)

and, finally,

(3/53) = (2/3) = -1, by (2.62).

The symbol (11/53) is evaluated similarly:

(11/53) = (53/11) by (2.61)
(11/53) = (9/11) by (2.60)
        = (3/11)(3/11) by (2.59).

Thus, (11/53) = 1. Since we have already shown that (3/53) = -1, (33/53)
= (3/53)(11/53) = -1 and 33 is a quadratic nonresidue of 53. This means that
it is impossible to find any integer x such that x^2 ≡ 33 modulo 53.
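
Euler's criterion (2.58) also gives a direct computational test. A minimal Python sketch, added here for illustration (names are ours):

    def legendre(a, p):
        """Legendre symbol (a/p) for p an odd prime, via Euler's criterion (2.58)."""
        s = pow(a, (p - 1) // 2, p)        # a^((p-1)/2) is congruent to (a/p) modulo p
        return -1 if s == p - 1 else s     # map p-1 back to -1; result is 0, 1, or -1

    # Example from the text: legendre(33, 53) returns -1, so x^2 ≡ 33 (mod 53) has no solution.
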

2.1.5 Mersenne and Fermat Numbers

Mersenne numbers are defined by

M_p = 2^p - 1,    (2.63)

with p an odd prime. Fermat numbers are defined by

F_t = 2^{2^t} + 1,    (2.64)

where t is any positive integer. These numbers are important in digital signal
processing because arithmetic operations modulo Mersenne and Fermat num-
bers can be implemented relatively simply in digital hardware. This stems from
the fact that the machine representation of numbers, usually given in binary
notation by a B-bit number

a = Σ_{i=0}^{B-1} a_i 2^i,    a_i ∈ {0, 1},    (2.65)

can have an arithmetic overflow at or near the value of M_p and F_t.


If a is an integer, operations modulo a Mersenne number M_p are greatly
simplified by noting that 2^p ≡ 1. Therefore reduction of the integer a modulo
M_p can be done without division using the expression

a ≡ Σ_{i=0}^{p-1} (Σ_k a_{i+kp}) 2^i,    (2.66)

where i in (2.65) is replaced by i + kp. When two integers a and b, defined
modulo M_p, are added together, this generates a (p + 1)-bit result, which is
reduced modulo M_p by simply adding the most significant carry to the least
significant bit position. Thus, operations modulo M_p are equivalent to the
familiar one's complement arithmetic. Operations modulo a Fermat number are
slightly more difficult to implement, but remain relatively simple when compared
with other numbers.
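
The reduction (2.66) uses only shifts and additions. A minimal Python sketch, added here for illustration (names are ours):

    def reduce_mod_mersenne(a, p):
        """Reduce a nonnegative integer modulo M_p = 2^p - 1 with shifts and adds, as in (2.66)."""
        mask = (1 << p) - 1                # M_p
        while a > mask:
            a = (a & mask) + (a >> p)      # fold the high bits back: 2^p ≡ 1 modulo M_p
        return 0 if a == mask else a       # M_p itself represents 0

    # Example: with p = 3 (M_3 = 7), reduce_mod_mersenne(38, 3) returns 3, since 38 = 5·7 + 3.
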
We shall see in Chap. 8 that Mersenne and Fermat numbers play an im-
portant role in number theoretic transforms and we shall now establish some
properties of these numbers which will be used in Chap. 8. A particularly
important property is the order of a root modulo M_p or F_t since it acts to con-
strain transform length.
Starting with Mersenne numbers, we note that some numbers, like M_3 = 7,
are prime, while others, like M_11, are composite. For composite Mersenne
numbers, we have the following theorem.
Theorem 2.14: Every prime divisor q of a Mersenne number is given by

q = 2kp + 1.    (2.67)

By definition q | 2^p - 1. Fermat's theorem (theorem 2.4) also implies that
q | 2^{q-1} - 1. Since 2^p ≡ 1 and 2^{q-1} ≡ 1, we have 2^d ≡ 1 modulo q, with d = (p,
q - 1). Moreover, the condition 2^d ≡ 1 modulo q implies that d ≠ 1, since q ≠ 1.
Therefore, since p is a prime, p | q - 1 and we have q = sp + 1. However,
s cannot be odd, because q would then be even, which is impossible. Thus
q = 2kp + 1.
An immediate consequence of theorem 2.14 is the following theorem.
Theorem 2.15: All Mersenne numbers are relatively prime.
If two Mersenne numbers M_{p_1} and M_{p_2} were not relatively prime, this would
imply that 2k_1 p_1 + 1 | 2k_2 p_2 + 1. Since p_1, p_2, 2k_1 p_1 + 1, and 2k_2 p_2 + 1 are
prime, this is possible only for p_1 = p_2 or M_{p_1} = M_{p_2}.
We see from theorem 2.14 that, for composite Mersenne numbers, every di-
visor q of M_p is such that 2p | q - 1. Thus, for M_p composite, it is always possible
to find roots of order p and 2p modulo M_p. Since 2^p ≡ 1 modulo M_p, 2 and -2
are obviously roots of order p and 2p. When M_p is a prime, the primitive roots
are of order 2^p - 2. Thus, any root must be of order d such that d | 2^p - 2.
Obviously 2 | 2^p - 2 and any other divisor must be odd. By Fermat's theorem,
p | 2^p - 2. Moreover, since a product of 3 consecutive integers is always divisible
by 3, 3 | [2^{(p-1)/2} - 1] 2^{(p-1)/2} [2^{(p-1)/2} + 1] and therefore 3 | (2^p - 2). Thus, any
prime Mersenne number has roots whose orders are factors of 6p. In particular,
the roots 2 and -2 are of orders p and 2p, respectively. We also note that -1 is
a root of order 2 for any Mersenne number and that there are no roots of order
4. Hence -1 is a quadratic nonresidue for any Mersenne number.
We consider now the Fermat numbers F_t. The first five Fermat numbers for
t = 0 to t = 4 are prime. All other known Fermat numbers are composite. We
shall give here some interesting properties of Fermat numbers.
Theorem 2.16: All Fermat numbers are mutually prime.
To prove this theorem, we note first that F_2 = F_0 F_1 + 2 and F_3 = F_0 F_1 F_2
+ 2. Assume now that

F_t = F_0 F_1 ... F_{t-1} + 2.    (2.68)

If we multiply both sides of (2.68) by F_t, we have

F_t^2 = F_0 F_1 ... F_{t-1} F_t + 2 F_t.    (2.69)

By the definition of Fermat numbers, F_{t+1} = 2^{2^{t+1}} + 1 = (F_t - 1)^2 + 1. Thus,

F_{t+1} = F_t^2 - 2 F_t + 2    (2.70)

and, by substituting F_t^2 from (2.69) into (2.70),

F_{t+1} = F_0 F_1 ... F_t + 2.    (2.71)

Hence, we have established (2.68) by induction. Suppose now that two Fermat
numbers are not relatively prime. Then, (F_m, F_k) = d, with d ≠ 1, and we
would have d | F_m and d | F_k. In this case, (2.71) would imply that d | 2. This is im-
possible because d would have to be even and thus could not divide any Fermat
number. Hence d = 1 and all Fermat numbers are mutually prime.
Theorem 2.17: 3 is a primitive root of all prime Fermat numbers.
Any primitive root g must be a quadratic nonresidue because, if it were a
quadratic residue, some powers of g would not be distinct. By theorem 2.9, the
number Q(Ft) of distinct quadratic nonresidues is equal to 2^{2^t − 1}. We also know,
by theorem 2.8, that there are φ(Ft − 1) = 2^{2^t − 1} distinct primitive roots modulo
Ft. Since Q(Ft) = φ(Ft − 1), all quadratic nonresidues are primitive roots and
we need only to show that 3 is a quadratic nonresidue to prove the theorem. In
order to show that 3 is a quadratic nonresidue for all prime Fermat numbers, we first
note, by direct verification, that 3 is a primitive root modulo F1 = 5, since 3 is a
root of order 4 modulo 5. We then show, by induction, that for any Fermat
number,

Ft ≡ 5 modulo 12.   (2.72)

This can be seen by noting that, if Ft = 12k + 5, then Ft+1 = (Ft − 1)^2 + 1
= (12k + 4)^2 + 1 = 12k1 + 5. Thus, we can check whether 3 is a quadratic
nonresidue by computing Legendre's symbol [3/(12k + 5)]. We have

[3/(12k + 5)] = [(12k + 5)/3] = (2/3) = −1.

Hence, 3 is a quadratic nonresidue modulo Ft and, therefore, 3 is a primitive
root modulo Ft.
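A quick numerical verification of this result, using Euler's criterion to test that 3 is a quadratic nonresidue of the four prime Fermat numbers F1 to F4, might look as follows (the check is only illustrative):

# Illustrative check (not from the text): 3 is a quadratic nonresidue, hence a
# primitive root, of the prime Fermat numbers F1 = 5, F2 = 17, F3 = 257, F4 = 65537.

for t in range(1, 5):
    Ft = 2**(2**t) + 1
    assert pow(3, (Ft - 1) // 2, Ft) == Ft - 1   # Euler's criterion: quadratic nonresidue
    assert Ft % 12 == 5                          # relation (2.72)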
When a Fermat number Ft is composite (t > 4), the following theorem is
used for specifying the order of the various roots modulo Ft.
Theorem 2.18: Every prime factor of a composite Fermat number Ft is of the
form k·2^{t+2} + 1.
The proof of this theorem can be found in [2.6]. An immediate consequence
of theorem 2.18 is that every Fermat number has roots of order d = 2^n, with
n ≤ t + 2. The integers 2 and −2 are obviously roots of order 2^{t+1}. A simple root
of order 2^{t+2} can be found by noting that

[2^{2^{t−2}}(1 + 2^{2^{t−1}})]^2 ≡ −2 modulo Ft.   (2.73)

Thus, in the ring of integers modulo Ft, 2^{2^{t−2}}(1 + 2^{2^{t−1}}) is congruent to √−2
and is therefore a root of order 2^{t+2}.
We also note that, since (2^{2^{t−1}})^2 ≡ −1 modulo Ft, −1 is a quadratic re-
sidue of Ft. This means that j = √−1 is real in the ring of integers modulo
Ft, with j ≡ 2^{2^{t−1}}. We shall use this property in Chap. 8 to simplify the com-
putation of complex convolutions.
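A short Python check of these roots for F2 = 17 (t = 2) might read as follows (the variable names are only illustrative):

# Illustrative check (not from the text) of sqrt(-2) and j modulo F2 = 17.
t = 2
Ft = 2**(2**t) + 1                       # 17
j = pow(2, 2**(t - 1), Ft)               # j = 2^(2^(t-1)) = 4
assert (j * j) % Ft == Ft - 1            # j^2 = -1, so -1 is a quadratic residue
x = (pow(2, 2**(t - 2), Ft) * (1 + pow(2, 2**(t - 1), Ft))) % Ft
assert (x * x) % Ft == Ft - 2            # x^2 = -2 modulo Ft, as in (2.73)
k, y = 1, x                              # order of x should be 2^(t+2) = 16
while y != 1:
    y = (y * x) % Ft
    k += 1
assert k == 2**(t + 2)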
Mersenne and Fermat numbers have many other interesting properties that
cannot be discussed in detail here. Some of these properties can be found in [2.7].

2.2 Polynomial Algebra

Polynomial algebra plays an important role in digital signal processing because


convolutions and, to some extent, DFTs can be expressed in terms of operations
on polynomials. This can be seen by considering the simple convolution Yl of
two sequences hn and Xm of N terms
y_l = Σ_{n=0}^{N−1} h_n x_{l−n},   l = 0, ..., 2N − 2.   (2.74)

Now suppose that the N elements of h n and Xm are assigned to be the coefficients
of polynomials H(z) and X(z) of degree N - 1 in z, z being the polynomial
variable. Hence we have

H(z) = Σ_{n=0}^{N−1} h_n z^n   (2.75)

X(z) = Σ_{m=0}^{N−1} x_m z^m.   (2.76)

If we multiply H(z) by X(z), the resulting polynomial Y(z) will be of degree
2N − 2, since H(z) and X(z) are of degree N − 1. Thus,

Y(z) = H(z)X(z) = Σ_{l=0}^{2N−2} a_l z^l.   (2.77)

In the polynomial multiplication, each coefficient a_l of z^l is obtained by
summing all products h_n x_m such that n + m = l. Hence, m = l − n and

a_l = Σ_n h_n x_{l−n}   (2.78)

a_l = y_l.   (2.79)

This means that the convolution of two sequences can be treated as the multi-
plication of two polynomials. Moreover, if the convolution defined by (2.74)
is cyclic, the indices l, m, and n are defined modulo N. Thus, in N-term cyclic
convolutions, the exponent N is equivalent to the exponent 0. This implies that z^N ≡ 1 and therefore that a cy-
clic convolution can be viewed as the product of two polynomials modulo the
polynomial z^N − 1

Y(z) == H(z)X(z) modulo (ZN - 1). (2.80)
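A minimal Python sketch of this equivalence, folding the coefficients of z^N, ..., z^{2N−2} back onto z^0, ..., z^{N−2}, might read as follows (numpy.convolve is used here only as a convenient polynomial multiplier):

# Illustrative sketch (not from the text): an N-term cyclic convolution as a
# polynomial product reduced modulo z^N - 1, checked against the direct sum.

import numpy as np

def cyclic_convolution(h, x):
    N = len(h)
    return [sum(h[n] * x[(l - n) % N] for n in range(N)) for l in range(N)]

def cyclic_by_polynomial_product(h, x):
    N = len(h)
    a = np.convolve(h, x)          # aperiodic product H(z)X(z), degree 2N - 2
    y = a[:N].copy()
    y[: N - 1] += a[N:]            # z^N = 1: fold the coefficients of z^N, ..., z^(2N-2)
    return [int(v) for v in y]

h = [1, 2, 3, 4]
x = [5, 6, 7, 8]
assert cyclic_convolution(h, x) == cyclic_by_polynomial_product(h, x)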

Thus, in order to deal with convolutions analytically, one must define various
operations where the usual number sets are replaced by sets of polynomials.
These operations on polynomials bear a strong relationship to operations on in-
tegers and can be treated in a unified way by using the concepts of groups,
rings, and fields. In the following, we shall give only the flavor of these concepts,
since full details are available in any textbook on modern algebra [2.8].

2.2.1 Groups

Consider a set A of N elements a, b, c, .... These elements could be, for instance,
positive integers or polynomials. Now suppose that we can relate elements in
the set by an operation which is denoted ⊕. Again, this operation is quite general
and could be, for example, an addition or a logical OR operation, the only con-
straint at this stage being that a, b, and c pertain to the set A, with

c = a ⊕ b,   a, b, c ∈ A.   (2.81)



Then, any set which satisfies the following conditions is called a group:
-Associative law: a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ c.
-Identity element: There is an element e of the group which, for any element a
of the group, is such that e ⊕ a = a.
-Inverse: Every element a of A has an inverse ā which is an element of the
group: a ⊕ ā = ā ⊕ a = e.
When the operation is commutative, with a ⊕ b = b ⊕ a, the group is called
Abelian. The order of a group is the number of elements of this group.
Now consider a group having a finite number of elements, and the successive
operations a ⊕ a, a ⊕ a ⊕ a, a ⊕ a ⊕ a ⊕ a, .... Each of these operations
produces an element of the group. Since the group is finite, the sequence will
necessarily repeat itself with a period r. r is called the order of the element a. If
the order of an element g is the same as the order of the group, all elements of the
group are generated by g with the operations g, g ⊕ g, g ⊕ g ⊕ g, .... In this
case, g is called a generator and the group is called a cyclic group.
In order to illustrate these concepts, let us consider the set A of N integers
0, 1, ..., N − 1. For addition modulo N, a + b ≡ c, with a, b, c ∈ A. Moreover,
a + (b + c) ≡ (a + b) + c, 0 + a ≡ a, and a + (N − a) ≡ 0. Thus, A is a
group with respect to addition modulo N. This group is Abelian, since a + b ≡
b + a. It is also cyclic with the integer 1 as generator, since all elements of the
group are generated by adding 1 to the preceding element. We now consider the
set B of N − 1 integers, 1, 2, ..., N − 1 with identity element 1 and with the
addition modulo N replaced by the multiplication modulo N. B is generally
not a group with respect to multiplication modulo N, because some elements of
the set have no inverse. For instance, if N = 6, only 1 and 5 have inverses. Thus,
the set of integers 1, 2, 3, 4, 5 is not a group with respect to multiplication modulo
6. Note however that, when N is a prime, then B becomes a cyclic group. For
instance, if N = 5, the inverses of 1, 2, 3, 4 are, respectively, 1, 3, 2, 4 and
therefore the set of integers 1, 2, 3, 4 is a group, which is cyclic with generators
2 and 3. It can be seen that the group of the N integers 0, 1, ..., N − 1 with addi-
tion modulo N has the same structure as the group of the N integers 1, 2, ..., N
with multiplication modulo (N + 1), N + 1 being a prime. Such a relation be-
tween two groups is called isomorphism.

2.2.2 Rings and Fields

A set A is a ring with respect to the two operations ⊕ and ⊗ if the following
conditions are fulfilled:
-(A, ⊕) is an Abelian group.
-If c = a ⊗ b, for a, b ∈ A, then c ∈ A.
-Associative law: a ⊗ (b ⊗ c) = (a ⊗ b) ⊗ c.
-Distributive law: a ⊗ (b ⊕ c) = a ⊗ b ⊕ a ⊗ c and (b ⊕ c) ⊗ a = b ⊗ a ⊕ c ⊗ a.

The ring is commutative if the law ⊗ is commutative and it is a unit ring if there
is one (and only one) identity element u for the law ⊗.
It can be verified easily that the set of integers is a ring with respect to addition
and multiplication.
If we now require that the operation ⊗ satisfies the additional condition
that every element a has one (and only one) inverse (a ⊗ ā = u), then a unit ring
becomes a field. It can be verified easily that, for any prime p, the set of integers
0, 1, ..., p − 1 forms a field with addition and multiplication modulo p. This field
is called a Galois field and denoted GF(p).
We shall give here several important results concerning fields.
Theorem 2.19: If a, b, c are elements of a field, the condition a ⊗ c = b ⊗ c
implies that a = b.
We have a ⊗ c = b ⊗ c. Thus, if c̄ is the inverse of c, we have a ⊗ c ⊗ c̄
= b ⊗ c ⊗ c̄. This implies that a ⊗ u = b ⊗ u and therefore that a = b.
A consequence of this theorem is that, if we consider the set S of the n distinct
elements a1, a2, ..., an of a finite field, then the n elements a1 ⊗ a1, a1 ⊗ a2, ...,
a1 ⊗ an are all distinct. Since the result of the operation ⊗ is, by definition, an
element of the field, the n elements a1 ⊗ a1, a1 ⊗ a2, ..., a1 ⊗ an are the set S.
This generalizes to any field the concept of permutation that has been introduced
by (2.27) for fields of integers modulo a prime p. By using an approach quite
similar to that used for fields of integers, it is also possible to show that, for all
finite fields, there are primitive roots g which generate all field elements, except e,
by the successive operations g ⊗ g, g ⊗ g ⊗ g, ....
Another important property is that all finite fields have a number of elements
which is p^d, where p is a prime. These fields are denoted GF(p^d).
In the rest of this chapter, we shall restrict our discussion to rings and fields
of polynomials. In these cases, the operations ⊕ and ⊗ usually reduce to addi-
tions and multiplications modulo polynomials. In order to simplify the notation,
we shall replace the special symbols ⊕ and ⊗ with the notation that has been de-
fined for residue arithmetic. Using this notation, we first introduce residue poly-
nomials and the Chinese remainder theorem.

2.2.3 Residue Polynomials

The theory of residue polynomials is closely related to the theory of integer


residue classes. Thus, our presentation begins with the concept of polynomial
division. In this presentation, we shall assume that all polynomial coefficients are
defined in a field, in order to ensure that the usual arithmetic operations can be
performed without restrictions on these coefficients. We commence with several
basic definitions. A polynomial P(z) divides a polynomial H(z) if a polynomial
D(z) can be found such that H(z) = P(z)D(z). H(z) is said to be irreducible if its
only divisors are of degree equal to zero. If P(z) is not a divisor of H(z), the di-
vision of H(z) by P(z) will produce a residue R(z)

H(z) = P(z)D(z) + R(z), (2.82)

where the degree of R(z) is less than the degree of P(z). This representation is
unique. All polynomials having the same residue when divided by P(z) are said
to be congruent modulo P(z) and the relation is denoted by

R(z) == H(z) modulo P(z). (2.83)

At this point, it is worth noting that when we deal with polynomials, we are
mainly interested in the coefficients of the polynomials. Thus, if we have a set of
N elements a0, a1, ..., aN−1, arranging these elements in the form of a polynomial
H(z) = a0 + a1z + a2z^2 + ... + aN−1 z^{N−1} of the dummy variable z is essentially a
convenient way of tagging the position of an element ai relative to the others.
This feature is very important in digital signal processing because each poly-
nomial coefficient represents a sample of an analog signal stream and therefore
defines its location and intensity.
Returning to the congruence relation (2.83), we see that two polynomials
which differ only by a multiplicative constant are congruent. Thus, residue
polynomials deal with the relative values of coefficients rather than with their
absolute values. Equation (2.83) defines equivalence classes of polynomials
modulo a polynomial P(z). It can be verified easily, by referring to the definitions
in the preceding section, that the set of polynomials defined with addition and
multiplication modulo P(z) is a ring and reduces to a field when P(z) is irreduci-
ble.
When P(z) is not irreducible, it can always be factorized uniquely into powers
of irreducible polynomials. Note however that the factorization depends on the
field of coefficients: z^2 + 1 is irreducible for coefficients in the field of rational
numbers. If the coefficients are defined in the field of complex numbers, then
z^2 + 1 = (z − j)(z + j), j = √−1.
Now suppose that P(z) is the product of d polynomials Pi(z) having no com-
mon factors (these polynomials are usually called relatively prime polynomials
by analogy with relatively prime numbers)

P(z) = ∏_{i=1}^{d} Pi(z).   (2.84)

Since each of these polynomials Pi(z) is relatively prime with all the other poly-
nomials Pl(z), it has an inverse modulo every other polynomial. This means that
we can extend the Chinese remainder theorem to the ring of polynomials modulo
P(z) and therefore express uniquely H(z) as a function of the polynomials Hi(z)
obtained by reducing H(z) modulo the various polynomials Pi(z). The Chinese
remainder theorem is then expressed as

H(z) ≡ Σ_{i=1}^{d} Si(z)Hi(z) modulo P(z),   (2.85)

where, for every value u of i,

Su(z) ≡ 0 modulo Pi(z), i ≠ u
      ≡ 1 modulo Pu(z)   (2.86)

and

Su(z) ≡ Tu(z) ∏_{i=1, i≠u}^{d} Pi(z) modulo P(z)   (2.87)

with Tu(z) defined by

Tu(z) ∏_{i=1, i≠u}^{d} Pi(z) ≡ 1 modulo Pu(z).   (2.88)

Note that (2.88) implies that Tu(z) ≢ 0 modulo Pu(z). Thus, when Su(z) is re-
duced modulo the various polynomials Pi(z), we obtain (2.86). Therefore, when
H(z), defined by (2.85), is reduced modulo Pi(z), we obtain Hi(z) ≡ H(z) modulo
Pi(z), which completes the proof of the theorem.
When computing H(z) from the various residues Hi(z) by the Chinese re-
mainder theorem, one must determine the various polynomials Si(z). For a given
P(z), these polynomials are computed once and for all by (2.87). The most dif-
ficult part of calculating Si(z) relates to the evaluation of the inverses Ti(z)
defined by (2.88). This is done by using Euclid's algorithm, as described in Sect.
2.1, but with integers replaced by polynomials. The polynomials Si(z) can also
be computed very simply by using computer programs for symbolic mathe-
matical manipulation [2.9, 10].
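A minimal Python sketch of this procedure, computing S1(z) and S2(z) for the factorization z^3 − 1 = (z − 1)(z^2 + z + 1) with an extended Euclid's algorithm on polynomials over the rationals, might read as follows (all helper functions are only illustrative; polynomials are coefficient lists in ascending powers of z):

# Illustrative sketch (not from the text): computing the Si(z) of (2.85)-(2.88).

from fractions import Fraction as F

def trim(p):
    while len(p) > 1 and p[-1] == 0:
        p = p[:-1]
    return p

def add(a, b):
    n = max(len(a), len(b))
    return trim([(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(n)])

def mul(a, b):
    r = [F(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            r[i + j] += ai * bj
    return trim(r)

def divmod_poly(a, b):
    a, q = list(a), [F(0)] * max(1, len(a) - len(b) + 1)
    while len(a) >= len(b) and any(a):
        k, c = len(a) - len(b), a[-1] / b[-1]
        q[k] = c
        for i in range(len(b)):
            a[k + i] -= c * b[i]
        a = trim(a)
    return trim(q), trim(a)

def xgcd(a, b):
    # extended Euclid's algorithm: returns (g, s, t) with s*a + t*b = g
    s0, s1, t0, t1 = [F(1)], [F(0)], [F(0)], [F(1)]
    while any(b):
        q, r = divmod_poly(a, b)
        a, b = b, r
        s0, s1 = s1, add(s0, mul([-c for c in q], s1))
        t0, t1 = t1, add(t0, mul([-c for c in q], t1))
    return a, s0, t0

P1 = [F(-1), F(1)]            # z - 1
P2 = [F(1), F(1), F(1)]       # z^2 + z + 1
P  = mul(P1, P2)              # z^3 - 1

g, s, _ = xgcd(P2, P1)        # T1(z) P2(z) = 1 modulo P1(z)
S1 = mul([c / g[0] for c in s], P2)    # S1(z) = (1 + z + z^2)/3
g, s, _ = xgcd(P1, P2)        # T2(z) P1(z) = 1 modulo P2(z)
S2 = mul([c / g[0] for c in s], P1)    # S2(z) = (2 - z - z^2)/3

_, r = divmod_poly(add(S1, S2), P)     # S1 + S2 = 1 modulo P(z)
assert r == [F(1)]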

2.2.4 Convolution and Polynomial Product Algorithms in Polynomial Algebra

The Chinese remainder theorem plays a central role in the computation of


convolutions because it allows one to replace the evaluation of a single long
convolution by that of several short convolutions. We shall now show that the
Chinese remainder theorem can be used to specify the lower bounds on the
number of multiplications required to compute convolutions and polynomial
products. In the case of an aperiodic convolution, these lower bounds are given
by the Cook-Toom algorithm [2.11]:
Theorem 2.20: The aperiodic convolution of two sequences of lengths L1 and L2
is computed with L1 + L2 − 1 general multiplications.
In polynomial notation, the aperiodic convolution y_l of two sequences h_n
and x_m is defined by

H(z) = Σ_{n=0}^{L1−1} h_n z^n   (2.89)

X(z) = Σ_{m=0}^{L2−1} x_m z^m   (2.90)

Y(z) = H(z)X(z) = Σ_{l=0}^{L1+L2−2} y_l z^l.   (2.91)

Since Y(z) is of degree L1 + L2 − 2, Y(z) is unchanged if it is defined modulo
any polynomial P(z) of degree equal to L1 + L2 − 1

Y(z) == H(z)X(z) modulo P(z). (2.92)

We now assume that P(z) is chosen to be the product of L1 + L2 − 1 first degree
relatively prime polynomials

P(z) = ∏_{i=1}^{L1+L2−1} (z − a_i),   (2.93)

where the a_i are L1 + L2 − 1 distinct numbers in the field F of coefficients. Since
P(z) is the product of L1 + L2 − 1 relatively prime polynomials, we can apply
the Chinese remainder theorem to the computation of (2.92). This is done by
reducing the polynomials H(z) and X(z) modulo (z − a_i), performing L1 + L2 − 1
polynomial multiplications Hi(z)Xi(z) on the reduced polynomials, and recon-
structing Y(z) by the Chinese remainder theorem. We note however that the
reductions modulo (z − a_i) are equivalent to substitutions of a_i for z in H(z)
and X(z). Thus, the reduced polynomials Hi(z) and Xi(z) are the simple scalars
H(a_i) and X(a_i) so that the polynomial multiplications reduce to L1 + L2 − 1
scalar multiplications H(a_i)X(a_i). This completes the proof of the theorem.
Note that this theorem provides not only a lower bound on the number of
general multiplications, but also a practical algorithm for achieving this lower
bound. However, the bound concerns only the number of general multiplica-
tions, that is to say, the multiplications where the two factors depend on the
data. The bound does not include multiplications by constant factors which
occur in the reductions modulo (z − a_i) and in the Chinese remainder recon-
struction. For short convolutions, the L1 + L2 − 1 distinct a_i can be chosen to
be simple integers such as 0, +1, −1 so that these multiplications are either
trivial or reduced to a few additions. For longer convolutions, the a_i must be
chosen among a larger set of distinct values. In this case, some of the a_i are no
longer simple so that multiplications in the reductions and Chinese remainder
operation are unavoidable. This means that the Cook-Toom algorithm is
practical only for short convolutions. For longer convolutions, better algorithms
can be obtained by using a transform approach which we now show to be
closely related to the Cook-Toom algorithm and Lagrange interpolation.
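A worked sketch of the Cook-Toom algorithm for L1 = L2 = 2, with the three interpolation points 0, +1, −1, might read as follows in Python (the reconstruction formulas are simply the Lagrange coefficients for these points; the function name is illustrative):

# Illustrative sketch (not from the text): Cook-Toom aperiodic convolution
# of two length-2 sequences with 3 general multiplications.

from fractions import Fraction as F

def cook_toom_2x2(h, x):
    a = [F(0), F(1), F(-1)]
    # reductions modulo (z - ai): evaluate H(z) and X(z) at the three points
    H = [h[0] + h[1] * ai for ai in a]
    X = [x[0] + x[1] * ai for ai in a]
    # the only general multiplications: L1 + L2 - 1 = 3 of them
    Y = [Hi * Xi for Hi, Xi in zip(H, X)]
    # Chinese remainder (Lagrange) reconstruction of Y(z) = y0 + y1 z + y2 z^2
    y0 = Y[0]
    y1 = (Y[1] - Y[2]) / 2
    y2 = (Y[1] + Y[2]) / 2 - Y[0]
    return [y0, y1, y2]

h, x = [F(2), F(3)], [F(5), F(7)]
# direct check: (2 + 3z)(5 + 7z) = 10 + 29z + 21z^2
assert cook_toom_2x2(h, x) == [F(10), F(29), F(21)]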
This can be seen by noting that Y(z) is reconstructed by the Chinese re-
mainder theorem from the L1 + L2 − 1 scalars Y(a_i) obtained by substituting
a_i for z in Y(z). Thus, the Cook-Toom algorithm expresses a Lagrange inter-
polation process [2.9]. Since the field F of coefficients and the interpolation values
can be chosen at will, we can select the a_i to be the L1 + L2 − 1 successive
powers of a number W, provided that all these numbers are distinct in the field
F. In this case, a_i = W^i and the reductions modulo (z − a_i) are expressed by

H(W^i) = Σ_{n=0}^{L1−1} h_n W^{ni},   (2.94)

with similar relations for X(W^i). Thus, with this particular choice of a_i, the
Cook-Toom algorithm reduces to computing aperiodic convolutions with
transforms having the DFT structure. In particular, if W = 2 and if F is the field
of integers modulo a Mersenne number (2^p − 1, p prime) or a Fermat number
(2^v + 1, v = 2^t), the Cook-Toom algorithm defines a Mersenne or a Fermat
transform (Chap. 8).
When W = e^{−2jπ/(L1+L2−1)}, j = √−1, the Cook-Toom algorithm can be
viewed as the computation of an aperiodic convolution by DFTs. In this case,
P(z) becomes

P(z) = z^{L1+L2−1} − 1.   (2.95)

Hence, if the interpolation points are chosen to be complex exponentials, the
Cook-Toom algorithm is equivalent to computing with DFTs the circular con-
volution of two input sequences obtained by appending L2 − 1 zeros to the se-
quence h_n and L1 − 1 zeros to the sequence x_m. This computation method will be
described more exhaustively in Chap. 3 and is known as the overlap-add method.
The computational complexity results concerning convolutions have been
extended by Winograd [2.12] to polynomial products modulo a polynomial P(z).
Theorem 2.21: A polynomial product Y(z) ≡ H(z)X(z) modulo P(z) is com-
puted with 2D − d general multiplications, where D is the degree of P(z) and d is
the number of irreducible factors Pi(z) of P(z) over the field F.
This theorem is proved by again using the Chinese remainder theorem. Y(z)
is computed by calculating the reduced polynomials Hi(z) ≡ H(z) modulo
Pi(z), Xi(z) ≡ X(z) modulo Pi(z), evaluating the d polynomial products Yi(z) ≡
Hi(z)Xi(z) modulo Pi(z), and reconstructing Y(z) from Yi(z) by the Chinese re-
mainder theorem. As before, the multiplications by scalars corresponding to the
reductions and Chinese remainder reconstruction are not counted and the only
general multiplications are those corresponding to the d products Yi(z) evaluated
modulo polynomials Pi(z) of degree Di. Since P(z) is given by

P(z) = ∏_{i=1}^{d} Pi(z),   (2.96)

we have

D = Σ_{i=1}^{d} Di.   (2.97)

Each polynomial product Yi(z) can be computed as an aperiodic convolution of
two sequences of Di terms, followed by a reduction modulo Pi(z)

Yi(z) ≡ Hi(z)Xi(z) modulo Pi(z).   (2.98)

Thus, by theorem 2.20, Yi(z) is calculated with 2Di − 1 general multiplications
and the total number of multiplications becomes Σ_{i=1}^{d} (2Di − 1) = 2D − d. This
completes the proof of theorem 2.21.
We have already seen, with (2.80), that an N-point circular convolution can
be considered as a polynomial product modulo (z^N − 1). If F is the field of com-
plex numbers, (z^N − 1) factors into N polynomials (z − W^i) of degree 1, with
W = e^{−j2π/N}, j = √−1. In this case, the computation technique defined by
theorem 2.21 is equivalent to the DFT approach and requires only N general
multiplications. Unfortunately, the W^i are irrational and complex so that the
multiplications by scalars corresponding to DFT computation must also be
considered as general multiplications.
When F is the field of rational numbers, z^N − 1 factors into polynomials
having coefficients that are rational numbers. These polynomials are called
cyclotomic polynomials [2.1] and are irreducible for coefficients in the field of
rational numbers. The number d of distinct cyclotomic polynomials which are
factors of z^N − 1 can be shown to be equal to the number of divisors of N, in-
cluding 1 and N. Thus, we have one cyclotomic polynomial of degree Di for each
divisor Ni of N and Di can be shown to be [2.1]

Di = φ(Ni),   (2.99)

where φ(Ni) is Euler's totient function (Sect. 2.1.3). Thus, for circular convolu-
tions with coefficients in the field of rationals, theorem 2.21 reduces to theorem
2.22.
Theorem 2.22: An N-point circular convolution is computed with 2N − d general
multiplications, where d is the number of divisors of N, including 1 and N.
Theorems 2.21 and 2.22 provide an efficient way of computing circular convolu-
tions because the coefficients of the cyclotomic polynomials are simple integers
and can be simply 0, +1, −1, except for very large cyclotomic polynomials.
When N is a prime, for instance, z^N − 1 = (z − 1)(z^{N−1} + z^{N−2} + ... + 1).
Thus, the reductions and Chinese remainder reconstruction are implemented
with a small number of additions and, usually, without multiplications. In order
to illustrate this computation procedure, consider, for instance, a circular con-
volution of 3 points. Since 3 is a prime, we have z^3 − 1 = (z − 1)(z^2 + z + 1).
Reducing X(z) modulo (z − 1) is done with 2 additions by simply substituting 1
for z in X(z). For the reduction modulo (z^2 + z + 1), we note that z^2 ≡ −z − 1.
Thus, this reduction is also done with 2 additions by subtracting the coefficient
of z^2 in X(z) from the coefficients of z^0 and z^1. When the sequence H(z) is fixed,
the Chinese remainder reconstruction can be considered as the inverse of the
reductions and is done with a total of 4 additions. Moreover, the polynomial
multiplication modulo (z − 1) is a simple scalar multiplication and the poly-
nomial multiplication modulo (z^2 + z + 1) is done with 3 multiplications and 3
additions as shown in Sect. 3.7.2. Thus, when H(z) is fixed, a 3-point circular
convolution is computed with 4 multiplications and 11 additions as opposed to 9
multiplications and 6 additions for direct computation. We shall see in Chap. 3
that a systematic application of the methods defined by theorems 2.20-2.22
allows one to design very efficient convolution algorithms.
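A Python sketch of this 3-point procedure might read as follows; the 3-multiplication product modulo (z^2 + z + 1) used here is a generic Karatsuba-style scheme and is not necessarily the exact algorithm of Sect. 3.7.2:

# Illustrative sketch (not from the text): a 3-point circular convolution
# through z^3 - 1 = (z - 1)(z^2 + z + 1). Only 4 products involve both h and x.

from fractions import Fraction as F

def conv3(h, x):
    h, x = [F(v) for v in h], [F(v) for v in x]
    # reductions modulo (z - 1) and (z^2 + z + 1)
    H1, X1 = sum(h), sum(x)
    H2 = [h[0] - h[2], h[1] - h[2]]
    X2 = [x[0] - x[2], x[1] - x[2]]
    # product modulo (z - 1): 1 multiplication
    Y1 = H1 * X1
    # product modulo (z^2 + z + 1): 3 multiplications (Karatsuba-style)
    m0, m1 = H2[0] * X2[0], H2[1] * X2[1]
    m2 = (H2[0] + H2[1]) * (X2[0] + X2[1])
    Y2 = [m0 - m1, m2 - m0 - 2 * m1]
    # Chinese remainder reconstruction with S1 = (1+z+z^2)/3, S2 = (2-z-z^2)/3
    y0 = (Y1 + 2 * Y2[0] - Y2[1]) / 3
    y1 = (Y1 - Y2[0] + 2 * Y2[1]) / 3
    y2 = (Y1 - Y2[0] - Y2[1]) / 3
    return [y0, y1, y2]

def direct3(h, x):
    return [sum(F(h[n]) * F(x[(l - n) % 3]) for n in range(3)) for l in range(3)]

h, x = [1, 2, 3], [4, 5, 6]
assert conv3(h, x) == direct3(h, x)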
3. Fast Convolution Algorithms

The main objective of this chapter is to focus attention on fast algorithms for
the summation of lagged products. Such problems are very common in physics
and are usually related to the computation of digital filtering processes, con-
volutions, and correlations. Correlations differ from convolutions only by virtue
of a simple inversion of one of the input sequences. Thus, although the develop-
ments in this chapter refer to convolutions, they apply equally well to correlations.
The direct calculation of the convolution of two N-point sequences requires
a number of arithmetic operations which is of the order of N^2. For large con-
volutions, the corresponding processing load becomes rapidly excessive and,
therefore, considerable effort has been devoted to devising faster computation
methods. The conventional approach for speeding up the calculation of con-
volutions is based on the fast Fourier transform (FFT) and will be discussed in
Chap. 4. With this approach, the number of operations is of the order of
N log2 N when N is a power of two.
The speed advantage offered by the FFT algorithm can be very large for long
convolutions and the method is, by far, the most commonly used for the fast
computation of convolutions. However, there are several drawbacks to the FFT,
which relate mainly to the use of sines and cosines and to the need for complex
arithmetic, even if the convolutions are real.
In order to overcome the limitations of the FFT method, many other fast
algorithms have been proposed. In fact, the number of such algorithms is so
large that an exhaustive presentation would be almost impossible. Moreover,
many seemingly different algorithms are essentially identical and differ only in
the formalism used to develop a description. In this chapter, we shall attempt to
unify our presentation of these methods by organizing them into algebraic and
arithmetic methods. We shall show that most algebraic methods reduce to
various forms of nesting and yield computational loads that are often equal to
and sometimes less than the FFT method while eliminating some of its limita-
tions. We shall then present arithmetic methods which can be used alone or in
combination with algebraic methods and which allow significant processing
efficiency gains when implemented in special purpose hardware.

3.1 Digital Filtering Using Cyclic Convolutions

Most fast convolution algorithms, such as those based on the FFT, apply only
to periodic functions and therefore compute only cyclic convolutions. However,

practical filtering applications concern essentially the aperiodic convolution y_l
of a limited sequence h_n of length N1 with a quasi-infinite data sequence x_m

y_l = Σ_{n=0}^{N1−1} h_n x_{l−n}.   (3.1)

Thus, in order to take advantage of the various fast cyclic convolution


algorithms to speed-up the calculation of digital filtering processes, we need
some means to convert the aperiodic convolution y, into a series of cyclic con-
volutions. This can be done in two ways. The first method is called the overlap-
add method [3.1]. The second method, which is called the overlap-save method
[3.2], is very similar to the first one and yields comparable results in terms of
computational complexity.

3.1.1 Overlap-Add Algorithm

The overlap-add algorithm, as an initial step, sections the input sequence x_m into
contiguous blocks x_{u+vN2} of equal length N2, with m = u + vN2, u = 0, ...,
N2 − 1, and v = 0, 1, 2, ... for the successive blocks. The aperiodic convolution
of each of these blocks x_{u+vN2} with the sequence h_n is then computed and yields
output sequences y_{l,v} of N1 + N2 − 1 samples. In polynomial notation, calcu-
lating these aperiodic convolutions is equivalent to determining the coefficients
of a polynomial Y_v(z) defined by

Y_v(z) = H(z)X_v(z)   (3.2)

with

H(z) = Σ_{n=0}^{N1−1} h_n z^n   (3.3)

X_v(z) = Σ_{u=0}^{N2−1} x_{u+vN2} z^u   (3.4)

Y_v(z) = Σ_{l=0}^{N1+N2−2} y_{l,v} z^l.   (3.5)

Since Y_v(z) is a polynomial of degree N1 + N2 − 2, it can be computed modulo
any polynomial of degree N ≥ N1 + N2 − 1 and, in particular, modulo (z^N −
1). In this case, the successive aperiodic convolutions y_{l,v} are computed as
circular convolutions of length N, with N ≥ N1 + N2 − 1, in which the input
blocks are augmented by adding N − N1 zero-valued samples at the end of the
sequence h_n and N − N2 zero-valued samples at the end of the sequence x_{u+vN2}.
The overlap-add method derives its name from the fact that the output of
each section overlaps its neighbor by N − N2 samples. These overlapping output

samples must be added to produce the desired y_l. Thus, for N = N1 + N2 − 1,
a continuous digital filtering process is evaluated with one circular convolution
of N points for every N2 output samples plus (N1 − 1)/(N − N1 + 1) additions
per output sample.
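A minimal Python sketch of the overlap-add method, using FFTs for the length-N circular convolutions (the block length N2 = 8 and the function name are arbitrary choices for the example), might read:

# Illustrative sketch (not from the text) of the overlap-add method.

import numpy as np

def overlap_add(h, x, N2):
    N1 = len(h)
    N = N1 + N2 - 1                      # circular convolution length
    H = np.fft.rfft(h, N)                # h padded with N - N1 zeros
    y = np.zeros(len(x) + N)             # slack for the last (shorter) block
    for start in range(0, len(x), N2):
        block = x[start:start + N2]
        Yv = np.fft.irfft(np.fft.rfft(block, N) * H, N)
        y[start:start + N] += Yv         # overlapping outputs are added
    return y[:len(x) + N1 - 1]

h = np.array([1.0, 2.0, 3.0])
x = np.random.rand(50)
assert np.allclose(overlap_add(h, x, N2=8), np.convolve(h, x))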

3.1.2 Overlap-Save Algorithm

The overlap-save algorithm sections the input data sequence into overlapping
blocks x_{u+vN2} of equal length N, with m = u + vN2, u = 0, ..., N − 1, and v
taking the values 0, 1, 2, ... for successive blocks. In this method, each data
block has a length N, instead of N2 for the overlap-add algorithm, and overlaps
the preceding block by N − N2 samples. The output of the digital filter is con-
structed by computing the successive length-N circular convolutions of the
blocks x_{u+vN2} with the block of length N obtained by appending N − N1 zero-
valued samples to h_n. Hence, the output y_{u+vN2} of each circular convolution is
given by

y_{u+vN2} = Σ_{n=0}^{N−1} h_n x_{⟨u−n⟩+vN2},   u = 0, ..., N − 1,  v = 0, 1, ...   (3.6)

(3.7)

where ⟨u − n⟩ is taken modulo N. This means that, for u ≥ n, ⟨u − n⟩ = u −
n and, for u < n, ⟨u − n⟩ = N + u − n. We assume now that N = N1 +
N2 − 1. For u ≥ N1 − 1, ⟨u − n⟩ is always equal to u − n because all samples
h_n are zero-valued for n > N1 − 1. Hence, the last N − N1 + 1 output samples
of each cyclic convolution are valid output samples of the digital filter, while the
first N1 − 1 output samples of the cyclic convolutions must be discarded because
they correspond to interfering intervals.
It can be seen that the overlap-save algorithm produces N − N1 + 1 valid output
samples of the digital filter per circular convolution, without any final addition. Thus, the overlap-save
algorithm is often preferred to the overlap-add algorithm because its implemen-
tation as a computer program is slightly simpler when standard computer
programs are used for the calculation of convolutions.
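A corresponding Python sketch of the overlap-save method might read as follows (again the FFT is used only as a convenient way of computing the length-N circular convolutions, and the parameters are illustrative):

# Illustrative sketch (not from the text) of the overlap-save method with
# N = N1 + N2 - 1; the first N1 - 1 samples of every block output are discarded.

import numpy as np

def overlap_save(h, x, N2):
    N1 = len(h)
    N = N1 + N2 - 1
    H = np.fft.rfft(h, N)
    xp = np.concatenate([np.zeros(N1 - 1), x])       # prepend zeros for the first block
    out = []
    for start in range(0, len(x), N2):
        block = xp[start:start + N]
        if len(block) < N:
            block = np.pad(block, (0, N - len(block)))
        yc = np.fft.irfft(np.fft.rfft(block) * H, N)
        out.append(yc[N1 - 1:])                      # keep the N - N1 + 1 = N2 valid samples
    return np.concatenate(out)[:len(x)]

h = np.array([1.0, 2.0, 3.0])
x = np.random.rand(50)
assert np.allclose(overlap_save(h, x, N2=8), np.convolve(h, x)[:len(x)])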

3.2 Computation of Short Convolutions and Polynomial Products

In many fast convolution algorithms, the calculation of a large convolution is


replaced by that of a large number of small convolutions and polynomial pro-
ducts. This means that the processing load is strongly dependent upon the
efficiency of the algorithms used for the calculation of small convolutions and

polynomial products. In this section, we shall describe several techniques which


allow one to optimize the design of such short algorithms.

3.2.1 Computation of Short Convolutions by The Chinese Remainder Theorem

We have seen in Chap. 2 that the number of multiplications required to compute


a circular convolution is minimized by breaking the computation into that of
polynomial products via the Chinese remainder theorem. This is done by
noting that a circular convolution YI of N terms
y_l = Σ_{n=0}^{N−1} h_n x_{l−n},   l = 0, ..., N − 1,   (3.8)

where h_n and x_m are the two input sequences of length N, can be viewed as a
polynomial product modulo (z^N − 1). In polynomial notation, we have

H(z) = Σ_{n=0}^{N−1} h_n z^n   (3.9)

X(z) = Σ_{m=0}^{N−1} x_m z^m   (3.10)

Y(z) ≡ H(z)X(z) modulo (z^N − 1)   (3.11)

Y(z) = Σ_{l=0}^{N−1} y_l z^l.   (3.12)
For coefficients in the field of rational numbers, z^N − 1 is the product of d
cyclotomic polynomials Pi(z), where d is the number of divisors of N, including
1 and N

z^N − 1 = ∏_{i=1}^{d} Pi(z).   (3.13)

Y(z) is computed by first reducing the input polynomials H(z) and X(z) modulo
Pi(z)

Hi(z) ≡ H(z) modulo Pi(z)   (3.14)

Xi(z) ≡ X(z) modulo Pi(z).   (3.15)

Then, Y(z) is obtained by computing the d polynomial products Hi(z)Xi(z)
modulo Pi(z) and using the Chinese remainder theorem, as shown in Fig. 3.1 for
N prime, to reconstruct Y(z) from the products modulo Pi(z),

Y(z) ≡ Σ_{i=1}^{d} Si(z)Hi(z)Xi(z) modulo (z^N − 1),   (3.16)

[Fig. 3.1. Computation of a length-N circular convolution by the Chinese remainder theorem, N prime: the two input sequences of N terms are reduced modulo (z^N − 1)/(z − 1) and (z − 1), one polynomial multiplication modulo (z^N − 1)/(z − 1) and one scalar multiplication are performed (with Hi(z)/Ri(z) precomputed), and the output y_l is obtained by the Chinese remainder reconstruction.]

where, for each value u of i,

Su(z) ≡ 0 modulo Pi(z), i ≠ u
      ≡ 1 modulo Pu(z)   (3.17)

and

Su(z) = Tu(z)/Ru(z)
      = { [∏_{i≠u} Pi(z)] / ([∏_{i≠u} Pi(z)] modulo Pu(z)) } modulo (z^N − 1).   (3.18)

Except for large values of N, the cyclotomic polynomials Pi(z) are particu-
larly simple, since the coefficients of z can only be 0 or ±1. This means that the
reductions modulo Pi(z) and the Chinese remainder reconstruction are done
without multiplications and with only a limited number of additions. When N is
a prime, for instance, d is equal to 2, and z^N − 1 is the product of the two
cyclotomic polynomials P1(z) = z − 1 and P2(z) = z^{N−1} + z^{N−2} + ... + 1.
Thus, X1 and X2(z) are each computed with N − 1 additions by

X1 = Σ_{m=0}^{N−1} x_m   (3.19)

X2(z) = Σ_{m=0}^{N−2} (x_m − x_{N−1}) z^m.   (3.20)

For the Chinese remainder reconstruction, we note that Su(z) = Tu(z)/Ru(z),
where 1/Ru(z) is defined modulo Pu(z). Thus, the multiplication by 1/Ru(z) can be
combined with the multiplication Hu(z)Xu(z) modulo Pu(z). Moreover, in many
practical cases, one of the input sequences h_n is fixed. In this case Hu(z)/Ru(z) can
be precomputed and the only operations relative to the Chinese remainder re-
construction are the multiplications by Tu(z). For N prime, H1X1/R1 is a scalar
y_{1,0}, while H2(z)X2(z)/R2(z) modulo P2(z) is a polynomial of N − 1 terms with
the coefficients of z^l given by y_{2,l}. Thus, the Chinese remainder reconstruction is
done in this case with 2(N − 1) additions by

Y(z) = (y_{1,0} + y_{2,N−2}) z^{N−1} + y_{1,0} − y_{2,0}
       + Σ_{l=1}^{N−2} (y_{1,0} − y_{2,l} + y_{2,l−1}) z^l.   (3.21)

It can be seen that the Chinese remainder operation requires the same
number of additions as the reductions modulo the various cyclotomic poly-
nomials. This result is quite general and applies to any circular convolution, with
one of the sequences being fixed. Hence, the reductions and Chinese recon-
structions are implemented very simply and the main problem associated with
the computation of convolutions relates to the evaluation of polynomial pro-
ducts modulo the cyclotomic polynomials Pi(z).

3.2.2 Multiplications Modulo Cyclotomic Polynomials

We note first that, since the polynomials Pi(z) are irreducible, the polynomial
products modulo Pi(z) can always be computed by interpolation with 2D − 1
general multiplications, D being the degree of Pi(z) (theorem 2.21). Using this
method for multiplications modulo (z^2 + 1) and (z^3 − 1)/(z − 1) yields algorithms
with 3 multiplications and 3 additions as shown in Sect. 3.7.2.
For longer polynomial products, this method is not practical because it
requires 2D − 1 distinct polynomials z − a_i. The four simplest interpolation
polynomials are z, 1/z, (z − 1), and (z + 1). Thus, when the degree D of Pi(z) is

larger than 2, one must use integers a_i different from 0 and ±1, which implies
that the corresponding reductions modulo (z − a_i) and the Chinese remainder
reconstructions use multiplications by powers of a_i which have to be imple-
mented either with scalar multiplications or by a large number of successive
additions. Thus one is led to depart somewhat from the interpolation method
using real integers, in order to design algorithms with a reasonable balance be-
tween the number of additions and the number of multiplications.
One such technique consists in using complex interpolation polynomials,
such as (z^2 + 1), which are computed with more multiplications than poly-
nomials with real roots but for which the reductions and Chinese remainder
operations remain simple.
Another approach consists in converting one-dimensional polynomial
products into multidimensional polynomial products. If we first assume that we
want to compute the aperiodic convolution YI of two length-N sequences hn and
X m , this corresponds to the simple polynomial product Y(z) defined by

Y(z) = H(z)X(z) (3.22)

Y(z) = Σ_{l=0}^{2N−2} y_l z^l   (3.23)

H(z) = Σ_{n=0}^{N−1} h_n z^n   (3.24)

X(z) = Σ_{m=0}^{N−1} x_m z^m.   (3.25)

The polynomials H(z) and X(z) are of degree N − 1 and have N terms. If N is
composite with N = N1N2, H(z) and X(z) can be converted into two-dimen-
sional polynomials by

n = n1 + N1n2,   n1 = 0, ..., N1 − 1,   n2 = 0, ..., N2 − 1   (3.26)

m = m1 + N1m2,   m1 = 0, ..., N1 − 1,   m2 = 0, ..., N2 − 1   (3.27)

H(z, z1) = Σ_{n2=0}^{N2−1} Σ_{n1=0}^{N1−1} h_{n1+N1n2} z^{n1} z1^{n2}   (3.28)

X(z, z1) = Σ_{m2=0}^{N2−1} Σ_{m1=0}^{N1−1} x_{m1+N1m2} z^{m1} z1^{m2}   (3.29)

Y(z, z1) = H(z, z1)X(z, z1)   (3.30)

with

z1 = z^{N1}.   (3.31)

Thus, Y(z) is computed by evaluating a two-dimensional product Y(z, z1) and
reconstructing Y(z) from Y(z, z1) with z1 = z^{N1}

Y(z) = Y(z, z^{N1}).   (3.33)

This operation is equivalent to the computation of an aperiodic convolution of
length-N1 input sequences in which the scalars are replaced by polynomials of
N2 terms and each multiplication is replaced by the aperiodic convolution of two
length-N2 sequences. Thus, if M1 and M2 are the number of multiplications
required to calculate the aperiodic convolutions of sequences of length N1 and
N2, the aperiodic convolution of the two length-N sequences is evaluated with M
multiplications where

M = M1M2.   (3.34)

This approach can be used recursively to cover the case of more than two dimen-
sions and it has the advantage of breaking down the computation of large poly-
nomial products into that of smaller polynomial products. Hence, a polynomial
product modulo a cyclotomic polynomial P(z) of degree N can be computed with
a multidimensional aperiodic convolution followed by a reduction modulo P(z).
In many instances, P(z) can be converted easily into a multidimensional
polynomial by simple transformations such as P(z) = P2(z1), z1 = z^{N1}. A cyclo-
tomic polynomial P(z) = (z^9 − 1)/(z^3 − 1) = z^6 + z^3 + 1 can be viewed, for ex-
ample, as a polynomial P2(z1) = z1^2 + z1 + 1 in which z^3 is substituted for z1.
In these cases, the multidimensional approach can be refined by calculating the
polynomial product modulo P(z) as a two-dimensional polynomial product
modulo (z^{N1} − z1), P2(z1). With this method, a polynomial multiplication modulo
(z^4 + 1) is implemented with 9 multiplications and 15 additions (Sect. 3.7.2) by
computing a polynomial product modulo (z^2 − z1) on polynomials defined modulo
(z1^2 + 1). This is a significant improvement over the direct computation by inter-
polation of the same polynomial multiplication modulo (z^4 + 1), which requires
7 multiplications and 41 additions.
The main advantage of this approach stems from the fact that the polynomial
multiplications modulo (z^{N1} − z1) can be computed by interpolation on powers
of z1. More precisely, since (z^{N1} − z1) is an irreducible polynomial of degree N1, a
polynomial multiplication modulo (z^{N1} − z1) can be evaluated with 2N1 − 1 general
multiplications by interpolation on 2N1 − 1 distinct points (theorem 2.21). The
two simplest interpolation points are z = 0 and 1/z = 0. For the 2N1 − 3 remain-
ing interpolation points, we note that, if the degree N2 of the polynomial P2(z1)
is larger than N1, the 2N1 points ±1, ..., ±z1^{N1−1} are all distinct. Thus, for
N1 ≤ N2 the 2N1 − 3 remaining interpolation points can be chosen to be powers
of z1. Under these conditions, the interpolation is given by

(3.35)
(3.36)

(3.37)

with similar relations for Bk(z) corresponding to H(z). Hence, Y(z) is computed
by evaluating Ak(z), Bk(z), calculating the 2N1 − 1 polynomial multiplications
Ak(z)Bk(z) modulo P2(z), and reconstructing Y(z) by the Chinese remainder
theorem. In (3.35), the multiplications by powers of z1 correspond to simple
shifts of the input polynomials, followed by reductions modulo P2(z). When
P2(z) = z1^{N2} + 1, these operations reduce to a circular shift of the
words of the input polynomials, followed by a simple sign inversion of the over-
flow words. Thus, (3.35) is calculated without multiplications and, when N1 is
composite, the number of additions can be reduced by use of an FFT-type algo-
rithm. In particular, if N1 is a power of 2 and P2(z) = z1^{N2} + 1, with N2 = 2^t,
the reductions corresponding to the computation modulo (z^{N1} − z1) can be computed
with only N2(2N1 log2 N1 − 5) additions. We shall see now that, when one of the
input sequences is fixed, the Chinese remainder reconstruction can be done
with approximately the same number of additions.

3.2.3 Matrix Exchange Algorithm

If we consider a polynomial multiplication modulo a polynomial P(z) of


degree N and with M scalar multiplications, the corresponding algorithm for
such a process can usually be viewed as a computation by rectangular trans-
forms,

H = Eh   (3.38)

X = Fx   (3.39)

Y = H ⊙ X   (3.40)

y = GY,   (3.41)

where h and x are column vectors of the input sequences h_n and x_m, E and F are
the input matrices of size M × N, ⊙ denotes the element-by-element product,
G is the output matrix of size N × M, and y is a column vector of the output
sequence y_l. When the algorithm is designed by an interpolation method, the
matrices E and F correspond to the various reductions modulo (z − a_i) while the
matrix G corresponds to the Chinese remainder reconstruction. Thus, Eh, Fx,
and GY are computed without multiplications, while H ⊙ X is evaluated with
M multiplications. Since the Chinese remainder reconstruction can be viewed
as the inverse operation of the reductions, it is roughly equivalent in complexity
to the two input sequence reductions. Consequently, the matrix G is usually
about twice as complex as matrices E and F. In most practical filtering applica-
tions, one of the input sequences, h_n, is constant so that H is precomputed and
stored. Thus, the number of additions for the algorithm is dependent upon the
complexity of the F and G matrices but not upon that of the E matrix. In this circum-
stance, it is highly desirable to permute the E and G matrices in order to reduce the
number of additions. As a first step toward doing this, we express the process
given by (3.38-41) in the form

y_l = Σ_{k=0}^{M−1} g_{k,l} (Σ_{n=0}^{N−1} e_{n,k} h_n)(Σ_{m=0}^{N−1} f_{m,k} x_m),   (3.42)

where e_{n,k}, f_{m,k}, and g_{k,l} are, respectively, the elements of matrices E, F, and G.
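A small Python sketch of this rectangular-transform form for a 2-point circular convolution (M = 2) might read as follows; the matrices chosen below are one valid set, not necessarily those tabulated in Sect. 3.7.1:

# Illustrative sketch (not from the text): y = G((Eh) o (Fx)), eq. (3.42),
# and a numerical check of the circular convolution condition (3.43).

import numpy as np

E = np.array([[1, 1], [1, -1]])            # M x N, applied to h
F = np.array([[1, 1], [1, -1]])            # M x N, applied to x
G = np.array([[0.5, 0.5], [0.5, -0.5]])    # N x M, Chinese remainder reconstruction

def bilinear_conv(h, x):
    return G @ ((E @ h) * (F @ x))

h = np.array([3.0, 5.0])
x = np.array([2.0, 7.0])
direct = np.array([h[0]*x[0] + h[1]*x[1], h[0]*x[1] + h[1]*x[0]])
assert np.allclose(bilinear_conv(h, x), direct)

N, M = 2, 2
for m in range(N):
    for n in range(N):
        for l in range(N):
            S = sum(E[k, m] * F[k, n] * G[l, k] for k in range(M))
            assert np.isclose(S, 1.0 if (m + n - l) % N == 0 else 0.0)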
For a circular convolution algorithm modulo P(z) = z^N − 1, we must satisfy the
condition

S = Σ_{k=0}^{M−1} e_{m,k} f_{n,k} g_{k,l} = 1   if m + n − l ≡ 0 modulo N

  = 0   if m + n − l ≢ 0 modulo N.   (3.43)

Similarly, polynomial product algorithms modulo (z^N + 1) or (z^N − z1) corre-
spond, respectively, to the conditions

S = 1   if m + n − l = 0
S = −1 or z1   if m + n − l = N
S = 0   if m + n − l ≢ 0 modulo N.   (3.44)

We now replace the matrices E and G by matrices E′ and G′, with elements,
respectively, g_{k,N−m} and e_{N−l,k}. Subsequently, S becomes S′ and (3.43) obviously
implies

S′ = Σ_{k=0}^{M−1} e_{N−l,k} f_{n,k} g_{k,N−m} = 1   if m + n − l ≡ 0 modulo N

   = 0   if m + n − l ≢ 0 modulo N.   (3.45)

Thus, as pointed out in [3.3], the convolution property still holds when the
matrices E and G are exchanged, with simple transposition and rearrangement of
rows and columns. The same general approach can also be used for polynomial
products modulo (z^N + 1) or (z^N − z1). However, in these cases, the conditions
S = 1 and S = −1 or z1 in (3.44) are exchanged for m or l = 0. Thus, the
elements of E′ and G′ must be g_{k,N−m} for m ≠ 0, −g_{k,N−m} or (1/z1)g_{k,N−m} for m = 0,
and e_{N−l,k} for l ≠ 0, −e_{N−l,k} or z1e_{N−l,k} for l = 0. This approach has been used
for the polynomial product algorithm modulo (z^9 − 1)/(z^3 − 1) given in Sect. 3.7.2.
Using this method for polynomial products modulo (z^{2^t} + 1), with (z^{N1} − z1),
(z1^{N2} + 1) and N1 = 2^{t1} ≤ N2 = 2^{t2}, yields (N1 − 6 + 2N1 log2 N1)N2 additions
for the Chinese remainder reconstruction and therefore (N1 − 11 + 4N1 log2 N1)
N2 additions for the computation of the polynomial product modulo (z^{N1N2} + 1).

Table 3.1. Number of multiplications and additions for short cyclic convolution algorithms.

Convolution size N   Number of multiplications M   Number of additions A
 2                     2                              4
 3                     4                             11
 4                     5                             15
 5                    10                             31
 5                     8                             62
 7                    16                             70
 8                    14                             46
 8                    12                             72
 9                    19                             74
16                    35                            155
16                    33                            181

Table 3.2. Number of arithmetic operations for various polynomial products modulo P(z).

Ring P(z)                              Degree of P(z)   Number of multiplications   Number of additions
z^2 + 1                                      2                  3                        3
(z^3 − 1)/(z − 1)                            2                  3                        3
z^4 + 1                                      4                  9                       15
z^4 + 1                                      4                  7                       41
(z^5 − 1)/(z − 1)                            4                  9                       16
(z^5 − 1)/(z − 1)                            4                  7                       46
(z^7 − 1)/(z − 1)                            6                 15                       53
(z^9 − 1)/(z^3 − 1)                          6                 15                       39
z^8 + 1                                      8                 21                       77
(z1^2 + z1 + 1)(z2^5 − 1)/(z2 − 1)           8                 21                       83
(z1^4 + 1)(z2^2 + z2 + 1)                    8                 21                       76
(z1^2 + 1)(z2^5 − 1)/(z2 − 1)                8                 21                       76
z^16 + 1                                    16                 63                      205
(z^27 − 1)/(z^9 − 1)                        18                 75                      267
z^32 + 1                                    32                147                      739
z^64 + 1                                    64                315                     1899

We list the details of a number of frequently used algorithms for the short
convolutions and polynomial products in Sects. 3.7.1 and 3.7.2. Tables 3.1 and
3.2 summarize the corresponding number of operations for these algorithms and
others. We have optimized these algorithms to favor a reduction of the number
of multiplications. Thus, the algorithms lend themselves to efficient implementa-
tion on computers in which the multiplication execution time is much greater
than that for addition and subtraction. When multiply execution time is about
the same as addition, it is preferable to use other polynomial product algorithms
in which the number of additions is reduced at the expense of an increased num-
ber of multiplications.

3.3 Computation of Large Convolutions by Nesting of


Small Convolutions

For large convolutions, the algorithms derived from interpolation methods


become complicated and inefficient. We shall show here that the computation of
long cyclic convolutions is greatly simplified by using a one-dimensional to
multidimensional mapping suggested by Good [3.4], combined with a nesting
approach proposed by Agarwal and Cooley [3.5].

3.3.1 The Agarwal-Cooley Algorithm

The Agarwal-Cooley method requires that the length N of the cyclic convolution
y_l must be a composite number with mutually prime factors. In the following,
we shall assume that N = N1N2 with (N1, N2) = 1. The convolution y_l is given
by the familiar expression

y_l = Σ_{n=0}^{N−1} h_n x_{l−n},   l = 0, ..., N − 1.   (3.46)

Since N1 and N2 are mutually prime and the indices l and n are defined modulo
N1N2, a direct consequence of the Chinese remainder theorem (theorem 2.1) is
that l and n can be mapped into two sets of indices l1, n1 and l2, n2, with

l ≡ N1l2 + N2l1 modulo N1N2,   l2, n2 = 0, ..., N2 − 1
n ≡ N1n2 + N2n1 modulo N1N2,   l1, n1 = 0, ..., N1 − 1.   (3.47)

Thus, the one-dimensional convolution y_l becomes a two-dimensional convolu-
tion of size N1 × N2

y_{N1l2+N2l1} = Σ_{n1=0}^{N1−1} Σ_{n2=0}^{N2−1} h_{N1n2+N2n1} x_{N1(l2−n2)+N2(l1−n1)}.   (3.48)
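A Python sketch of this index mapping for N1 = 4 and N2 = 3 might read as follows (the check at the end verifies (3.48) against the direct one-dimensional convolution; the array names are illustrative):

# Illustrative sketch (not from the text): the CRT index map (3.47) turning a
# length N1*N2 cyclic convolution into an N1 x N2 two-dimensional one.

import numpy as np

def cyclic(h, x):
    N = len(h)
    return [sum(h[n] * x[(l - n) % N] for n in range(N)) for l in range(N)]

N1, N2 = 4, 3
N = N1 * N2
rng = np.random.default_rng(0)
h = rng.integers(0, 10, N)
x = rng.integers(0, 10, N)

# index n goes to position (n1, n2) with n = N1*n2 + N2*n1 modulo N
H2 = np.zeros((N1, N2), dtype=int)
X2 = np.zeros((N1, N2), dtype=int)
for n1 in range(N1):
    for n2 in range(N2):
        H2[n1, n2] = h[(N1 * n2 + N2 * n1) % N]
        X2[n1, n2] = x[(N1 * n2 + N2 * n1) % N]

# two-dimensional cyclic convolution of size N1 x N2
Y2 = np.zeros((N1, N2), dtype=int)
for l1 in range(N1):
    for l2 in range(N2):
        Y2[l1, l2] = sum(H2[n1, n2] * X2[(l1 - n1) % N1, (l2 - n2) % N2]
                         for n1 in range(N1) for n2 in range(N2))

# the same map recovers the one-dimensional result, as stated by (3.48)
y = cyclic(list(h), list(x))
for l1 in range(N1):
    for l2 in range(N2):
        assert Y2[l1, l2] == y[(N1 * l2 + N2 * l1) % N]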

In polynomial notation, this convolution can be viewed as a one-dimensional
polynomial convolution of length N1

H_{n1}(z) = Σ_{n2=0}^{N2−1} h_{N1n2+N2n1} z^{n2}   (3.49)

X_{m1}(z) = Σ_{m2=0}^{N2−1} x_{N1m2+N2m1} z^{m2}   (3.50)

Y_{l1}(z) = Σ_{l2=0}^{N2−1} y_{N1l2+N2l1} z^{l2}   (3.51)

Y_{l1}(z) ≡ Σ_{n1=0}^{N1−1} H_{n1}(z) X_{l1−n1}(z) modulo (z^{N2} − 1).   (3.52)

Each polynomial multiplication H_{n1}(z)X_{l1−n1}(z) modulo (z^{N2} − 1) corresponds to
a convolution of length N2 which is computed with M2 scalar multiplications and
A2 scalar additions. Thus, the convolution of length N1N2 is computed by (3.52)
as a convolution of length N1 in which each scalar is replaced by a polynomial of
N2 terms and each multiplication is replaced by a convolution of length N2.
Under these conditions, if M1 and A1 are the number of multiplications and
additions required to compute a convolution of length N1, the number of mul-
tiplications M and additions A corresponding to the convolution of length N1N2
reduces to

M = M1M2   (3.53)

A = A1N2 + M1A2.   (3.54)

By permuting the roles of N1 and N2, the same convolution could have been com-
puted as a convolution of N2 terms in which the scalars would have been replaced
by polynomials of N1 terms. In this case, the number of multiplications would be
the same, but the number of additions would be A2N1 + M2A1. Thus, the
number of additions for the nesting algorithm is dependent upon the order of the
operations. If the first arrangement gives fewer additions, we must have

A1N2 + M1A2 ≤ A2N1 + M2A1   (3.55)

or

(M1 − N1)/A1 ≤ (M2 − N2)/A2.   (3.56)

Therefore, the convolution to perform first is the one for which the quantity
(Mi − Ni)/Ai is smaller.
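A few lines of Python suffice to evaluate (3.53)-(3.56) from the entries of Table 3.1; the example below reproduces the N = 20 entry of Table 3.3 (the dictionary of short algorithms is only a partial, illustrative copy of the table):

# Illustrative sketch (not from the text): nesting operation counts (3.53)-(3.56).

def nest(stage1, stage2):
    """stage = (N, M, A); stage1 is taken as the outer (first) convolution."""
    (N1, M1, A1), (N2, M2, A2) = stage1, stage2
    return N1 * N2, M1 * M2, A1 * N2 + M1 * A2

short = {2: (2, 2, 4), 3: (3, 4, 11), 4: (4, 5, 15), 5: (5, 10, 31)}

# perform first the factor with the smaller (M - N)/A, as stated by (3.56)
s4, s5 = short[4], short[5]
first, second = (s4, s5) if (s4[1] - s4[0]) / s4[2] <= (s5[1] - s5[0]) / s5[2] else (s5, s4)
print(nest(first, second))   # (20, 50, 230), as in Table 3.3 for N = 20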
When N is the product of more than two relatively prime factors N1, N2,
N3, ..., Nd, the same nesting approach can be used recursively by converting the
convolution of length N into a convolution of size N1 × N2 × N3 × ... × Nd and
by computing this multidimensional convolution as a convolution of length N1
of arrays of size N2 × N3 × ... × Nd, where all multiplications are replaced by
convolutions of size N2 × N3 × ... × Nd. The same process is repeated on the
arrays of sizes N2 × N3 × ... × Nd, N3 × ... × Nd, until all convolutions are redu-
ced to M1M2 ... M_{d−1} convolutions of length Nd, where the Mi are the number
of multiplications corresponding to a convolution of Ni terms. Under these
conditions, and assuming that Ai is the number of additions for a length-Ni
one-dimensional convolution, the total number of operations for the convolu-
tion of dimension N1N2 ... Nd computed by the nesting algorithm becomes

M = M1M2 ... Md   (3.57)

A = A1N2 ... Nd + M1A2N3 ... Nd + M1M2A3N4 ... Nd + ...
    + M1M2 ... M_{d−1}Ad.   (3.58)

It should be noted that in these formulas the number of additions Ai correspond-
ing to each convolution of length Ni contributes to only a fraction of the total
number of additions A, while the number of multiplications Mi is a direct factor
of M. This means that for large convolutions, it is generally advantageous to
select short convolution algorithms minimizing the number of multiplications,
even if this minimization is done at the expense of a relatively large increase in
the number of additions. This is illustrated by considering as an example a convolu-
tion of length 8. Such a convolution can be computed with two different algori-
thms, one with 14 multiplications and 46 additions, and the other with 12 mul-
tiplications and 72 additions. When these algorithms are used for calculating a
simple convolution of 8 terms, the second algorithm is obviously less interesting
than the first one, since it saves only 2 multiplications at the expense of 26 more
additions. However, when a convolution of length 63 is nested with a convolu-
tion of length 8 to compute a convolution of 504 terms, the situation is com-
pletely reversed: using the first length-8 algorithm yields 4256 multiplications
and 28240 additions, as opposed to 3648 multiplications and 26304 additions
when the second length-8 algorithm is used.
Table 3.3 summarizes the number of arithmetic operations for a variety of
one-dimensional cyclic convolutions computed via the Agarwal-Cooley algo-
rithm by nesting the short convolutions corresponding to Table 3.1. It can be
seen by comparison with Table 4.6 that the nesting approach yields a smaller
number of operations than the conventional method using FFTs for short and
medium-sized convolutions up to a length of about 200. One significant advant-
age of the nesting method over the FFT algorithms is that it does not use trigono-
metric functions and that real convolutions are computed with real arithmetic
instead of complex arithmetic.
Multidimensional convolutions can also be calculated by the nesting algo-
rithm. A convolution of size N1N2 × N1N2 can, for instance, be calculated as a
convolution of size N1 × N1 in which each scalar is replaced by an array of size

Table 3.3. Number of arithmetic operations for one-dimensional convolutions computed by


the Agarwal-Cooley nesting algorithm.

Convolution size N   Number of multiplications M   Number of additions A   Multiplications per point M/N   Additions per point A/N

18 38 184 2.11 10.22


20 50 230 2.50 11.50
24 56 272 2.33 11.33
30 80 418 2.67 13.93
36 95 505 2.64 14.03
60 200 1120 3.33 18.67
72 266 1450 3.69 20.14
84 320 2100 3.81 25.00
120 560 3096 4.67 25.80
180 950 5470 5.28 30.39
210 1280 7958 6.10 37.90
360 2280 14748 6.33 40.97
420 3200 20420 7.62 48.62
504 3648 26304 7.24 52.19
840 7680 52788 9.14 62.84
1008 10032 71265 9.95 70.70
1260 12160 95744 9.65 75.99
2520 29184 241680 11.58 95.90

Table 3.4. Number of arithmetic operations for two-dimensional convolutions computed by


the Agarwal-Cooley algorithm.

Convolution size N × N   Number of multiplications M   Number of additions A   Multiplications per point M/N²   Additions per point A/N²

3 x 3 16 77 1.78 8.56
4 x 4 25 135 1.56 8.44
5 x 5 100 465 4.00 18.60
7 x 7 256 1610 5.22 32.86
8 x 8 196 1012 3.06 15.81
9 x 9 361 2072 4.46 25.58
12 x 12 400 3140 2.78 21.81
16 x 16 1225 7905 4.79 30.88
20 x 20 2500 15000 6.25 37.50
30 x 30 6400 41060 7.11 45.62
36 x 36 9025 62735 6.96 48.41
40 x 40 19600 116440 12.25 72.77
60 x 60 40000 264500 11.11 73.47
72x72 70756 488084 13.65 94.15
80 x 80 122500 767250 19.14 119.88
120 x 120 313600 1986240 21.78 137.93

N2 X N2 and each multiplication is replaced by a convolution of size N2 X N 2.


The number of arithmetic operations for various convolutions computed by this
method and the short algorithms of Sect. 3.7.1 is given in Table 3.4. We shall
see in Chap. 5 that the computation of multidimensional convolutions by a nes-
ting method plays a key role in the calculation of DFTs by the Winograd algo-
rithm. When the multidimensional convolutions have common factors in several
dimensions, the multidimensional polynomial rings which correspond to these
convolutions have special properties that can be used to simplify the calculations.
In this case, the nesting method, which does not exploit these properties, be-
comes relatively inefficient and should be replaced by a polynomial transform
approach, as discussed in Chap. 6.

3.3.2 The Split Nesting Algorithm

As indicated in the previous section, the nesting method has many desirable
features for the evaluation of convolutions. The method is particularly attractive
for short- and medium-length convolutions.
The main drawbacks of the nesting approach relate to the use of several rela-
tively prime moduli for indexing the data, the excessive number of additions for
large convolutions, and the amount of memory required for storing the short
convolution algorithm. The first point is intrinsic to the nesting method, since
one-dimensional convolutions are converted into multidimensional convolutions
only if N is the product of several relatively prime factors. If N does not satisfy
this condition, a one-dimensional to multidimensional mapping is feasible only
at the expense of a length increase of the input sequences which translates into an
increased number of arithmetic operations, as will be shown in Sect. 3.4. Thus,
one cannot hope to eliminate relatively prime moduli in the computation of one-
dimensional convolutions by the Agarwal-Cooley nesting approach.
However, the impact of the other limitations concerning storage require-
ments and the number of additions can be relieved by replacing the convolution
nesting process with a nesting of polynomial products. In this method, which is
called split nesting [3.6], the short convolutions of the Agarwal-Cooley method
are computed as polynomial products. We shall restrict our discussion of this
method to convolutions of length N 1N 2, with Nl and N2 distinct odd primes,
since all other cases can be deduced easily from this simple example.
Since N1 is prime, z^{N1} − 1 factors into the two cyclotomic polynomials (z − 1)
and P2(z) = z^{N1−1} + z^{N1−2} + ... + 1. The cyclic convolution y_l of two N1-
point sequences h_n and x_m can be computed as a polynomial product modulo
(z^{N1} − 1) with

H(z) = Σ_{n=0}^{N1−1} h_n z^n   (3.59)

X(z) = Σ_{m=0}^{N1−1} x_m z^m   (3.60)

Y(z) = Σ_{l=0}^{N1−1} y_l z^l ≡ H(z)X(z) modulo (z^{N1} − 1).   (3.61)

Using the Chinese remainder theorem, Y(z) is calculated as shown in Fig. 3.1
by reducing H(z) and X(z) modulo P2(z) and (z − 1) to H2(z), H1 and X2(z), X1,
respectively, computing the polynomial products H2(z)X2(z) modulo P2(z) and
H1X1 modulo (z − 1), and reconstructing Y(z) by

Y(z) ≡ S1(z)H1X1 + S2(z)H2(z)X2(z) modulo (z^{N1} − 1)   (3.62)

S2(z) ≡ 1, S1(z) ≡ 0 modulo P2(z)

S2(z) ≡ 0, S1(z) ≡ 1 modulo (z − 1).   (3.63)

The reductions and the Chinese remainder operations always have the same
structure, regardless of the particular numerical value of N1. Therefore, these
operations, when implemented in a computer, need not be stored as individual
procedures for each value of N1, N2, ..., but can be defined by a single program
structure. In particular, the reductions modulo P2(z) and (z − 1) are computed
with N1 − 1 additions each by

X1 = Σ_{m=0}^{N1−1} x_m   (3.64)

X2(z) = Σ_{m=0}^{N1−2} (x_m − x_{N1−1}) z^m.   (3.65)

Thus, the only part of each convolution algorithm which needs to be indivi-
dually stored is that corresponding to the polynomial products. With this
method, the savings in storage can be quite significant. If we consider for
example a simple convolution of 15 terms, the calculation can be performed by
nesting a convolution of 3 terms with a convolution of length 5. Since the 3-point
and 5-point convolutions are computed, respectively, with 11 additions and 31
additions, a typical computer program would require about 42 instructions to
implement the short convolution algorithms in the conventional nesting method.
Alternatively, if the 3-point and 5-point convolutions are computed as polyno-
mial products, the calculation breaks down into scalar multiplications and poly-
nomial multiplications modulo (z^3 - 1)/(z - 1) and modulo (z^5 - 1)/(z - 1). Since
these two polynomial products are calculated, respectively, with 3 additions and
16 additions, a program for the split nesting approach would require a general
purpose program for reductions and Chinese remainder reconstruction modulo
a prime number, plus about 19 instructions to implement the two polynomial
products.
The implementation of a convolution of length N1N2 by split nesting can be
conveniently described using a polynomial representation. As a first step (similar
to that used with conventional nesting), the convolution y_l of length N1N2 is
converted by index mapping into a two-dimensional convolution of size N1 × N2, as shown
by (3.48). This two-dimensional convolution can be viewed as the product
modulo (z1^{N1} - 1), (z2^{N2} - 1) of the two-dimensional polynomials H(z1, z2) and
X(z1, z2) given by

(3.66)

(3.67)

Y(z1, z2) ≡ H(z1, z2)X(z1, z2) modulo (z1^{N1} - 1), (z2^{N2} - 1).    (3.68)

Since N1 and N2 are primes, z1^{N1} - 1 and z2^{N2} - 1 factor, respectively, into the
cyclotomic polynomials (z1 - 1), P2(z1) = z1^{N1-1} + z1^{N1-2} + ... + 1 and (z2 - 1),
P2(z2) = z2^{N2-1} + z2^{N2-2} + ... + 1. Y(z1, z2) can therefore be computed by a
Chinese remainder reconstruction from polynomial products defined modulo
these cyclotomic polynomials with

modulo (z1^{N1} - 1), (z2^{N2} - 1)    (3.69)

S2,2(z1, z2) ≡ 1,  S2,1(z1, z2), S1,2(z1, z2), S1,1(z1, z2) ≡ 0 modulo P2(z1), P2(z2)
S1,2(z1, z2) ≡ 1,  S1,1(z1, z2), S2,1(z1, z2), S2,2(z1, z2) ≡ 0 modulo (z1 - 1), P2(z2)
S2,1(z1, z2) ≡ 1,  S1,1(z1, z2), S1,2(z1, z2), S2,2(z1, z2) ≡ 0 modulo P2(z1), (z2 - 1)
S1,1(z1, z2) ≡ 1,  S2,1(z1, z2), S1,2(z1, z2), S2,2(z1, z2) ≡ 0 modulo (z1 - 1), (z2 - 1),    (3.70)

where

X2,2(z1, z2) ≡ X(z1, z2) modulo P2(z1), P2(z2)
X1,2(z2) ≡ X(z1, z2) modulo (z1 - 1), P2(z2)
X2,1(z1) ≡ X(z1, z2) modulo P2(z1), (z2 - 1)
X1,1 ≡ X(z1, z2) modulo (z1 - 1), (z2 - 1)    (3.71)

and similar relations for H_{i,k}(z1, z2). A detailed representation of the convolution

of N1N2 points is given in Fig. 3.2. As shown, the procedure includes a succes-
sion of reductions modulo (z1 - 1), (z2 - 1), P2(z1), P2(z2) followed by one
scalar multiplication, two polynomial multiplications modulo P2(z1) and modulo
P2(z2), and one two-dimensional polynomial multiplication modulo P2(z1),
P2(z2). The convolution product is then computed from these polynomial prod-
ucts by a series of Chinese remainder reconstructions. In this approach, the
polynomial product modulo P2(z1), P2(z2) is computed by a nesting technique
identical to that used for ordinary two-dimensional convolutions, with a poly-
nomial multiplication modulo P2(z1) in which all scalars are replaced by a poly-
nomial of N2 - 1 terms and in which all scalar multiplications are replaced by
polynomial multiplications modulo P2(z2).

[Fig. 3.2. Split nesting computation of a convolution of length N1N2, with N1, N2 odd primes:
after ordering, the N1 × N2 input array is reduced modulo (z2^{N2} - 1)/(z2 - 1) and (z2 - 1),
then modulo (z1^{N1} - 1)/(z1 - 1) and (z1 - 1), yielding X2,2(z1, z2), X2,1(z1), X1,2(z2), and X1,1;
these are processed by one two-dimensional polynomial multiplication, two polynomial
multiplications, and one scalar multiplication, and the output is rebuilt by Chinese
remainder reconstructions and reordering]

We shall see now that the split nesting method reduces the number of ad-
ditions. Assuming that the short convolutions of lengths N1 and N2 are computed,
respectively, with A1 additions, M1 multiplications and A2 additions, M2 multi-
plications, the polynomial products modulo P2(z1) and P2(z2) are calculated with
M1 - 1 and M2 - 1 multiplications while the number of additions breaks down
into

A1 = A1,1 + A2,1

A2 = A1,2 + A2,2,    (3.72)

where A2,1 and A2,2 are the additions corresponding to the polynomial prod-
ucts and A1,1, A1,2 are the additions corresponding to the reductions and the
Chinese remainder operations. Since the number of multiplications for the
polynomial product modulo P2(z1), P2(z2) computed by nesting is (M1 - 1)
(M2 - 1), the total number of multiplications M for the convolution of dimen-
sion N1N2 is given by

M = M1M2.    (3.73)

The computation of the polynomial product modulo P2(z1), P2(z2) is done with
(N2 - 1)A2,1 + (M1 - 1)A2,2 additions. Hence, the total number of additions A
for the convolution of length N1N2 computed by split nesting reduces to

A = N2A1 + N1A1,2 + M1A2,2.    (3.74)

Since M1 > N1 and N1A1,2 + M1A2,2 < M1(A1,2 + A2,2) = M1A2, it can be
seen, by comparison with (3.54), that splitting the computations saves (M1 - N1)
A1,2 additions over the conventional nesting method, while using the same
number of multiplications.
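To make the counting concrete, here is a small Python helper, given as an illustration only, that evaluates (3.73) and (3.74) from the counts of the two short algorithms; the argument names follow the notation above.

# Sketch: split-nesting operation counts following (3.73) and (3.74).
# A2 = A1_2 + A2_2 splits the length-N2 additions into reduction/Chinese
# remainder additions and polynomial-product additions.
def split_nesting_counts(N1, M1, A1, N2, M2, A1_2, A2_2):
    M = M1 * M2                            # (3.73)
    A = N2 * A1 + N1 * A1_2 + M1 * A2_2    # (3.74)
    return M, A

# Example with the counts quoted earlier for N1 = 3, N2 = 5: the 3-point
# convolution uses 4 multiplications, 11 additions; the 5-point convolution
# uses 10 multiplications, 31 additions, of which 16 belong to the polynomial
# product modulo (z^5 - 1)/(z - 1).
print(split_nesting_counts(3, 4, 11, 5, 10, 31 - 16, 16))   # -> (40, 164)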
The split nesting technique can be applied recursively to larger convolutions
of length NlN2N3 ... Nd and, with slight modifications, to factors that are powers
of primes. In Table 3.5, we give the number of arithmetic operations for one-
dimensional convolutions computed by split nesting of the polynomial product
algorithms corresponding to Table 3.2. It is seen, by comparing with conven-
tional nesting (Table 3.3), that the split nesting method reduces the number of
additions by about 25% for large convolutions.
In the split nesting method, the computation breaks down into the calcula-
tion of one-dimensional and multidimensional polynomial products, where the
latter products are evaluated by nesting. We have shown previously in Sect. 3.2.2
that such multidimensional polynomial products can be computed more ef-
ficiently by multidimensional interpolation than by nesting. Thus, additional
computational savings are possible, at the expense of added storage require-
ments, if the split nesting multidimensional polynomial product algorithms are
designed by interpolation.

Table 3.5. Number of arithmetic operations for cyclic convolutions computed by the split
nesting algorithm.

Convolution    Number of          Number of    Multiplications    Additions
size           multiplications    additions    per point          per point
N              M                  A            M/N                A/N

18 38 184 2.11 10.22


20 50 218 2.50 10.90
24 56 244 2.33 10.17
30 80 392 2.67 13.07
36 95 461 2.64 12.81
60 200 964 3.33 16.07
72 266 1186 3.69 16.47
84 320 1784 3.81 21.24
120 560 2468 4.67 20.57
180 950 4382 5.28 24.34
210 1280 6458 6.10 30.75
360 2280 11840 6.33 32.89
420 3200 15256 7.62 36.32
504 3648 21844 7.24 43.34
840 7680 39884 9.14 47.48
1008 10032 56360 9.95 55.91
1260 12160 72268 9.65 57.36
2520 29184 190148 11.58 75.46

For instance, the evaluation of a convolution of 120 points involves the
calculation of a polynomial product modulo (z1^4 + 1), (z2^5 - 1)/(z2 - 1), (z3^2 +
z3 + 1). Such a polynomial product can be computed with 189 multiplications
and 640 additions instead of 243 multiplications and 641 additions, if it is com-
puted by nesting a polynomial product modulo (z2^5 - 1)/(z2 - 1) with the
polynomial product modulo (z1^4 + 1), (z3^2 + z3 + 1) designed by interpolation
(Table 3.2). With this approach, a convolution of 120 terms is evaluated with
only 506 multiplications and 2467 additions as opposed to 560 multiplications
and 2468 additions for the conventional split nesting method. The price to be
paid for these savings is the additional memory required for storing the poly-
nomial product algorithm modulo (z1^4 + 1), (z3^2 + z3 + 1).

3.3.3 Complex Convolutions

Complex convolutions can be computed via nesting methods by simply replacing
real arithmetic with complex arithmetic. In this case, if M and A are the number
of real multiplications and additions corresponding to a real convolution of N
terms, the number of real multiplications and additions becomes 4M and 2A
+ 2M for a complex convolution of N terms. If complex multiplication is imple-
mented with an algorithm using 3 multiplications and 3 additions, as shown in
Sect. 3.7.2, the number of operations reduces to 3M multiplications and 2A
+ 3M additions.
However, a more efficient method can be devised by taking advantage of the
fact that j = √-1 is real in certain fields [3.7]. In the case of fields of poly-
nomials modulo (z^q + 1), with q even, z^q ≡ -1 and j is congruent to z^{q/2}. Thus,
for a complex convolution of length N = N1N2, with N1 odd and N2 = 2^t, with
the use of the index mapping given by (3.47-52), one can define a complex one-
dimensional polynomial convolution by

Y_{l1}(z) + jŶ_{l1}(z) ≡ Σ_{n1=0}^{N1-1} [H_{n1}(z)X_{l1-n1}(z) - Ĥ_{n1}(z)X̂_{l1-n1}(z)]
    + j[Ĥ_{n1}(z)X_{l1-n1}(z) + H_{n1}(z)X̂_{l1-n1}(z)] modulo (z^{N2} - 1)    (3.75)

where H_{n1}(z), X_{m1}(z), and Y_{l1}(z) are the polynomials corresponding to the real
parts of the input and output sequences, and Ĥ_{n1}(z), X̂_{m1}(z), and Ŷ_{l1}(z) are the
polynomials corresponding to the imaginary parts of the input and output
sequences with

(3.76)

(3.77)

(3.78)

Since N2 = 2^t, z^{N2} - 1 = (z - 1) ∏_{v=0}^{t-1} (z^{2^v} + 1), and Y_{l1}(z) + jŶ_{l1}(z) can be
computed by the Chinese remainder reconstruction from the various reduced
polynomials Y_{v,l1}(z) + jŶ_{v,l1}(z) defined by

Y_{v,l1}(z) + jŶ_{v,l1}(z) ≡ Y_{l1}(z) + jŶ_{l1}(z) modulo (z^{2^v} + 1)    (3.79)

Y_{u,l1} + jŶ_{u,l1} ≡ Y_{l1}(z) + jŶ_{l1}(z) modulo (z - 1).    (3.80)

The terms Y_{u,l1} + jŶ_{u,l1} and Y_{0,l1} + jŶ_{0,l1} correspond, respectively, to z ≡ 1
and z ≡ -1 and are therefore scalar convolutions of dimension N1, which are
computed with M1 complex multiplications. Each of these convolutions is
calculated with 3M1 real multiplications by using complex arithmetic with 3
real multiplications per complex multiplication. For v ≠ 0, j ≡ z^{2^{v-1}} and each
complex polynomial convolution Y_{v,l1}(z) + jŶ_{v,l1}(z) is computed with only two
real convolutions

(3.81)

(3.82)

where

Y_{v,l1}(z) = [Q_{v,l1}(z) + Q̄_{v,l1}(z)]/2    (3.83)

Ŷ_{v,l1}(z) ≡ -z^{2^{v-1}}[Q_{v,l1}(z) - Q̄_{v,l1}(z)]/2 modulo (z^{2^v} + 1).    (3.84)

The multiplications by z^{2^{v-1}} modulo (z^{2^v} + 1) in these expressions correspond to
a simple rotation by 2^{v-1} words of the 2^v-word polynomials, followed by sign
inversion of the overflow words. Thus, for v ≠ 0, complex multiplication is
implemented with only two real multiplications and the total number of real
multiplications for the convolution of size N1N2 becomes 2M1(M2 + 1) instead
of 3M1M2 or 4M1M2 with conventional approaches using 3 or 4 real multiplica-
tions per complex multiplication.
It should be noted that this saving in number of multiplications is achieved
without an increase in number of additions. This can be seen by noting that the
computation process is equivalent to that encountered in evaluating 2 real con-
volutions of length N, plus 6M1 additions corresponding to the complex multi-
plications required for the calculation of Y_{u,l1} + jŶ_{u,l1} and Y_{0,l1} + jŶ_{0,l1}, and a
total of 4N1(N2 - 2) additions for constructing the auxiliary polynomials.
Therefore, if A is the number of additions for a real convolution of length N, the
number of real additions for a complex convolution becomes 2A + 4N +
6M1 - 8N1 if the sequence h_n + jĥ_n is fixed. The same complex convolution of
N terms would have required 2A + 2M1M2 or 2A + 3M1M2 real additions if
complex multiplication algorithms with 4 multiplications and 2 additions or 3
multiplications and 3 additions had been used. This demonstrates how, in most
cases, computing complex convolutions by using the properties of j in a ring of
polynomials can save multiplications and additions over conventional methods.
With this approach, a complex convolution of 72 terms is computed with 570
real multiplications and 3230 real additions as opposed to 1064 multiplications
and 3432 additions for the same convolution calculated with a complex multi-
plication algorithm using 4 real multiplications, and 798 multiplications, 3698
additions in the case of a complex multiplication algorithm using 3 multipli-
cations.
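The use of j ≡ z^{q/2} can be illustrated with a short Python sketch (all function names are invented for the illustration): a complex polynomial product modulo (z^q + 1) is obtained from the two real products of (3.81), (3.82) and the reconstruction (3.83), (3.84).

# Sketch: complex polynomial product modulo (z^q + 1), q even, with two real
# polynomial products, using j == z^(q/2) in that ring.
def mul_mod_zq_plus_1(a, b):
    # plain real (or complex) polynomial product of two length-q lists mod (z^q + 1)
    q = len(a)
    y = [0] * q
    for i, ai in enumerate(a):
        for j_, bj in enumerate(b):
            k = i + j_
            if k < q:
                y[k] += ai * bj
            else:
                y[k - q] -= ai * bj        # z^q == -1
    return y

def mul_by_z_power(a, p):
    # multiply by z^p modulo (z^q + 1): rotate by p words with sign inversion
    q = len(a)
    out = [0] * q
    for i, ai in enumerate(a):
        k = i + p
        if k < q:
            out[k] += ai
        else:
            out[k - q] -= ai
    return out

def complex_mul_mod_zq_plus_1(hr, hi, xr, xi):
    q = len(hr)
    half = q // 2
    hp = [a + b for a, b in zip(hr, mul_by_z_power(hi, half))]
    hm = [a - b for a, b in zip(hr, mul_by_z_power(hi, half))]
    xp = [a + b for a, b in zip(xr, mul_by_z_power(xi, half))]
    xm = [a - b for a, b in zip(xr, mul_by_z_power(xi, half))]
    Q = mul_mod_zq_plus_1(hp, xp)          # first real product, cf. (3.81)
    Qbar = mul_mod_zq_plus_1(hm, xm)       # second real product, cf. (3.82)
    yr = [(a + b) / 2 for a, b in zip(Q, Qbar)]                          # (3.83)
    yi = mul_by_z_power([-(a - b) / 2 for a, b in zip(Q, Qbar)], half)   # (3.84)
    return yr, yi

hr, hi = [1.0, 2.0, 0.0, -1.0], [0.5, 0.0, 3.0, 1.0]
xr, xi = [2.0, -1.0, 1.0, 0.0], [0.0, 1.0, -2.0, 0.5]
yr, yi = complex_mul_mod_zq_plus_1(hr, hi, xr, xi)
ref = mul_mod_zq_plus_1([complex(a, b) for a, b in zip(hr, hi)],
                        [complex(a, b) for a, b in zip(xr, xi)])
assert all(abs(a - c.real) < 1e-9 and abs(b - c.imag) < 1e-9
           for a, b, c in zip(yr, yi, ref))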
We have, thus far, described the computation of complex convolutions in
rings of polynomials modulo (z^q + 1), q even. With some loss of efficiency, the
same concept can also be applied to other rings. For example, in a field modulo
P(z) = (z^3 - 1)/(z - 1) = z^2 + z + 1, we have [(2z + 1)/√3]^2 ≡ -1 modulo
P(z). Thus, in this case, j ≡ (2z + 1)/√3 modulo P(z). Note, however, that this
approach is less attractive than when j is defined modulo (z^q + 1), q even, since
multiplications by j = (2z + 1)/√3 cannot be implemented with simple poly-
nomial rotations and additions.

3.3.4 Optimum Block Length for Digital Filters

In many digital filtering applications, one of the sequences, h_n, is of limited
length, N1, and represents the impulse response of the filter. The other sequence,
x_m, is usually the input data sequence and can be considered to be of infinite
length. The noncyclic convolution of these sequences can be obtained by com-
puting a series of circular convolutions of length N and reconstructing the
digital filter output by the overlap-add or overlap-save techniques described in
Sect. 3.1.
With the overlap-add technique, the data sequence x_m is sectioned into blocks
of length N2 and the aperiodic convolution of each block with the sequence h_n is
computed by using a length-N cyclic convolution such that N = N1 + N2 - 1.
In this cyclic convolution, the input blocks of length N are obtained by append-
ing N2 - 1 zeros to the sequence h_n and N1 - 1 zeros to the data sequence
blocks. If M1(N) and M2(N, N1) are, respectively, the number of multiplications
per output point for the cyclic convolutions of length N and for the N1-tap
digital filter, M2(N, N1) is given by

M2(N, N1) = N M1(N)/(N - N1 + 1).    (3.85)

Similarly, A2(N, N1), the number of additions for the digital filter, is given as a
function of A1(N), the number of additions per output point for the cyclic con-
volutions of length N, by

A2(N, N1) = [N A1(N) + N1 - 1]/(N - N1 + 1).    (3.86)

Since M1(N) is an increasing function of N and N/(N - N1 + 1) is a de-
creasing function of N, there is an optimum block size N which minimizes the
number of multiplications. Table 3.6 lists optimum block sizes N and cor-
responding numbers of operations for digital filters of various tap lengths N1
computed by circular convolutions and split nesting (Table 3.5). It can be seen
that N is typically not much larger than N1. This is due to the fact that M1(N) is a
rapidly increasing function of N.
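The search for the optimum block size can be illustrated with a small Python sketch; the table of per-point counts below copies a few rows of Table 3.5, and the function name is chosen for this example.

# Sketch: pick the cyclic-convolution length N that minimizes the number of
# multiplications per filter output point, following (3.85).
CONV_COUNTS = [(18, 2.11), (24, 2.33), (36, 2.64), (60, 3.33),
               (84, 3.81), (120, 4.67), (180, 5.28)]   # (N, M1(N)/N) from Table 3.5

def best_block(N1):
    best = None
    for N, m1 in CONV_COUNTS:
        if N >= N1:
            m2 = m1 * N / (N - N1 + 1)     # (3.85)
            if best is None or m2 < best[0]:
                best = (m2, N)
    return best

print(best_block(16))   # -> about (4.44, 60), matching the N1 = 16 row of Table 3.6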
When compared to FFT filter methods, these results show that the split nest-
ing method is preferable to the FFT approach for tap lengths up to about 256.
FFT methods require larger block sizes for filter implementation because MI(N)
increases much more slowly with N for FFTs than for split nesting.

Table 3.6. Optimum block sizes and number of operations for digital filters computed by
circular convolutions and split nesting.

Filter tap    Multiplications    Additions    Optimum
length        per point          per point    block size
N1            M2(N, N1)          A2(N, N1)    N

2 2.23 10.88 18
4 2.53 12.46 18
8 3.29 14.77 24
16 4.44 21.76 60
32 6.04 34.25 84
64 8.12 37.98 180
128 9.78 51.36 360
256 12.10 72.17 1260
512 14.53 94.91 2520

3.4 Digital Filtering by Multidimensional Techniques

We have seen in the preceding sections that a filtering process can be computed
as a sequence of cyclic convolutions which are evaluated by nesting. The nesting
method maps one-dimensional cyclic convolutions of length N into a multidi-
mensional cyclic convolution of size N1 × N2 × ... × Nd, provided N1, N2, ..., Nd,
the factors of N, are relatively prime.
We shall now present a method introduced by Agarwal and Burrus [3.8]
which maps directly the one-dimensional aperiodic convolution of two length-N
sequences into a multidimensional aperiodic convolution of arrays of dimension
N1 × N2 × ... × Nd, where N1, N2, ..., Nd are factors of N which need not be re-
latively prime. The aperiodic convolution of two length-N sequences h_n and x_m
is given by

y_l = Σ_{n=0}^{N-1} h_n x_{l-n},   l = 0, ..., 2N - 2,    (3.87)

where the output sequence y_l is of length 2N - 1 and the sequences h_n, x_m, and
y_l are defined to be zero outside their definition length. In polynomial notation,
this convolution can be considered as the product of two polynomials of degree
N - 1 with

Y(z) = H(z)X(z)    (3.88)

H(z) = Σ_{n=0}^{N-1} h_n z^n    (3.89)

X(z) = Σ_{m=0}^{N-1} x_m z^m    (3.90)

Y(z) = Σ_{l=0}^{2N-2} y_l z^l.    (3.91)

We assume now that N is composite, with N = N1N2. In this case, the one-
dimensional polynomials H(z) and X(z) are mapped into the two-dimensional
polynomials H(z, z1) and X(z, z1) by redefining the indices n and m and introduc-
ing a new polynomial variable z1, defined by z^{N1} = z1

(3.92)

(3.93)

(3.94)

(3.95)

Multiplying H(z, z1) by X(z, z1) yields a new two-dimensional polynomial
Y(z, z1) defined by

(3.96)

(3.97)

The various samples of y_l are then obtained as the coefficients of z in Y(z, z1),
after setting z1 = z^{N1}. Hence, the aperiodic convolution y_l can be considered as
a two-dimensional convolution of arrays of size N1 × N2. More precisely, this
two-dimensional convolution is a one-dimensional convolution of length 2N2 - 1
where the N2 input samples are replaced by N2 polynomials of N1 terms and all
multiplications are replaced by aperiodic convolutions of two length-N1 se-
quences

Y_{s1}(z) = Σ_{n2=0}^{N2-1} H_{n2}(z) X_{s1-n2}(z).    (3.98)

Under these conditions, the convolution y_l is computed by a method somewhat
similar to the nesting approach used for cyclic convolutions, but with short
cyclic convolution algorithms replaced by aperiodic convolution algorithms.
Thus, if M1 and M2 are the number of multiplications required for the aperiodic
convolutions of length 2N1 - 1 and 2N2 - 1, then M, the number of multi-
plications corresponding to the aperiodic convolution y_l, is given by

M = M1M2.    (3.99)

The number of additions is slightly more difficult to evaluate than for con-
ventional nesting because here the nested convolutions are noncyclic. Hence, the
input additions corresponding to the length 2N2 - 1 convolution algorithm are
performed on polynomials of N1 terms, while the output additions are done on
polynomials of 2N1 - 1 terms. Moreover, since each polynomial Y_{s1}(z) of
2N1 - 1 terms is multiplied by z1^{s1} = z^{N1 s1}, the various polynomials Y_{s1}(z) overlap by
N1 - 1 samples. Let A1,2 and A2,2 be the number of input and output additions
required for a length-(2N2 - 1) aperiodic convolution and A1 be the total
number of additions corresponding to the aperiodic convolution of length
2N1 - 1. Then, A, the total number of additions for the aperiodic convolution
y_l, is given by

(3.100)
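The mapping itself can be illustrated by a short Python sketch; the helper names are invented for the example, and plain nested loops stand in for the fast short algorithms.

# Sketch: one- to two-dimensional mapping of an aperiodic convolution with
# N = N1 * N2, following the block/polynomial interpretation above.
def aperiodic(a, b):
    y = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def aperiodic_2d(h, x, N1):
    N2 = len(h) // N1
    hb = [h[i*N1:(i+1)*N1] for i in range(N2)]   # blocks = polynomials of N1 terms
    xb = [x[i*N1:(i+1)*N1] for i in range(N2)]
    # outer aperiodic convolution in z1; every product is a length-N1 aperiodic
    # convolution producing a block of 2*N1 - 1 terms
    yb = [[0] * (2*N1 - 1) for _ in range(2*N2 - 1)]
    for i in range(N2):
        for j in range(N2):
            blk = aperiodic(hb[i], xb[j])
            for k, v in enumerate(blk):
                yb[i + j][k] += v
    # set z1 = z^N1: adjacent output blocks overlap by N1 - 1 samples
    y = [0] * (2*N1*N2 - 1)
    for s, blk in enumerate(yb):
        for k, v in enumerate(blk):
            y[s*N1 + k] += v
    return y

h = [1, 2, 3, 4, 5, 6]
x = [6, 5, 4, 3, 2, 1]
assert aperiodic_2d(h, x, N1=3) == aperiodic(h, x)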

When N is the product of more than two factors, the same formulation can be
used recursively. Since the factors of N need not be relatively prime, N can be
chosen to be a power of a prime, usually 2 or 3. In this case, with N = 2^t or
N = 3^t, the aperiodic convolution is computed by t identical stages calculating
aperiodic convolutions of length 3 or 5. Only one short convolution algorithm
needs to be stored, and computer implementation is greatly simplified by use of the
regular structure of the radix-2 or radix-3 nesting algorithm.
We give, in Sect. 3.7.3, short algorithms which compute aperiodic convolu-
tions of sequences of lengths 2 and 3. For the first algorithm, the noncyclic con-
volution of length 2 is calculated with 3 multiplications when one of the
sequences is fixed. Using this algorithm, the aperiodic convolution of two se-
quences of length N = 2^t is computed with M multiplications, M being given by

M = 3^t.    (3.101)

The number of multiplications per input point is therefore equal to (1.5)^t. If the
same aperiodic convolution is calculated by FFT, with 2 real sequences per FFT
of dimension 2^{t+1}, as shown in Sect. 4.6, the number of real multiplications per
input point is 3(2 + t). Since 3(2 + t) increases much more slowly with t than
(1.5)^t, the FFT approach is preferred for long sequences. However, for small
values of t, up to t = 8, (1.5)^t is smaller than 3(2 + t) and the multidimensional
noncyclic nesting method yields a lower number of multiplications than the
conventional radix-2 FFT approach. This fact restricts the region of preferred
applicability for the algorithm to aperiodic convolution lengths of about 256 input
samples or less.
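The crossover quoted above can be checked with a two-line Python loop; the FFT estimate 3(2 + t) is the one quoted in the text, used here only for comparison.

# Sketch: multiplications per input point, radix-2 aperiodic nesting versus the
# FFT estimate quoted above, for input length N = 2^t.
for t in range(1, 12):
    nesting, fft = 1.5 ** t, 3 * (2 + t)
    print(2 ** t, round(nesting, 2), fft, 'nesting' if nesting < fft else 'FFT')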
The radix-2 nesting algorithm can also be redefined to be a radix-3 nesting
process which uses the short algorithm in Sect. 3.7.3. In this case, the aperiodic
convolution of two sequences of length 3 is calculated with 5 multiplications and
an aperiodic convolution of N = 3^{t1} input samples is computed in t1 stages
with M = 5^{t1} multiplications. Thus, the number of multiplications per input
point is equal to (5/3)^{t1}. This result can be compared, for convolutions of equal
length, to that of the radix-2 algorithm by setting 3^{t1} = 2^t, which yields t1 = t
log_3 2. Consequently, the number of multiplications per input point reduces to
(5/3)^{t log_3 2} = (1.38)^t for an aperiodic convolution computed with the radix-3
nesting algorithm, as opposed to (1.5)^t for a convolution of equal length calcu-
lated by a radix-2 algorithm. Thus, the number of multiplications increases less
rapidly with convolution size for a radix-3 algorithm than for a radix-2 algo-
rithm, as can be seen in Table 3.7 which lists the number of operations for
various convolutions computed by radix-2 and radix-3 algorithms.

Table 3.7. Number of arithmetic operations for aperiodic convolutions computed by radix-2
and radix-3 one-dimensional to multidimensional mapping.

Length of          Number of          Number of    Multiplications    Additions per
input sequences    multiplications    additions    per input point    input point
N                  M                  A            M/N                A/N

2 3 3 1.50 1.50
3 5 20 1.67 6.67
4 9 19 2.25 4.75
8 27 81 3.37 10.12
9 25 194 2.78 21.56
16 81 295 5.06 18.44
27 125 1286 4.63 47.63
32 243 993 7.59 31.03
64 729 3199 11.39 49.98
81 625 7412 7.72 91.51
128 2187 10041 17.09 78.45
243 3125 40040 12.86 164.77
256 6561 31015 25.63 121.15

It can be seen, for instance, that a convolution of about 256 input terms is
computed by a radix-3 algorithm with approximately half the number of multi-
plications corresponding to a radix-2 algorithm. Unfortunately, this saving is
achieved at the expense of an increased number of additions so that the
advantage of the radix-3 algorithm over the radix-2 algorithm is debatable.
A comparison with cyclic nesting techniques can be made by noting that the
output of an N-tap digital filter can be computed by sectioning the input data
sequence into successive blocks of N samples, calculating a series of aperiodic
convolutions of these blocks with the sequence of tap values, and adding the
overlapping output samples of these convolutions. With this method, the
number of multiplications per output sample of the digital filter is the same as
the number of multiplications per input point of the aperiodic convolutions,
while the numbers of additions per point differ only by (N - 1)/N. Therefore, it
can be seen that the implementation of a digital filtering process by noncyclic
nesting methods usually requires a larger number of arithmetic operations than
when cyclic methods are used. This difference becomes very significant in favor
of cyclic nesting for large convolution lengths. Thus, while an aperiodic nesting
method using a set of identical computation stages is inherently simpler to
implement than the mixed radix method used with cyclic nesting, this ad-
vantage is offset by an increase in number of arithmetic operations.
The aperiodic nesting method described above can be considered as a
generalization of the overlap-add algorithm. An analogous multidimensional
formulation can also be developed for the overlap-save technique [3.8]. This
yields aperiodic nesting algorithms that are very similar to those derived from
the overlap-add algorithm and give about the same number of operations.

3.5 Computation of Convolutions by Recursive Nesting


of Polynomials

The calculation of a convolution by cyclic or noncyclic nesting of small length


convolutions has many desirable attributes. It requires fewer arithmetic opera-
tions than the FFT approach for sequence lengths of up to 200 samples and
does not require complex arithmetic with sines and cosines. For large convolu-
tion lengths, however, cyclic and noncyclic nesting techniques are of limited
interest because the number of operations increases exponentially with the
number of stages, instead of linearly as for the FFT. In this section, we shall
describe a computation method, based on the recursive nesting of irreducible
polynomials [3.9], which retains to some extent the regular structure of the
aperiodic nesting approach but, nevertheless, yields a number of operations
which increases only linearly with the number of stages. We first discuss the case
of a length-N circular convolution, with N = 2^t,

y_l = Σ_{n=0}^{N-1} h_n x_{l-n},   l = 0, ..., N - 1.    (3.102)

In polynomial notation, each output sample y_l corresponds to the coefficient
of z^l in the product modulo (z^N - 1) of two polynomials H(z) and X(z)

Y(z) ≡ H(z)X(z) modulo (z^N - 1)    (3.103)

H(z) = Σ_{n=0}^{N-1} h_n z^n    (3.104)

X(z) = Σ_{m=0}^{N-1} x_m z^m    (3.105)

Y(z) = Σ_{l=0}^{N-1} y_l z^l.    (3.106)

In the field of rational numbers, ZN - 1 factors into t + 1 irreducible polyno-


mials
,-1
ZN - 1 = (z - 1) II
v-o
(Z2' + 1). (3.107)

Thus, the computation of Y(z) can be accomplished via the Chinese remainder
theorem. In this case, the main part of the calculation consists in evaluating the
t + 1 polynomial products modulo (z^{2^{t-1}} + 1), ..., (z + 1), (z - 1). For the
polynomial products modulo (z + 1), (z - 1), however, we have z ≡ ±1 and
the polynomial multiplications reduce to simple scalar multiplications.
We shall show now that, for higher order polynomials, the process can be
greatly simplified by using a recursive nesting technique. We have seen in Sect.
3.2.2 that if N = N1N2, a polynomial product modulo (z^N + 1) can be com-
puted as a polynomial product modulo (z^{N1} - z1) in which the scalars are re-
placed by polynomials of N2 terms evaluated modulo (z1^{N2} + 1). If N = 2^t and
N1 ≤ N2, the polynomial product modulo (z^{N1} - z1) can be computed by inter-
polation on z = 0, 1/z = 0 and powers of z1, with 2N1 - 1 multiplications and
N1 - 11 + 4N1 log2 N1 additions. This means that, if M1 and A1 are the number
of multiplications and additions corresponding to the polynomial product
modulo (z^{N2} + 1), and if N1 = 2^{t1}, N2 = 2^{t2}, the number of multiplications
M and additions A required to evaluate the polynomial product modulo (z^N + 1)
is

(3.108)

(3.109)

The same method can be extended to compute a polynomial product modulo
(z^{N1^2 N2} + 1) by nesting a polynomial product modulo (z^{N1} - z1) with a polyno-
mial product modulo (z^{N1 N2} + 1), which is computed as indicated above. Thus,
a polynomial product modulo (z^N + 1), with N = N1^d N2, is calcu-
lated recursively by using d identical stages implementing polynomial products
modulo (z^{N1} - z1), where the scalars are replaced in the first stage by polyno-
mials of N1^{d-1} N2 terms, in the second stage by polynomials of N1^{d-2} N2 terms, and
in the last stage by polynomials of N2 terms.
In this case, the total number of multiplications M for the polynomial pro-
duct of dimension N = 2^t = N1^d N2, computed by a radix-N1 algorithm, is

(3.110)

or, with d t1 + t2 = t,

(3.111)

Since (2^{t1+1} - 1)^{1/t1} is larger than 2, the number of multiplications M increases
exponentially with t. If we take, for instance, t1 = t2 = 1 and M1 = 3, then M = 3^t
and M/N = (1.5)^t. This result is very similar to that obtained for aperiodic
convolutions.
The foregoing recursive method, however, can be greatly improved by using
a mixed radix approach based upon a set of exponentially increasing radices.
Consider, for example, the computation of a polynomial product modulo (z^N +
1), with N = 2^t and t = 2^d. We begin the calculation of this polynomial pro-
duct with a polynomial product modulo (z^2 + 1) evaluated with 3 multipli-
cations as shown in Sect. 3.7.2. Then, as shown above, a polynomial product
modulo (z^4 + 1) can be computed modulo (z^2 - z1), (z1^2 + 1) with 9 multipli-
cations. Next, instead of calculating a polynomial product modulo (z^8 + 1)
as in the fixed radix scheme, we compute directly a polynomial product modulo
(z^16 + 1) as a polynomial product modulo (z^4 - z1) of polynomials defined mo-
dulo (z1^4 + 1). This procedure is then continued recursively, each stage being
implemented with a radix which is the square of the radix of the preceding stage.
Finally, at stage v, the polynomial product modulo (z^{2^{2^{v+1}}} + 1) is computed
modulo (z^{2^{2^v}} - z1), (z1^{2^{2^v}} + 1) with (2^{2^v+1} - 1) M_v multiplications, where M_v
is the number of multiplications for the polynomial product modulo (z^{2^{2^v}} + 1).
Therefore, the total number of multiplications for the polynomial product mod-
ulo (z^N + 1) is

M = 3 ∏_{v=0}^{d-1} (2^{2^v+1} - 1).    (3.112)

Thus,

(3.113)

and, since N = 2^{2^d}, the number of multiplications per output sample, M/N,
satisfies the inequality

M/N < (9/8) log2 N.    (3.114)

Thus, with this approach, the number of multiplications increases much more
slowly with N than for other methods which are based upon the use of constant
radices, or conventional cyclic or noncyclic nesting. Note that, except for the
constant multiplicative factor 9/8, the law defined by (3.114) is essentially the
same as that for convolution via the FFT method.
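Equation (3.112) is easy to evaluate; the following Python sketch, given only as an illustration, reproduces the multiplication counts of the first rows of Table 3.8 corresponding to sizes that are squared at each stage.

# Sketch: multiplication count (3.112) for polynomial products modulo
# (z^N + 1), N = 2^t, t = 2^d, with radices that square at every stage.
def mixed_radix_mults(d):
    m = 3                              # product modulo (z^2 + 1)
    for v in range(d):
        m *= 2 ** (2 ** v + 1) - 1     # stage v contributes (2^(2^v + 1) - 1)
    return m

for d in range(4):
    N = 2 ** (2 ** d)
    print(N, mixed_radix_mults(d))     # -> (2, 3), (4, 9), (16, 63), (256, 1953)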
The mixed radix nesting technique is not limited to dimensions N such that
N = 2^t, t = 2^d. Vector lengths with t ≠ 2^d can be accommodated with a slight
loss of efficiency by using an initial polynomial other than (z^2 + 1) and a set
of increasing radices that are an approximation of an exponential law. We list
in Table 3.8 the arithmetic operation count for various polynomial products that
can be computed by this technique.

Table 3.8. Number of arithmetic operations for polynomial products modulo (z^N + 1),
N = 2^t, computed by mixed radix nesting.

Ring          Radices             Number of          Number of
                                  multiplications    additions

z^2 + 1       2                   3                  3
z^4 + 1       2, 2                9                  15
z^8 + 1       2, 2, 2             27                 57
z^16 + 1      2, 2, 4             63                 205
z^32 + 1      2, 2, 2, 4          189                599
z^64 + 1      2, 2, 2, 8          405                1599
z^128 + 1     2, 2, 4, 8          945                4563
z^256 + 1     2, 2, 4, 16         1953               10531
z^512 + 1     2, 2, 2, 4, 16      5859               26921
z^1024 + 1    2, 2, 2, 4, 32      11907              58889
z^2048 + 1    2, 2, 2, 8, 32      25515              143041
z^4096 + 1    2, 2, 2, 8, 64      51435              304769

The number of operations for circular convolutions of length N, with N = 2^t,
can easily be deduced recursively from the data given in Table 3.8 by noting that
a circular convolution of length 2N is computed by reducing the input sequences
modulo (z^N - 1) and (z^N + 1), evaluating a circular convolution of N terms and
a polynomial product modulo (z^N + 1), and reconstructing the output samples
by the Chinese remainder theorem. When one of the input sequences is fixed, the
Chinese remainder operation can be viewed as the inverse operation of the
reductions, and the total number of additions for reductions and Chinese re-
mainder operations is 4N. Thus, all cyclic convolutions of length 2^t are computed
recursively from only two short algorithms, corresponding to the convolution of
Table 3.9. Optimum block size and number of operations for digital filters computed by
circular convolutions of N terms, N = 2^t, and mixed radix nesting of polynomials.

Filter tap    Multiplications    Additions    Optimum
length        per point          per point    block size
N1            M2(N, N1)          A2(N, N1)    N

2 2.00 4.43 8
4 2.80 9.80 8
8 4.16 16.43 32
16 5.98 23.39 64
32 7.19 31.11 128
64 8.00 43.83 512
128 9.34 51.28 512
256 11.91 62.37 2048
512 12.80 76.09 8192

2 terms and the polynomial product modulo (z^2 + 1) given, respectively, in


Sects. 3.7.1 and 3.7.2. Using this approach, a circular convolution of 1024 points
is computed with 9455 multiplications and 48585 additions.
We give in Table 3.9 the number of operations per output sample for digital
filters with fixed tap coefficients computed by the overlap-add algorithm with
circular convolutions evaluated by the mixed radix nesting of polynomials. It
can be seen, by comparison with Table 3.6, that this method yields a smaller
operation count than the split nesting technique for digital filters of more than 64
taps. It should be noted that these computational savings are achieved along
with a simpler computational structure, since all the algorithms are designed to
support vector lengths which are powers of two, as opposed to the product of
relatively prime radices for split nesting. Another advantage of the mixed radix
nesting of polynomials is that only two very short algorithms need to be stored,
all other algorithms being implemented by a single FFT-type routine, as des-
cribed in Sect. 3.2.2.

3.6 Distributed Arithmetic

Thus far, we have restricted our discussion of fast convolution algorithms to


algebraic methods. These algorithms can be used with any kind of arithmetic
and are easily programmed on standard computers with conventional binary
arithmetic. We shall see now that significant additional savings can be achieved
if the algorithms are reformulated at the bit level. In this reformulation, a B-bit
word is represented as a polynomial of B binary coded terms, and the arithmetic
is redistributed throughout the algorithm structure [3.10, 11].
With this distributed arithmetic approach, an aperiodic scalar convolution
is converted into a two-dimensional convolution. To demonstrate this point, we
focus initially on the convolution y_l of two length-N data sequences x_m and h_n

y_l = Σ_{n=0}^{N-1} h_n x_{l-n},   l = 0, ..., 2N - 2.    (3.115)

We assume now that each word of the input sequences is binary coded with B
bits

h_n = Σ_{b1=0}^{B-1} h_{n,b1} 2^{b1},   h_{n,b1} ∈ {0, 1}    (3.116)

x_m = Σ_{b2=0}^{B-1} x_{m,b2} 2^{b2},   x_{m,b2} ∈ {0, 1}.    (3.117)

Substituting (3.116) and (3.117) into (3.115) yields

y_l = Σ_{n=0}^{N-1} Σ_{b1=0}^{B-1} Σ_{b2=0}^{B-1} h_{n,b1} x_{l-n,b2} 2^{b1+b2}.    (3.118)

Changing the coordinates with b3 = b1 + b2 gives

y_l = Σ_{b3=0}^{2B-2} ( Σ_{n=0}^{N-1} Σ_{b1=0}^{B-1} h_{n,b1} x_{l-n,b3-b1} ) 2^{b3}    (3.119)

which demonstrates that y_l can indeed be considered as the two-dimensional
aperiodic convolution of the two binary arrays of N × B terms, h_{n,b1} and x_{m,b2}.
Note that in this formulation, each digit y_{l,b3} of y_l is no longer binary coded, so
that the operation defined by (3.119) must be followed by a code conversion if
y_l is to be defined in the same format as h_n and x_m.
It can be seen that the computation of y_l by (3.119) uses the well-known fact
that a scalar multiplication can be viewed as the convolution of two binary
sequences [3.12]. The full potential of this procedure becomes more apparent
when the order of operations is modified. By changing the order of operations in
(3.118), we obtain

y_l = Σ_{b1=0}^{B-1} y_{l,b1} 2^{b1}.    (3.120)

This shows that y_l is obtained by summing along dimension b1 the B convolu-
tions y_{l,b1} defined by

y_{l,b1} = Σ_{n=0}^{N-1} h_{n,b1} ( Σ_{b2=0}^{B-1} x_{l-n,b2} 2^{b2} ).    (3.121)

Each word y_{l,b1} is a length-N convolution of the N-bit sequence h_{n,b1} with the
N-word sequence Σ_{b2=0}^{B-1} x_{l-n,b2} 2^{b2} = x_{l-n}. Thus, each word of y_{l,b1} is obtained
by multiplying x_{l-n} by h_{n,b1} and summing along dimension n. Since h_{n,b1} can only
be 0 or 1, multiplication by h_{n,b1} is greatly simplified and can be considered as
a simple addition of N words, where some of the words are zero.
When the sequence x_m is fixed, large savings in number of arithmetic oper-
ations can be achieved by precomputing and storing all possible combinations of
y_{l,b1}. Since h_{n,b1} can only be 0 or 1, the number of such combinations is equal to
2^N. Thus, by storing the 2^N possible combinations y_{l,b1}, the various values of
y_{l,b1} are obtained by a simple table look-up addressed by the N bits h_{n,b1} and the
computation of y_l reduces to a simple shift-addition of B words. This is equi-
valent in hardware complexity to a conventional binary multiplier, except for
the memory required to store the 2^N combinations of y_{l,b1}.
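The table-look-up mechanism can be illustrated in Python; this is a sketch with invented names, in which the fixed sequence supplies the table contents and the bits of the other operand form the address, the same structure as (3.120), (3.121). Each output word of the convolution is one such bit-serial dot product with the current data window.

# Sketch: a distributed-arithmetic dot product for a short length-N convolution.
B = 8                      # word length of the bit-serial operand
N = 4                      # short convolution length
fixed = [3, -1, 4, 2]      # fixed sequence (table contents)

# 2^N partial sums, indexed by an N-bit address (bit n set -> include fixed[n])
table = [sum(fixed[n] for n in range(N) if (addr >> n) & 1) for addr in range(2 ** N)]

def da_dot(words):
    # dot product of 'fixed' with N unsigned B-bit words via table look-up
    acc = 0
    for b in range(B):                       # one look-up per bit plane
        addr = 0
        for n in range(N):
            addr |= ((words[n] >> b) & 1) << n
        acc += table[addr] << b              # shift-and-add of B partial results
    return acc

words = [17, 200, 3, 99]
assert da_dot(words) == sum(f * w for f, w in zip(fixed, words))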
Thus, the hardware required to implement a short length convolution in dis-
tributed arithmetic is not particularly more complex than that corresponding to
an ordinary multiplication. In practice, the amount of memory required to store
the precomputed partial products y_{l,b1} can be halved by coding the words h_n
with bit values equal to ±1 instead of 0, 1. In this case, the 2^N partial products
y_{l,b1} divide into 2^{N-1} amplitude values which can take opposite signs, and only
the 2^{N-1} amplitude values need to be stored, provided that the sign inversion is
implemented in hardware. Using this method, only 4 or 8 coefficients need to be


stored for the aperiodic convolution of length-3 or length-4 input sequences.
Larger convolutions are usually implemented by segmenting the input sequences
in groups of 3 or 4 samples in order to avoid excessive memory requirements.
Hence, replacing the direct computation of a convolution by a distributed arith-
metic structure can yield a performance improvement factor of 3 to 4, for ap-
proximately the same amount of hardware. Then, in turn, the various nesting
algorithms used for the evaluation oflarge convolutions can also be implemented
in distributed arithmetic with comparable improvement factors.
Thus, the complementary blending of nesting algorithms with a distributed
arithmetic structure provides an attractive combination for the implementation
of digital filters in special purpose hardware. In practice, the concept of distri-
buting the arithmetic operations at the bit level throughout a convolution can
be applied in a number of different ways. We shall see in Chap. 8 that a particu-
larly interesting case occurs in the computation of circular convolutions via
modular arithmetic. In this case, the two-dimensional formulation of scalar con-
volutions in distributed arithmetic leads to the definition of number theoretic
transforms which can greatly simplify the calculation procedure.

3.7 Short Convolution and Polynomial Product Algorithms

A detailed description of frequently used algorithms for short convolutions and


polynomial products is given in this section. These algorithms have been de-
signed to minimize the number of multiplications while avoiding an excessive
number of additions. The input sequences are labelled x_m and h_n. Since h_n is as-
sumed to be fixed, expressions involving h_n are presumed to be precomputed and
stored. The output sequence is y_l. Expressions written between parentheses in-
dicate grouping of additions. Input and output additions must be executed
in the index numerical order.

3.7.1 Short Circular Convolution Algorithms

Convolution of 2 Terms
2 multiplications, 4 additions
a0 = x0 + x1     b0 = (h0 + h1)/2
a1 = x0 - x1     b1 = (h0 - h1)/2
mk = ak bk,   k = 0, 1
y0 = m0 + m1
y1 = m0 - m1

Convolution of 3 Terms

4 multiplications, 11 additions
a0 = x0 + x1 + x2     b0 = (h0 + h1 + h2)/3
a1 = x0 - x2          b1 = h0 - h2
a2 = x1 - x2          b2 = h1 - h2
a3 = a1 + a2          b3 = (b1 + b2)/3
mk = ak bk,   k = 0, ..., 3
u0 = m1 - m3
u1 = m2 - m3
y0 = m0 + u0
y1 = m0 - u0 - u1
y2 = m0 + u1
Convolution of 4 Terms
5 multiplications, 15 additions
a0 = x0 + x2     b0 = h0 + h2
a1 = x1 + x3     b1 = h1 + h3
a2 = a0 + a1     b2 = (b0 + b1)/4
a3 = a0 - a1     b3 = (b0 - b1)/4
a4 = x0 - x2     b4 = [(h0 - h2) - (h1 - h3)]/2
a5 = x1 - x3     b5 = [(h0 - h2) + (h1 - h3)]/2
a6 = a4 + a5     b6 = (h0 - h2)/2
mk = a_{k+2} b_{k+2},   k = 0, ..., 4
u0 = m0 + m1     u2 = m4 - m3
u1 = m0 - m1     u3 = m4 - m2
y0 = u0 + u2
y1 = u1 + u3
y2 = u0 - u2
y3 = u1 - u3
Convolution of 5 Terms
10 multiplications, 31 additions
bo = ho - hz + h3 - h4
bl = hI - hz + h3 - h4
bz = ( - 2ho - 2hl + 3hz - 2h3 + 3h4)/5
h3 = - ho + hI - hz + h3

a4 = X3 - X4 b4 = - ho + hi - h2 + h4
as = a3 + a4 bs = (3h o - 2hl + 3h 2 - 2h3 - 2h4)/5
~=~-~ ~=-~+~
a7 = al -a4 b7 = hi - h2
as = a2 - as bs = (- ho - hi + 4h2 - h3 - h4)/5
a9 = Xo + XI + X2 + X3 + X4 b9 = (ho + hi + h2 + h3 + h4)/5
mk = akbk> k = 0, ... , 9
Uo = mo + m2 U3 = m4 + ms
UI = ml + m2 U4 = m6 + ms
U2 = m3 + ms Us = m7 + ms

Yo = Uo - U4 + m9
YI = - Uo - UI - U2 - U3 + m9
Y2 = U3 + Us + m9
Y3 = U2 + U4 + m9
Y4 = UI - Us + m9

Convolution of 7 Terms
16 multiplications, 70 additions

a3 = Xs - X6
a4 = ao + al
as = - ao + al
a6 = a2 + a3
a7 = - a2 + a3
bo = (-h6 - 2hs + 3h4 - h3 - 2h2 + hi + 2ho)/2
bl = (lOh6 +3h s - Ilh4 + IOh3 + 3h2 - Ilhl
- 4ho)/14
alO = as + as b2 = (- 2h6 + 3h s - h4 - 2h3 + 3h2 - hl )/6
all = a9 + a4 + a4 + as b3 = (- h6 + h4 - h3 + hl)/6
au = al b4 = 2h6 - hs - 2h4 + 3h3 - h2 - 2hl + ho
al3 = X3 - X6 bs = (- 2h6 + hs + 2h4 - h3 - 2h2 + 3hl
- ho)/2

b6 = (3h6 - Ilhs - 4h4 + lOh3 + 3h 2 - Ilhl


+ 10ho)/14
b7 = (3h6 - hs - 2h3 + 3h2 - hi - 2h o)/6
al6 = al4 + a6 + a6 + a7 b8 = (h s - h3 + hi - ho)/6
al 7 = a3 b9 = - h6 - 2hs + h4 + 2h3 - h2 - 2hl + 3ho
al 8 = a l3 - a8 blO = (2h 4 - h3 - 2h2 + hl)/2
al 9 = au - a9 b ll = (-2h6 - 2hs - 2h4 + 12h3 + 5h 2 - 9h l
- 2ho)J14
b12 = ( - 2h3 + 3h2 - h l )/6
b l3 = ( - h3 + h l)J6

b l4 = 2h3 - h2 - 2hl + ho
a23 = al 9+ (X2 + XI + Xo) + (X2 + XI + Xo) + X6
bls = (h6 + hs + h4 + h3 + h2 + hi + ho)J7
k = 0, ... , 15
Uo = mo + mlO UI2 = Uo + Ull

UI = ml + mll UI3 = UIO + U3


U2 = m2 + m12
U3 = m3 + ml 3 UlS = + U3 + U3 + u4) + U2
(U13
U4 = m4 + ml 4 UI6 = - U12 - U13 - + U3 + U3 + u4)
(U13

UI7 = U6 + U8
UI8 = +
UI7 U7

U7 = m7 - m12 UI9 = Us + UI8


U20 = UI7 + U8

U21 = U20 - U7
UIO = UI + U3 U22 = (U20 + U8 + U8 + u + U79)

UII = UIO + U2 U23 = - UI9 -U20 - (U20 + U8 + U8 + u 9)

Yo = Ul2 + mlS
YI = UI6 + U23 + mlS

Y2 = U22 + mlS
Y3 = U21 + mlS
Y4 = UI9 + mlS
Ys = UIS + mlS
Y6 = UI4 + mlS

Convolution of 8 Terms
14 multiplications, 46 additions
ao = Xo + X4 bo = ho + h4
al = XI + Xs b l = hi + hs
a2 = X2 + X6 b2 = hz + h6
a3 = X3 + X7 b 3 = h3 + h7
a4 = a o + az b = bo + bz
4

as = al + a3 bs = bl + b3
a6 = Xo - X4 b6 = {[- (h o - h4) + (hz - h6)] - [(hi - h s)
- (h3 - h7)]} /2
a7 = XI - Xs b7 = {[- (h o - h4) + (hz - h6)] + [(hi - hs)
+ (h3 - h7)]} /2
as = Xz - X6 b s = {[(h o - h4) + (hz - h6)] + [(hi - h s) + (h3
- h7)]} /2
a9 = X3 - X7 b9 = {[(h o - h4) + (hz - h6)] + [(hi - h s) - (h3
- h7)]} /2
alO = a o - az blO = [- (b o - b z) + (b l - b3)]/4
all = al - a3 b ll = [(b o - bz) + (b l - b3)]/4
al2 = a4 + as b12 = (b4 + bs)/8
al3 = a4 - as b l3 = (b 4 - b s)/8
al4 = a7 + a9 b l4 = [(h o - h4) - (h3 - h7)]/2
alS = a6 + as b ls = [(h o - h4) + (hi - h s)]f2
al6 = al S - al 4 b l6 = (h o - h4)/2
a)7 = as - a9 b l7 = [(h o - h4) + (hz - h6)]/2
alS = a6 - a7 b ls = [- (h o - h4) + (hz - h6)]/2
al 9 = al O + all b l9 = (b o - bz)/4
mk = ak+6 bk+6• k = O•...• 13

Uz = m3 + mll Ull = U6 + Us
U3 = mll - mz UIZ = UI + U3
U4 = ml + mlZ UI3 = U7 + U9
Us = mo - m12 UI4 = Uo + U4

U6 = ml 3 - ms UIS = - U6 + Us

U, = ml 3 + m4 UI6 = UI + Us
Us = m, + m6 UI' = - U, + U9

Yo = + Ull
UIO

YI = U 12 + U13
Y2 = U14 + UIS

Y3 = UI6 + UI'

Y4 = - UIO + Ull

Ys = - UI2 + UI3

Y6 = - U14 + UIS

y, = - UI6 + UI'

Convolution of 9 Terms
19 multiplications, 74 additions
bo = - + 2h6
ho - h3

b = - hI - h4 + 2h,
l

b2 = - h2 - hs + 2hs
b3 = ho - 2h3 + h6
b4 = hI - 2h4 + h,
as = Xs - Xs bs = h2 - 2hs + hs
a6 = Xo + X3 + X6 b6 = ho + h3 + h6

a, = XI + X + X,
4 b, = hI + h4 + h,
as = X2 + Xs + Xs bs = h2 + hs + hs
a9 = + a2
ao

alO = a 3 + as
all = a 6 + a, + as b9 = (b 6 + b, + bs)/9
a12 = alO + a 4 blO = (b o + 3b + 2b 2b l 3b 2 - 3 - 4 - bsV18
al 3 = a9 + al b = (b o - b + b + 3b + 2b s)/18
ll 2 3 4

b = blO + b
l2 ll

b13 = ( - bo + b b + bs)/6 l - 4

b l4 = (b o - b2 - b3 + b4 )/6
bls = b13 +b l4

b l6 = (2b o + b l - b2 - 2b 3 + bs)/3
bl , = (2b o - b2 + b4 )f3

a20 = aO b ls = b l7 - b l6
a21 = as b l9 = (b o - bl - 2b 2 + b4)/3
a22 = a2 - as b 20 = ( - b l +b 3 - 2b s)/3
a 23 = a2 b21 = b20 - b l9
a 24 = - a 22 +ao - a4 b22 = (b o - b2 - 2b 3 + 2b s)f9
a 25 = al 9 + as - al b23 = (- bo + b2 - b3 + bs)f9
a 26 = - a 2S + a 24 b24 = b23 - b22
a 27 = a6 - as b2S = (b 6 - bs)/3
a2S = a7 - as b26 = (b 7 - bs)f3
a 29 = a27 + a 2S b27 = (b 2S + b26)/3
mk = ak+1l bk + 9 , k = 0, ... , 18
Uo = ml + m2 Ull = Us + mil + U 9

UI = m 4 + ms Ul2 = U4 - Us + U2

U2 = ml 4 + ml S UI3 = U7 + Us + m9 + U6

U3 = Uo + U I U14 = U3 + m l2 + U9 + U2

U4 = m + m3 l UIS = Uo - U + U6 I

Us = m 4 + m6 UI6 = ml 6 - ml S

U6 = ml3 + ml 5 UI7 = ml 7 - ml S

U7 = - U3 + m7 UIS = mo + UI6
Us = U4 + Us UI9 = mo - UI6 - UI7

U9 = mlO - U6 U20 = mo + UI7


UIO = ms + U2 + U7
Yo = Ul3 - UIO + UIS

YI = U14 - Ull + UI9

Y2 = UIS - Ul2 + U20


Y3 = - U l3 + UIS
Y4 = - UI4 + UI9
Ys = - UIS + U20
Y6 = UIO + UIS
Y7 = UII + UI9
Ys = UI2 + U 20

3.7.2 Short Polynomial Product Algorithms

Polynomial Product Modulo (z^2 + 1)

3 multiplications, 3 additions
a0 = x0 + x1     b0 = h0
a1 = x1          b1 = h0 + h1
a2 = x0          b2 = h1 - h0
mk = ak bk,   k = 0, 1, 2
y0 = m0 - m1
y1 = m0 + m2

Polynomial Product Modulo (z^3 - 1)/(z - 1)

3 multiplications, 3 additions
a0 = x1          b0 = h0 - h1
a1 = x0 - x1     b1 = h0
a2 = x0          b2 = h1
mk = ak bk,   k = 0, 1, 2
y0 = m0 + m1
y1 = m0 + m2

Polynomial Product Modulo (Z4 + 1)


9 multiplications, 15 additions
ao = (XI + X3) bo = ho - h3
01 = (xo + X2) - (XI + X3) b l = ho
a2 = (xo + X2) b2 = ho + hi
b3 = ho + h2 + hi - h3
b4 = ho + h2
bs = ho + h2 + hi + h3
b6 = - ho + h2 + hi + h3
b7 = - ho + h2
bs = - ho + h2 - hi + h3
k = 0, ... ,8
Yo = (mo + ml) - (m3 + m 4)

YI = (m2 - ml) + (m4 - ms)


Y2 = (mo + ml) + (m6 + m 7)

Y3 = (m2 - ml) + (ms - m7)



Polynomial Product Modulo (ZS - 1)/(z - 1)


9 multiplications, 16 additions
ao = Xo bo = ho
al = XI b l = hi
a2 = Xo - XI b2 = - ho + hi
a3 = X2 b 3 = h2
a4 = X3 b 4 = h3
as = X2 - X3 bs = - h2 + h3
a6 = Xo - X2 b6 = h2 - ho
a7 = XI - X3 b7 = h3 - hi
as = - a6 + a7 bs = b6 - b7
mk = ak bk k = 0, ... , 8
Uo = mo - m7 UI = m2 + mo

YI = UI - m3 - m7
Y2 = Uo - m4 + m6
YJ = UI + ms + m6 + ms
Polynomial Product Modulo (Z9 - 1)/(z3 - 1)
15 multiplications, 39 additions
ao = Xo + X2
al = X3 + Xs
a2 = al + X4 b2 = (h o + 3h l + 2h2 - 2h3 - 3h4 - h s)/6
a3 = ao + XI b3 = (h o - h2 + h3 + 3h4 + 2h s)j6
b4 = b2 + b3
bs = ( - ho + hi - h4 + hs)/2
b6 = (h o - h2 - h3 + h4)/2
b7 = bs + b6
bs = 2ho + hi - h2 - 2h3 + hs
b9 = 2ho - h2 + h4
blO = b9 - bs
b ll = ho - hi - 2h2 + h4
b12 = - hi + h3 - 2hs
b13 = b l2 - b ll

a14 = - al2 + Xo - x 4 bl4 = (h o - h2 - 2h3 + 2hs)/3


alS = + Xs - XI
a9 bls = (- ho + h2 - h3 + hs)/3
al 6 = - al s + al 4 b l6 = blS - b l4
mk = ak+2 bk+2 k = 0, ... , 14
Uo = mo + ml U7 = - U3 + m6
UI = m3 + m4 Us = U4 + Us
U2 = ml3 + ml 4

U3 = Uo + UI

u4 = mo + m2
Us = m3 + ms
u6 = ml 2 + m l4
Yo = m7 + u2 + u7
YI = Us + mlo + u 9

Y2 = u4 - Us + U2
Y3 = u7 + Us + ms + U6
Y4 = u + ml1 + u9 + u2
3

Ys = U o- UI + u6
Polynomial Product Modulo (Z7 - 1)/(z - 1)
15 multiplications, 53 additions
ao = Xo + X2 bo = (- 2hs + 3h4 - h3 - 2h2 + hi + 2ho)/2
al = ao + XI bl = (3h s - llh4 + 10h 3 + 3h 2 - llhl - 4h o)/14
a2 = al + X2 b 2 = (3h s - h4 - 2h3 + 3h2 - h l )/6
a3 = X3 + Xs b3 = (h 4 - h3 + h )/6
l

a4 = a3 + X4 b4 = - hs - 2h4 + 3h3 - h2 - 2hl + ho


as = a4 + Xs b s = (h s + 2h4 - h3 - 2h2 + 3h ho)/2 l -

b6 = ( - llhs - 4h4 + 10h 3 + 3h 2 - llhl + 10ho)/14


b7 = (- hs - 2h3 + 3h2 - hi - 2h o)/6
as = ao - XI b s = (h s - h3 + hi - ho)/6
a9 = a2 + a2 - Xo b9 = - 2hs + h4 + 2h3 - h2 - 2hl + 3h o
blO = (2h4 - h3 - 2h2 + h )/2 l

b l1 = (- 2hs - 2h4 + 12h3 + 5h 2 - 9h l - 2h o)/14


bl2 = ( - 2h3 + 3h2 - h l )/6

b13 = (- h3 + h )/6
l

b l4 = 2h3 - hz - 2hl + ho

mk = ak+6 bk k = 0, ... ,14


Uo = ms + mo
Ul = m6 + ml
Uz = m7 + mZ
U3 = +
mS m3

U4 = m9 + m 4

Us = mlO + mO

U6 = mil + ml

U7 = ml2 + mZ

Us = ml 3 + m3

U9 = ml + m 4 4

UIO = Ul + U3 UZ3 = UZI - Uzo

Ull = UIO + Uz
Ul2 = U o + Ull

Yo = UI S Y3 = Ul 4 + UZI
Yl = Ul6 + Ul 9 Y4 = Ul2 + UZZ
Yz = UI S + UZ3 Ys = Uzo

Polynomial Product Modulo (ZS + 1)


21 multiplications, 77 additions
ao = Xo + Xz
al = Xl + X3

aZ = Xo - Xz

as = Xs + X7

as = a O + al bo = (ho + hi + h2 - h3 + h4 + hs + h6 + h7)/4
a9 = a4 + as bl = (- ho - hi - h2 - h3 + h4 + hs + h6 - h7)/4
alO = as +a 9 b 2 = (h3 - h4 - hs - h6)/4
b3 = (5ho - 5h l + 5h2 - 7h3 + 5h4 - 5h s + 5h6
- h7 )/20
b4 = ( - 5h o + 5h l - 5h 2 + h3 + 5h4 - 5h s + 5h6
-7h7 )/20
al3 = all + a12 bs = (3h3 - 5h4 + 5h s - 5h6 + 4h7)/20
al4 = a2 + a3 b6 = (h o + hi - h2 - h3 + h4 - hs - h6 + 3h7)/4
GIS = G6 +a 7 b7 = ( - ho + hi + h2 - 3h3 + h4 + hs - h6 - h7)/4

al6 = al 4 + alS bs = ( - hi + 2h3 - h4 + h6 - h7)/4

b9 = (5ho - 5h l - 5h 2 + h3 + 5h4 + 5h s - 5h6


- 3h7 )/20
blO = (- 5h o - 5h l + 5h 2 + 3h3 + 5h4 - 5h s - 5h6
+ h7 )/20
GI9 = al7 + alS bll = (5hl - 2h3 - 5h 4 + 5h6 + h7)/20
a 20 = Xo + X4 bl2 = ho - h3 + h4
bl3 = - 2ho + h3 - h7
bl4 = h3 - 2h4 + h7
bls = - h2+ h6 - 2h7
bl6 = 2h3 - 2h6 + 2h7
a2S = X7 b l7 = 2h2 - 2h3 + 2h7
a26 = al S - a9 + Xo - a23 bls = ( - h3 + h7)/5

a2S = a26 + a27 b20 = h3/ 5


mk = ak+8 bk k = 0, ... ,20
Uo = mo + m2
UI = ml + m 2

U2 = m3 + ms
u3 = m4 + ms
U4 = m6 + ms

Us = m7 + mS U21 = UI9 + UI9


U6 = m9 + mll U22 = ml S - U21

U7 = mID + mll

Us = m20 + m l9
U9 = + Us
ml 2 U 2S = Us + Us
UID = U + ml
9 3 U26 = U2S + U2S

Ull = Uo + U2 U27 = UI9 + U2S

UI2 = Uo - U2

Yo = UI3 + UI7 + U20


YI = UI2 - UIS + U23

Y2 = Ull - UIS + U2S

Y3 = UI2 + UIS + U27

Y4 = Ull + UIS + UID + UI9

Ys = - UI4 - UI6 + U24 + U26

Y6 = - UI3 + UI7 + U21 + U2S

Y7 = - UI4 + UI6 + UI9

3.7.3 Short Aperiodic Convolution Algorithms

Aperiodic Convolution of 2 Sequences of Length 2

3 multiplications, 3 additions
(1 input addition, 2 output additions)
a0 = x0          b0 = h0
a1 = x0 + x1     b1 = h0 + h1
a2 = x1          b2 = h1
mk = ak bk,   k = 0, ..., 2
y0 = m0
y1 = m1 - m0 - m2
y2 = m2

Aperiodic Convolution of 2 Sequences of Length 3

5 multiplications, 20 additions
(7 input additions, 13 output additions)
a0 = x1 + x2              b0 = h0/2
a1 = -x1 + x2             b1 = (h0 + h1 + h2)/2
a2 = x0                   b2 = (h0 - h1 + h2)/6
a3 = x0 + a0              b3 = (h0 + 2h1 + 4h2)/6
a4 = x0 + a1              b4 = h2
a5 = a0 + a0 + a1 + a3
a6 = x2
mk = a_{k+2} bk,   k = 0, ..., 4
u0 = m4 + m4     u3 = m2 + m2
u1 = m1 + m1     u4 = u0 - m0 - m3
u2 = m0 + m0     u5 = m1 + m2
y0 = u2                   y3 = -u4 - u5
y1 = u1 - u3 + u4         y4 = m4
y2 = -u2 + u3 + u5 - m4
4. The Fast Fourier Transform

The object of this chapter is to briefly summarize the main properties of the
discrete Fourier transform (DFT) and to present various fast DFT computation
techniques known collectively as the fast Fourier transform (FFT) algorithm.
The DFT plays a key role in physics because it can be used as a mathematical
tool to describe the relationship between the time domain and frequency do-
main representation of discrete signals. The use of DFT analysis methods has
increased dramatically since the introduction of the FFT in 1965 because the
FFT algorithm decreases by several orders of magnitude the number of arithme-
tic operations required for DFT computations. It has thereby provided a prac-
tical solution to many problems that otherwise would have been intractable.

4.1 The Discrete Fourier Transform

The DFT X_k of a sequence x_m of N terms is defined by

X_k = Σ_{m=0}^{N-1} x_m W^{mk},   k = 0, ..., N - 1.    (4.1)

The sequence Xm can be viewed as representing N consecutive samples x(mT)


of a continuous signal x(t), while the sequence Xk can be considered as repre-
senting N consecutive samples X(kf) in the frequency domain. Thus, the DFT is
an approximation of the continuous Fourier transform of a function. The rela-
tionship between the discrete and continuous Fourier transform is well known
and can be found in [4.1-4]. In our discussion of the DFT, we shall restrict our
attention to some of the properties that are used in various parts of this book.
An important property of the DFT is that Xm and Xk are uniquely related
by a transform pair, with the direct transform defined by (4.1) and an inverse
transform defined by

y_l = (1/N) Σ_{k=0}^{N-1} X_k W^{-lk},   l = 0, ..., N - 1.    (4.2)

It can easily be verified that (4.2) is the inverse of (4.1) by substituting X_k, given
by (4.1), into (4.2). This yields

y_l = Σ_{m=0}^{N-1} x_m (1/N) Σ_{k=0}^{N-1} W^{(m-l)k}.    (4.3)

Since W^N = 1, m - l is defined modulo N. For m - l ≡ 0 modulo N, S =
Σ_{k=0}^{N-1} W^{(m-l)k} = N. For m - l ≢ 0 modulo N, we have S = [W^{(m-l)N} - 1]/
(W^{m-l} - 1) and, since W^{m-l} ≠ 1, S = 0. Therefore, the only nonzero case
corresponds to l ≡ m, which gives y_l = x_m.


The DFT can be used to compute a circular convolution y_l of N terms, with

y_l = Σ_{n=0}^{N-1} h_n x_{l-n},   l = 0, ..., N - 1.    (4.4)

This is done by computing the DFTs H_k and X_k of h_n and x_m, by multiplying H_k
by X_k, and by computing the inverse transform c_l of H_k X_k. Hence c_l is given by

c_l = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} h_n x_m (1/N) Σ_{k=0}^{N-1} W^{(m+n-l)k}.    (4.5)

Using the same procedure as above, one finds that S = Σ_{k=0}^{N-1} W^{(m+n-l)k} be-
comes S = 0 for m + n - l ≢ 0 modulo N and S = N for m + n - l ≡ 0
modulo N. Thus, m ≡ l - n and y_l = c_l. Hence an N-point circular convolution
is calculated by three DFTs plus N multiplications. When the DFTs are com-
puted directly, this approach is not of practical value because each DFT is
computed with N^2 multiplications whereas the direct computation of the con-
volution requires N^2 multiplications. We shall see, however, that the method be-
comes very efficient when the DFTs are evaluated by a fast algorithm.
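The convolution property can be checked numerically with a direct-form DFT in Python; dft() is an illustrative helper written for this sketch, not a library routine.

# Sketch: an N-point circular convolution equals the inverse DFT of the
# product of the two DFTs, cf. (4.4), (4.5).
import cmath

def dft(x, inverse=False):
    N = len(x)
    s = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(s * 2j * cmath.pi * m * k / N) for m in range(N))
           for k in range(N)]
    return [v / N for v in out] if inverse else out

h = [1.0, 2.0, 0.0, -1.0]
x = [3.0, 1.0, 4.0, 1.0]
N = len(h)
direct = [sum(h[n] * x[(l - n) % N] for n in range(N)) for l in range(N)]
via_dft = dft([Hk * Xk for Hk, Xk in zip(dft(h), dft(x))], inverse=True)
assert all(abs(a - b.real) < 1e-9 for a, b in zip(direct, via_dft))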

4.1.1 Properties of the DFT

In order to present the main properties of the DFT as compactly as possible, we
shall use the following notation to represent a DFT relationship between a se-
quence x_m and its transform X_k:

{x_m} ↔ {X_k}.    (4.6)

Assuming a second sequence h_n and its DFT H_k, we now establish the following
DFT properties.

Linearity

{x_m} + {h_n} ↔ {X_k} + {H_k}    (4.7)

{p x_m} ↔ {p X_k}.    (4.8)

These properties follow directly from the definitions (4.1) and (4.2).

Symmetry

{x_{-m}} ↔ {X_{-k}}.    (4.9)

This can be seen by noting that the transform A_k of x_{-m} is given by

A_k = Σ_{m=0}^{N-1} x_{-m} W^{mk} = Σ_{m=0}^{N-1} x_m W^{-mk} = X_{-k}.    (4.10)

Time Shifting

{x_{m+l}} ↔ {W^{-lk} X_k}.    (4.11)

This property is established by computing the DFT A_k of x_{m+l}

A_k = Σ_{m=0}^{N-1} x_{m+l} W^{mk} = Σ_{m=0}^{N-1} x_{m+l} W^{(m+l-l)k}.    (4.12)

Hence

A_k = W^{-lk} X_k.    (4.13)

Frequency Shifting

{W^{lm} x_m} ↔ {X_{k+l}}.    (4.14)

This property follows directly from the proof given for the time shifting property
by replacing the direct DFT with an inverse DFT.

DFT of a Permuted Sequence

{x_{pm}} ↔ {X_{qk}},   qp ≡ 1 modulo N.    (4.15)

We assume that the sequence x_m is permuted, with m replaced by pm modulo N,
where p is an integer relatively prime to N. The DFT A_k of x_{pm} is given by

A_k = Σ_{m=0}^{N-1} x_{pm} W^{mk}.    (4.16)

Since (p, N) = 1, we can find an integer q such that qp ≡ 1 modulo N. Equation
(4.16) is not changed if m is replaced by qm modulo N. We then have

A_k = Σ_{m=0}^{N-1} x_{pqm} W^{qmk} = X_{qk}.    (4.17)

Correlation of Real Sequences

(4.18)

Since a correlation is derived from a convolution by inverting one of the input
sequences, the DFT convolution property implies that the transform of the
correlation of the two sequences h_n and x_m is obtained by evaluating the DFT of
the convolution of h_{-n} with x_m. Hence

(4.19)

with

(4.20)

Since h_n is real, (4.20) therefore implies that H_{-k} = H*_k, where H*_k is the complex
conjugate of H_k.

Parseval's Theorem

Σ_{m=0}^{N-1} x_m^2 = (1/N) Σ_{k=0}^{N-1} |X_k|^2.    (4.21)

The Parseval theorem is a direct consequence of the correlation property because
Σ_{m=0}^{N-1} x_m^2 is the first term of the autocorrelation of x_m. Thus,

(4.22)

where |X_k| is the magnitude of X_k.

4.1.2 DFTs of Real Sequences

In many practical applications, the input sequence x_m is real. In this case, the
DFT X_k of x_m has special properties that can be found by rewriting (4.1) as

X_k = Σ_{m=0}^{N-1} x_m cos(2πmk/N) - j Σ_{m=0}^{N-1} x_m sin(2πmk/N).    (4.23)

Since x_m is real, (4.23) implies that the real part Re{X_k} of X_k is even and that the
imaginary part Im{X_k} of X_k is odd

Re{X_k} = Re{X_{-k}}    (4.24)

Im{X_k} = -Im{X_{-k}}    (4.25)

|X_k| = |X_{-k}|.    (4.26)

Similarly, when x_m is a pure imaginary sequence, we have

Re{X_k} = -Re{X_{-k}}   (4.27)

Im{X_k} = Im{X_{-k}}   (4.28)

|X_k| = |X_{-k}|.   (4.29)

These properties can be used to compute simultaneously the transforms X_k and X'_k of two real N-point sequences x_m and x'_m with a single complex DFT. This is done by evaluating the DFT Y_k of the sequence x_m + jx'_m, with

Y_k = \sum_{m=0}^{N-1} (x_m + jx'_m) W^{mk}   (4.30)

Y_k = X_k + jX'_k.   (4.31)

Hence

Re{Y_k} = Re{X_k} - Im{X'_k}   (4.32)

Im{Y_k} = Im{X_k} + Re{X'_k}.   (4.33)

Then, by using the symmetry property of pure real and pure imaginary sequences,

Re{X_k} = (Re{Y_k} + Re{Y_{-k}})/2   (4.34)

Im{X_k} = (Im{Y_k} - Im{Y_{-k}})/2   (4.35)

Re{X'_k} = (Im{Y_k} + Im{Y_{-k}})/2   (4.36)

Im{X'_k} = (Re{Y_{-k}} - Re{Y_k})/2.   (4.37)
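
As an illustration of (4.30)-(4.37), the short Python sketch below recovers the DFTs of two real sequences from a single complex FFT; numpy and the function name two_real_dfts are assumptions of this sketch, and np.fft.fft is used in place of the DFT defined by (4.1):

import numpy as np

def two_real_dfts(x, xp):
    # Pack the two real sequences as x + j x', transform once, and separate
    # the two spectra using the symmetry relations (4.34)-(4.37).
    N = len(x)
    y = np.fft.fft(x + 1j * xp)                  # Y_k of the packed sequence
    y_rev = y[(-np.arange(N)) % N]               # Y_{-k}, indices modulo N
    X  = (y + np.conj(y_rev)) / 2                # combines (4.34) and (4.35)
    Xp = (y - np.conj(y_rev)) / 2j               # combines (4.36) and (4.37)
    return X, Xp

x  = np.random.rand(8)
xp = np.random.rand(8)
X, Xp = two_real_dfts(x, xp)
assert np.allclose(X, np.fft.fft(x)) and np.allclose(Xp, np.fft.fft(xp))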

4.1.3 DFTs of Odd and Even Sequences

If x_m is a real, even sequence, x_m = x_{-m}. In this case, we have

\sum_{m=0}^{N-1} x_m \sin(2\pi mk/N) = \sum_{m=0}^{N/2-1} (x_m - x_{-m}) \sin(2\pi mk/N) = 0.   (4.38)

This implies that the DFT X_k of x_m is even and real, since

X_k = \sum_{m=0}^{N-1} x_m \cos(2\pi mk/N).   (4.39)

Similarly, if x_m is an odd, real sequence, with x_m = -x_{-m}, the DFT X_k is odd and imaginary.
An immediate consequence of these properties is that, if x_m is a conjugate symmetric sequence defined by x_m = x^*_{-m}, the DFT X_k of x_m is real. Similarly, if x_m is a conjugate antisymmetric sequence defined by x_m = -x^*_{-m}, the DFT X_k of x_m is imaginary. These properties can be used to compute the DFTs of two conjugate symmetric sequences in one transform step [4.5]. This is done by constructing an auxiliary sequence y_m which is derived from the two conjugate symmetric sequences x_m and x'_m by

y_m = x_m W^m + x'_m.   (4.40)

Then, from the shift theorem, the DFT Y_k of y_m is given by

Y_k = X_{k+1} + X'_k.   (4.41)

Since x_m and x'_m are conjugate symmetric sequences, their DFTs X_k and X'_k are real. This implies that

(4.42)

X_0 and X'_0 are computed directly with

X_0 = \sum_{m=0}^{N-1} x_m   (4.43)

X'_0 = \sum_{m=0}^{N-1} x'_m.   (4.44)

Then, all the other values of X_k and X'_k are obtained recursively by

(4.45)

(4.46)

Similar techniques can be used to compute the DFTs of two conjugate antisym-
metric sequences or of four real even or odd sequences in one transform step.

4.2 The Fast Fourier Transform Algorithm

We consider a DFT X_k of dimension N, where N is composite, with

X_k = \sum_{m=0}^{N-1} x_m W^{mk},   k = 0, ..., N - 1.   (4.47)

If N is the product of two factors, with N = N_1 N_2, we can redefine the indices m and k by

m = m_1 + N_1 m_2,   m_1 = 0, ..., N_1 - 1,   m_2 = 0, ..., N_2 - 1   (4.48)

k = N_2 k_1 + k_2,   k_1 = 0, ..., N_1 - 1,   k_2 = 0, ..., N_2 - 1.   (4.49)

Substituting (4.48) and (4.49) into (4.47) yields

X_{N_2 k_1 + k_2} = \sum_{m_1=0}^{N_1-1} \sum_{m_2=0}^{N_2-1} x_{m_1 + N_1 m_2} W^{(m_1 + N_1 m_2)(N_2 k_1 + k_2)}   (4.50)

X_{N_2 k_1 + k_2} = \sum_{m_1=0}^{N_1-1} W^{N_2 m_1 k_1} W^{m_1 k_2} \sum_{m_2=0}^{N_2-1} x_{m_1 + N_1 m_2} W^{N_1 m_2 k_2},   (4.51)

which shows that the DFT of length N_1 N_2 can be viewed as a DFT of size N_1 × N_2, except for the introduction of the twiddle factors W^{m_1 k_2}. Thus, the computation of X_k by (4.51) is done in three steps, with the first step corresponding to the evaluation of the N_1 DFTs Y_{m_1,k_2} corresponding to the N_1 distinct values of m_1

Y_{m_1,k_2} = \sum_{m_2=0}^{N_2-1} x_{m_1 + N_1 m_2} W^{N_1 m_2 k_2}.   (4.52)

Y_{m_1,k_2} is then multiplied by the twiddle factors W^{m_1 k_2} and X_k is obtained by calculating N_2 DFTs of N_1 points on the N_2 input sequences Y_{m_1,k_2} W^{m_1 k_2}, with

X_{N_2 k_1 + k_2} = \sum_{m_1=0}^{N_1-1} [Y_{m_1,k_2} W^{m_1 k_2}] W^{N_2 m_1 k_1}.   (4.53)

Note that the computation procedure could have been organized in reverse order, with the multiplications by the twiddle factors preceding the evaluation of the first DFTs instead of being done after the calculation of these DFTs. In this case,

X_{N_2 k_1 + k_2} = \sum_{m_2=0}^{N_2-1} W^{N_1 m_2 k_2} \sum_{m_1=0}^{N_1-1} [x_{m_1 + N_1 m_2} W^{m_1 k_2}] W^{N_2 m_1 k_1}.   (4.54)

Hence there are generally two different forms of the FFT algorithm, each being equivalent in terms of computational complexity. It should be noted that, in both procedures, the order of the input and output row-column indices is permuted. Thus, while the input sequence can be viewed as N_2 polynomials of N_1 terms, the output sequence is organized as N_1 polynomials of N_2 terms. This implies that a permutation step must be added at the end of the three basic steps described above to complete the FFT procedure.
The FFT algorithm derives its efficiency by replacing the computation of one large DFT with that of several smaller DFTs. Since the number of operations required to directly compute an N-point DFT is proportional to N^2, the number of operations decreases rapidly when the computation structure is partitioned into that of many small DFTs. In the simple case of a DFT of length N_1 N_2, the direct computation would require N_1^2 N_2^2 multiplications. If we now evaluate this DFT by the FFT algorithm corresponding to (4.52) and (4.53), the computation breaks down into that of N_1 DFTs of N_2 terms, N_2 DFTs of N_1 terms plus N_1 N_2 multiplications by twiddle factors. Thus, the number M of multiplications required to evaluate the DFT of N_1 N_2 points with this simple two-factor algorithm reduces to

M = N_1 N_2 (N_1 + N_2 + 1),   (4.55)

which is obviously less than N_1^2 N_2^2. In practice, the FFT algorithm is extremely powerful because the procedure can be used iteratively when N is highly composite. In such cases, and with the two-factor decomposition discussed above, N_1 and N_2 are composite and the DFTs of lengths N_1 and N_2 are again computed by an FFT procedure. With this approach, each stage provides an additional reduction in the number of operations, so that the algorithm is most efficient when N is highly composite. This feature, together with the need for a regular computational structure, motivates application of the FFT algorithm to DFT lengths which are a power of an integer. In most cases, N is chosen to be a power of two, and this was the original form of the FFT algorithm [4.6].

4.2.1 The Radix-2 FFT Algorithm

We now consider the DFT X_k of an N-point sequence x_m, with N = 2^t. In this case, the first stage of the FFT can be defined by choosing N_1 = 2 and N_2 = 2^{t-1}. This is equivalent to splitting the N-point input sequence x_m into two (N/2)-point sequences x_{2m} and x_{2m+1} corresponding, respectively, to the even and odd samples of x_m. Under these conditions, X_k becomes

X_k = \sum_{m=0}^{N/2-1} x_{2m} W^{2mk} + W^k \sum_{m=0}^{N/2-1} x_{2m+1} W^{2mk}   (4.56)

and, since W^{N/2} = -1,

X_{k+N/2} = \sum_{m=0}^{N/2-1} x_{2m} W^{2mk} - W^k \sum_{m=0}^{N/2-1} x_{2m+1} W^{2mk},   k = 0, ..., N/2 - 1.   (4.57)

With this approach, called decimation in time, the computation of an N-point DFT is replaced by that of two DFTs of length N/2 plus N additions and N/2 multiplications by W^k. The same procedure can be applied again to replace the two DFTs of length N/2 by 4 DFTs of length N/4 at the cost of N additions and N/2 multiplications. A systematic application of this method computes the DFT of length 2^t in t = log_2 N stages, each stage converting 2^i DFTs of length 2^{t-i} into 2^{i+1} DFTs of length 2^{t-i-1} at the cost of N additions and N/2 multiplications.

Consequently, the number of complex multiplications M and complex additions A required to compute a DFT of length N by the radix-2 FFT algorithm is

M = (N/2) log_2 N   (4.58)

A = N log_2 N.   (4.59)

The decimation in time approach is illustrated in Fig. 4.1 for an 8-point DFT. In this signal flow graph, each node represents a variable and each arrow terminating at a node represents the additive contribution of the variable at the originating node of the arrow. Multiplications by a constant are represented by the constant written near the arrowhead.

Fig. 4.1. Decimation in time FFT signal flow graph. N = 8
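
To make the decimation in time recursion concrete, the following Python sketch implements (4.56) and (4.57) directly for N a power of two. It is a minimal illustration rather than an optimized program, it assumes the convention W = exp(-2πj/N), and the name fft_dit is not taken from the text:

import cmath

def fft_dit(x):
    # Recursive radix-2 decimation in time FFT: split x into its even and odd
    # samples, transform each half, and combine with the twiddle factors W^k.
    N = len(x)
    if N == 1:
        return list(x)
    even = fft_dit(x[0::2])                      # DFT of x_{2m}
    odd  = fft_dit(x[1::2])                      # DFT of x_{2m+1}
    X = [0] * N
    for k in range(N // 2):
        Wk = cmath.exp(-2j * cmath.pi * k / N)   # twiddle factor W^k
        X[k]          = even[k] + Wk * odd[k]    # (4.56)
        X[k + N // 2] = even[k] - Wk * odd[k]    # (4.57)
    return X

X = fft_dit([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # 8-point example

For N = 2^t this performs (N/2) log_2 N complex multiplications, in agreement with (4.58).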
A second form of the FFT algorithm can be obtained by simply splitting the N-point input sequence x_m into two (N/2)-point sequences x_m and x_{m+N/2} corresponding, respectively, to the N/2 first samples and to the N/2 last samples of x_m. With this approach, called decimation in frequency, X_k becomes

X_k = \sum_{m=0}^{N/2-1} (x_m + W^{Nk/2} x_{m+N/2}) W^{mk}.   (4.60)

We now compute the even- and odd-numbered samples of X_k separately. For k even, replacing k by 2k, we obtain

X_{2k} = \sum_{m=0}^{N/2-1} (x_m + x_{m+N/2}) W^{2mk},   k = 0, ..., N/2 - 1.   (4.61)

Replacing k by 2k + 1 for k odd, we get

X_{2k+1} = \sum_{m=0}^{N/2-1} [(x_m - x_{m+N/2}) W^m] W^{2mk},   k = 0, ..., N/2 - 1.   (4.62)

Thus X_k is computed by (4.61) and (4.62) in terms of two DFTs of length N/2, but with a premultiplication by W^m of the input sequence in (4.62). Therefore, the computation of a DFT of N terms is replaced by that of two DFTs of N/2 terms at the cost of N complex additions and N/2 complex multiplications. As with the decimation in time algorithm, the same procedure can be used recursively to compute the DFT in log_2 N stages, each stage converting 2^i DFTs of length 2^{t-i} into 2^{i+1} DFTs of length 2^{t-i-1} at the cost of N additions and N/2 multiplications. This means that the decimation in frequency algorithm requires the same number of operations as the decimation in time algorithm. The computation structure for decimation in frequency is shown in Fig. 4.2 for N = 8. It can be seen that the flow graph has the same geometry as the decimation in time flow graph, but different coefficients.
Since the FFT algorithm computes a DFT with N log N operations instead of N^2 for the direct approach, the practical reduction of the computation load can be very large. In the case of a 1024-point DFT, for instance, we have N = 2^{10} and the direct computation requires 2^{20} complex multiplications. On the other hand, the FFT algorithm computes the same DFT with only 5·2^{10} complex multiplications, or about 200 times fewer multiplications. Significant additional reduction can be obtained by noting that a number of the multiplications are trivial multiplications by ±1 or ±j.

Fig. 4.2. Decimation in frequency signal flow graph. N = 8

In the case of a decimation in time algorithm, the twiddle factors in the first stage are given by W^{kN/2} = (-1)^k. Thus all multiplications in the first stage are trivial. The multiplications in the second stage are also trivial because the twiddle factors are then defined by W^{kN/4} = (-j)^k. In the following stages, the twiddle factors are given by W^{kN/8}, W^{kN/16}, ... and the number of trivial multiplications is N/4, N/8, .... Under these conditions, the number of nontrivial complex multiplications becomes (N/2)(-3 + log_2 N) + 2 and, if the complex multiplications are implemented with 4 real multiplications and 2 real additions, the numbers of real multiplications M and real additions A required to implement the radix-2 FFT algorithm are

M = 2N(-3 + log_2 N) + 8   (4.63)

A = 3N(-1 + log_2 N) + 4.   (4.64)

If the complex multiplications are implemented with 3 multiplications and 3 additions (Sect. 3.7.2), M and A become

M = (3N/2)(-3 + log_2 N) + 6   (4.65)

A = (N/2)(-9 + 7 log_2 N) + 6.   (4.66)

It can also be noted that, since W^{N/8} = (1 - j)/√2, the multiplications by W^{kN/8} can be implemented with 2 real multiplications and 2 real additions. Since the stages of order 3, 4, 5, ... use, respectively, N/4, N/8, ... such multiplications, we have a total of N/2 - 2 multiplications by W^{kN/8} and the total number of multiplications for the DFT reduces to (N/2)(-4 + log_2 N) + 4 complex multiplications plus N/2 - 2 multiplications by W^{kN/8}. Thus, when the complex multiplications are implemented with 4 real multiplications and 2 real additions, the numbers M and A of real operations are

M = N(-7 + 2 log_2 N) + 12   (4.67)

A = 3N(-1 + log_2 N) + 4   (4.68)

and, for the complex multiplication algorithm using 3 multiplications and 3 additions,

M = (N/2)(-10 + 3 log_2 N) + 8   (4.69)

A = (N/2)(-10 + 7 log_2 N) + 8.   (4.70)

Hence significant additional reduction is obtained when full use is made of the symmetries in the sine and cosine functions. In the case of a DFT of 1024 points, for instance, the straightforward computation by (4.56) would require 20·2^{10} real multiplications, as opposed to about 10·2^{10} real multiplications when the FFT is computed by the approach corresponding to (4.69).

4.2.2 The Radix-4 FFT Algorithm

We now turn our attention to a DFT of dimension N = 2^t with t even. In this case, the first stage of the FFT can be defined by choosing N_1 = 4 and N_2 = 2^{t-2}. This is equivalent to splitting the N-point input sequence x_m into the 4 sequences of N/4 points corresponding to x_{4m}, x_{4m+1}, x_{4m+2}, and x_{4m+3} for m = 0, ..., N/4 - 1. For this partition, X_k becomes

X_k = \sum_{l=0}^{3} W^{lk} \sum_{m=0}^{N/4-1} x_{4m+l} W^{4mk}   (4.71)

and, since W^{N/4} = -j,

X_{k+N/4} = \sum_{l=0}^{3} (-j)^l W^{lk} \sum_{m=0}^{N/4-1} x_{4m+l} W^{4mk}   (4.72)

X_{k+N/2} = \sum_{l=0}^{3} (-1)^l W^{lk} \sum_{m=0}^{N/4-1} x_{4m+l} W^{4mk}   (4.73)

X_{k+3N/4} = \sum_{l=0}^{3} j^l W^{lk} \sum_{m=0}^{N/4-1} x_{4m+l} W^{4mk},   k = 0, ..., N/4 - 1.   (4.74)

Hence this radix-4 decimation in time algorithm [4.7, 8] converts an N-point DFT into 4 DFTs of length N/4 at the cost of N complex multiplications by the twiddle factors W^{lk} and 3N complex additions for recombining the output samples of the DFTs of N/4 points. The same procedure can be applied recursively in t/2 stages, each stage reducing the dimensions of the DFTs by a factor of four. Thus, the DFT is computed by the radix-4 algorithm with (N/2) log_2 N complex multiplications and (3N/2) log_2 N complex additions. This number of operations is higher than with the radix-2 FFT algorithm. However, we now show that the computational complexity can be drastically reduced by exploiting the symmetries of the sine and cosine functions.
We observe first that, denoting the N/4-point DFTs by X_{l,k}, we can decrease the number of additions per stage to 2N instead of 3N, by the following computation procedure:

X_{l,k} = \sum_{m=0}^{N/4-1} x_{4m+l} W^{4mk},   l = 0, ..., 3   (4.75)

Y_{l,k} = W^{lk} X_{l,k}   (4.76)

X_k = (Y_{0,k} + Y_{2,k}) + (Y_{1,k} + Y_{3,k})   (4.77)

X_{k+N/4} = (Y_{0,k} - Y_{2,k}) - j(Y_{1,k} - Y_{3,k})   (4.78)

X_{k+N/2} = (Y_{0,k} + Y_{2,k}) - (Y_{1,k} + Y_{3,k})   (4.79)

X_{k+3N/4} = (Y_{0,k} - Y_{2,k}) + j(Y_{1,k} - Y_{3,k}).   (4.80)

With regard to multiplications, we note that the twiddle factors in the successive stages are given by W^{lk}, W^{4lk}, W^{16lk}, .... Thus, the twiddle factors take the values W^{lk4^i} for i = 0, 1, .... Each stage i splits the computation of 4^i DFTs of length N/4^i into that of 4^{i+1} DFTs of length N/4^{i+1}, with k = 0, ..., N/4^{i+1} - 1. Since the total number of multiplications by twiddle factors per stage is N, each stage divides into 4^i groups of twiddle factors W^{lk4^i}, with k = 0, ..., N/4^{i+1} - 1 and l = 0, ..., 3. For the last stage, the twiddle factors correspond to W^0 and are computed by trivial multiplications by ±1, ±j. For the other stages and l = 1, 3, the only simple multiplications correspond to k = 0 and k = N/(2·4^{i+1}). These cases correspond, respectively, to a multiplication by 1 and a multiplication by W^{lN/8} = [(1 - j)/√2]^l. For l = 0, we have W^{lk4^i} = 1. For l = 2, the multiplications by W^{lk4^i} are implemented with 2 trivial multiplications, 2 multiplications by an odd power of W^{N/8}, and (N/4^{i+1}) - 4 complex multiplications. Since we have 4^i groups per stage, this corresponds to (3N/4) - 8·4^i complex multiplications and 4^{i+1} multiplications by odd powers of W^{N/8} per stage. Moreover, N = 2^t = 4^{t/2}. We must therefore sum these numbers of multiplications over (t/2) - 1 stages. Thus, the number M_1 of nontrivial complex multiplications is given by

M_1 = (3N/8) log_2 N - (17/12)N + 8/3.   (4.81)

Similarly, the number M_2 of multiplications by odd powers of W^{N/8} is given by

M_2 = (N - 4)/3.   (4.82)

Under these conditions, if the complex multiplications are implemented with 4 real multiplications and 2 real additions, the numbers of real multiplications M and real additions A are

M = (3N/2) log_2 N - 5N + 8   (4.83)

A = (11N/4) log_2 N - (13N/6) + (8/3)   (4.84)

and, when the complex multiplications are implemented with 3 multiplications and 3 additions,

M = (9N/8) log_2 N - (43N/12) + (16/3)   (4.85)

A = (25N/8) log_2 N - (43N/12) + (16/3).   (4.86)

Thus, the radix-4 algorithm significantly reduces the number of operations in


comparison with the radix-2 algorithm. This is shown in Table 4.1 which gives
the number of real operations for DFTs computed by radix-2 and radix-4 FFT
algorithms corresponding to (4.69, 70) and (4.85, 86), respectively. It can be seen
that the radix-4 algorithm reduces the number of multiplications to a level about
25% below that of the radix-2 algorithm, while the number of additions is approximately the same. Slight additional improvement can also be obtained by using radix-8 or radix-16 algorithms [4.7, 8]. When N is not a power of a single
radix, one is prompted to use a mixed-radix approach. A DFT of 32 points, for
example, could be computed by a two-stage radix-4 decomposition followed by

Table 4.1. Number of nontrivial real operations for radix-2 and radix-4 FFTs where the complex multiplications are implemented with 3 real multiplications and 3 real additions and where the symmetries of the trigonometric functions are fully used

                  Radix-2 FFT                         Radix-4 FFT
DFT size N        Number of         Number of         Number of         Number of
                  multiplications   additions         multiplications   additions

4 0 16 0 16
16 24 152 20 148
64 264 1032 208 976
256 1800 5896 1392 5488
1024 10248 30728 7856 28336
4096 53256 151560 40624 138928

a one-stage radix-2 FFT. When properly designed, such mixed radix methods
can be optimum from the standpoint of the number of arithmetic operations,
but the additional computational savings are achieved at the expense of a some-
what more complex implementation.

4.2.3 Implementation of FFT Algorithms

The FFT algorithm may be organized in a variety of different ways as a function


of the order in which data are accessed and stored and the implementation of the
twiddle factors in the computation structure. In order to illustrate the implemen-
tation of an actual FFT, we consider here the radix-2 decimation in time algo-
rithm depicted previously in Fig. 4.1.
It is apparent that the computation proceeds in t stages, denoted by i, with i = 1, 2, ..., t, and that each stage must compute the N/2 butterfly operations

x_l^i = x_l^{i-1} + W^d x_{l+N/2^i}^{i-1}   (4.87)

x_{l+N/2^i}^i = x_l^{i-1} - W^d x_{l+N/2^i}^{i-1},   (4.88)

where x_l^{i-1} and x_l^i are, respectively, the input and output data samples corresponding to the ith stage. Since the input and output samples in (4.87, 88) have the same indices, the computation may be executed in place, by writing the output results over the input data. Thus, the FFT may be implemented with only N complex storage locations, plus auxiliary storage registers to support the butterfly computation.
The complex value of W^d as a function of index l and stage i can be determined by using a bit-reversal method. This is done by writing l as a t-bit binary number, scaling this number by t - i bits to the right, and reversing the order of the bits. Thus, if we consider, for instance, the node x_5^2 corresponding to the second stage of the 8-point DFT illustrated in Fig. 4.1, we have l = 5 and i = 2. In this

°and1 0.weFinally ° °°
case, I is 1 Olin binary notation. Scaling by t - i = 1 bit to the right yields
d is obtained by reversing 1 0, which gives also 1 or integer 2,
x~ = x~ + W 2 xi.
have
The bit-reversal process can also be implemented very simply by counting
in bit-reversed notation. For an 8-point DFT, a conventional 3-bit counter
yields the successive integers 0, 1,2,3,4,5,6,7. If the counter bit positions are
reversed, we have 0, 4, 2,6, 1,5,3,7, which gives the one-to-one correspondence
between the natural order sequence and the bit-reversed order sequence.
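The following short Python function illustrates the bit-reversed counting just described; the name bit_reverse and the direct loop are illustrative choices, not part of the text:

def bit_reverse(k, t):
    # Reverse the t-bit binary representation of k.
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

t = 3                                               # 8-point DFT
print([bit_reverse(k, t) for k in range(2 ** t)])   # [0, 4, 2, 6, 1, 5, 3, 7]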
The coefficients W^d may also be computed, in each stage, via a recursion formula, with

(4.89)

These coefficients may be precomputed and stored for each stage in order to
save computation time at the expense of increased memory.
The algorithm illustrated in Fig. 4.1 produces the DFT output samples Xk
in bit-reversed order. Thus, these samples must usually be reordered at the end
of the computation by performing a bit-reversal operation on the indices k. We
shall see in Sect. 4.6, however, that this operation is unnecessary when the DFTs
are used to compute convolutions.
The foregoing considerations also apply generally to algorithms using
radices greater than 2, and a Fortran program for these FFT forms can be found
in [4.8]. In practice, there are many variations of the basic FFT algorithm which
correspond to different trade-offs between speed of execution and memory re-
quirements. It is possible, for instance, to devise schemes with identical geometry
from stage to stage or with input data and output data in natural order. When
the FFT is programmed in a high-level language with sophisticated functions for
the manipulation of arrays such as APL [4.9], the implementation can be
strikingly simple. This is well illustrated in the radix-2 FFT program designed
by McAuliffe and reprinted in Fig. 4.3 with the kind permission of the author.

V Z+TF33 A,K,M,W,O,P,Q,R,S,V,N
[1] W+ 2 1 o.OO(.~(2.P)p(PpV+O).-O-V[-O-2xtP1HP.OpO+t1.0pS+2.N.OpR+
(M+1)p2.0pZ+A[,V+.(~tM)~«K+M+1+2.P+O.5xN)p2)ptN+-1tpA]
[2] +(O<K+K-1)/2.0pW+W[,.~(2.P)pIP]+O.OpZ+Sp(-/[O] WxZ).+/[O] WxeZ+S
p«O+K).«-K)~O.Mp1)/IM+1)~Rp( .+/[K+O] Z) •• -/[K+O] Z+RpZ
v

Fig. 4.3. Radix-2 FFT program written in APL.

This APL program uses just two instructions, the first one for generating
coefficient values and the second for performing the actual data computation.
In this program, the N-point DFT is computed by executing

Z ← TF33 A,   (4.90)

where TF33 is the name of the FFT subroutine and A is an array of 2 lines and N columns, the first line representing the real part of the input sequence and the second line representing the imaginary part of the input sequence. The output sequence is given by the array Z, which has the same structure as the input data array A. The reader is cautioned, however, to note that this program actually computes an inverse DFT rather than the direct DFT as defined by (4.1).
We give the execution times for various DFT lengths computed with this program on an IBM 370/168 computer operating on APL under VM370 in Table 4.2. These figures can be compared with the execution times for direct computation of the same DFTs in APL with the same system. It can be seen that the reduction in arithmetic load made possible by the FFT algorithm does translate into a comparable reduction in execution time. This is quite apparent for large DFTs and, for instance, a 1024-point DFT is calculated in only 791 ms via the FFT program, as opposed to 165335 ms for direct computation.

Table 4.2. Comparative execution times in milliseconds for DFTs computed by the FFT program of Fig. 4.3 and by direct computation. IBM 370/168 - APL VM370

Execution times (CPU time) [ms]

DFT size N        Radix-2 FFT        DFT

4 17 16
8 24 32
16 32 80
32 43 234
64 63 776
128 109 2840
256 188 10765
512 368 41708
1024 791 165335

4.2.4 Quantization Effects in the FFT

Since the FFT is implemented with finite precision arithmetic, the results of the
computation are affected by the roundoff noise incurred in the butterfly calcula-
tions, the scaling of the data, and the approximate representation of the coeffi-
cients Wd. These effects have been studied for fixed point and floating point com-
putations [4.10-12]. We shall restrict our discussion here solely to fixed point
radix-2 FFT algorithms.
Consider first the impact of scaling. At each stage, we must compute the butterflies

x_l^i = x_l^{i-1} + W^d x_{l+N/2^i}^{i-1}   (4.91)

x_{l+N/2^i}^i = x_l^{i-1} - W^d x_{l+N/2^i}^{i-1}.   (4.92)

Thus, the magnitude of the signal samples tends to increase at each stage, the upper bound on the modulus of x_l^i being given by

Max |x_l^i| ≤ 2 Max (|x_l^{i-1}|, |x_{l+N/2^i}^{i-1}|).   (4.93)

Hence the signal magnitude increases by a maximum of one bit at each stage and
a scaling procedure is needed to avoid overflow. An especially efficient scaling
procedure would be to compute each stage without scaling, then to scale the
entire sequence by one bit, only if an overflow is detected. Alternatively, a
simpler, but less efficient method based upon systematic scaling by one bit at
each stage can also be employed. In this case, the implementation is simple, but
sUboptimum. Nevertheless, an evaluation of the quantization effects using this
simple scheme provides an upper bound on quantization noise. Thus, in the
following analytical development, we shall assume that the data is scaled by one
bit at each stage.
It is well known [4.1] that if the product of two B-bit numbers is rounded to B bits, the error variance is given by

σ^2 = 2^{-2B}/12.   (4.94)

Moreover, when two B-bit numbers are added together, the sum may be a
(B + I)-bit number. Thus, when there is an overflow, the sum must be scaled by
1/2 and one bit is lost. The variance of the corresponding error is

(4.95)

We shall now assume that errors are uncorrelated and that an overflow occurs at each stage. Since the data input at the first stage of the transform is scaled by 1/2, the variance V(x_l^0) of x_l^0 is given by

(4.96)

The first stage computes a set of N data samples with multiplications by ± 1


and additions. Furthermore, the output samples x! from this first stage must be
scaled by 1/2. Hence

(4.97)

where the factor of 4 accounts for the fact that the error caused by scaling at the
first stage is twice the error at the zeroth stage. Similarly, the second stage im-
plements multiplications by only ± 1, ±j and we have

(4.98)

(4.99)

In the third stage, half the butterfly operations are nontrivial. For these, we have

(4.100)

which yields

V(x_l^3) = V(x_l^2) + (\overline{Re^2{x_l^2}} + \overline{Im^2{x_l^2}}) V(W^d) + (\overline{Re^2{W^d}} + \overline{Im^2{W^d}}) V(x_l^2) + 4^3 σ^2 + 4^3·6σ^2,   (4.101)

where the bars over the symbols represent here an average over the sequence.
Thus,

(4.102)

where the first term in (4.102) is the variance of the first term in (4.100) and the two next terms in (4.102) correspond to the complex multiplication. The terms 4^3 σ^2 derive from the rounding after addition and 4^3·6σ^2 corresponds to rescaling. We now define λ as the average squared modulus of the input sequence

(4.103)

Since λ increases by a factor of two at each stage, we have

(4.104)

However, since the second and third terms in (4.104) appear only when the multiplications are nontrivial and, since half the multiplications in the third stage are trivial, V(x_l^3) reduces to

(4.105)

(4.106)

and, assuming a similar computation procedure for all other stages, we have, for
the last stage,

(4.107)

Since the mean square of the absolute values of the output sequence X_k (we delete here our usual bar sign on transforms in order to avoid confusion with averaging) is 2^t λ, the ratio of rms noise output to rms signal output is, for large DFTs,

rms (error)/rms (signal) = √N 2^{-B} (0.3) √8 / rms (input),   (4.108)

which demonstrates that the error-to-signal ratio of the FFT process increases as √N, or 1/2 bit per stage.
Another source of error is due to the use of truncated coefficients. Wein-
stein [4.12] has shown, by a simplified statistical analysis, that this effect tran-
slates into an error-to-signal ratio which increases very slowly with N. Experi-
mental results have tended to confirm this analysis result.

4.3 The Rader-Brenner FFT

The evaluation of DFTs by the conventional FFT algorithm requires complex multiplications. We shall show now that a simple modification of the FFT algorithm replaces these complex multiplications by multiplications of a complex number by either a pure real or a pure imaginary number [4.13]. This is realized by computing an N-point DFT, with N = 2^t,

X_k = \sum_{m=0}^{N-1} x_m W^{mk},   k = 0, ..., N - 1,   (4.109)

via a decimation in frequency radix-2 FFT form, which for k even, yields

X_{2k} = \sum_{m=0}^{N/2-1} (x_m + x_{m+N/2}) W^{2mk},   k = 0, ..., N/2 - 1,   (4.110)

where k is replaced by 2k. For k odd, replacing k by 2k + 1, yields

X_{2k+1} = \sum_{m=0}^{N/2-1} [(x_m - x_{m+N/2}) W^m] W^{2mk},   k = 0, ..., N/2 - 1.   (4.111)

Thus, the first stage of the decimation in frequency FFT decomposition replaces one DFT of length N by two DFTs of length N/2 at the cost of N complex additions and N/2 complex multiplications. In order to simplify the calculation of the DFT X_{2k+1}, we define the (N/2)-point auxiliary sequence a_m by

a_m = (x_m - x_{m+N/2})/[2 cos(2πm/N)],   m ≠ 0, N/4
a_0 = 0,   a_{N/4} = 0.   (4.112)

We then compute the (N/2)-point DFT A_k of a_m

A_k = \sum_{m=0}^{N/2-1} a_m W^{2mk},   k = 0, ..., N/2 - 1.   (4.113)

X_{2k+1} can be recovered from A_k by noting that

A_k + A_{k+1} = \sum_{m=0}^{N/2-1} a_m (1 + W^{2m}) W^{2mk}   (4.114)

or

A_k + A_{k+1} = \sum_{m \neq 0, N/4} (x_m - x_{m+N/2}) W^{(2k+1)m}.   (4.115)

And, since W^{N/4} = -j,

X_{2k+1} = A_k + A_{k+1} + V_0   for k even
X_{2k+1} = A_k + A_{k+1} + V_1   for k odd   (4.116)

with

V_0 = (x_0 - x_{N/2}) - j(x_{N/4} - x_{3N/4})   (4.117)

V_1 = (x_0 - x_{N/2}) + j(x_{N/4} - x_{3N/4}).   (4.118)

Under these conditions, the N/2 complex multiplications by the twiddle factors W^m in the first stage are replaced with (N/2) - 2 multiplications by the pure real numbers 1/[2 cos(2πm/N)]. Note here that the contributions of x_0 - x_{N/2} and x_{N/4} - x_{3N/4} must be treated separately, because cos(2πm/N) = 0 for m = N/4. The same method is used recursively to compute the (N/2)-point transforms X_{2k} and A_k, and then the transforms of dimensions N/4, N/8, ... until complete decomposition is achieved.
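
A numerical check of the first Rader-Brenner stage may help clarify the construction. The Python sketch below (numpy assumed; the name rader_brenner_odd is not from the text) computes the odd-indexed outputs X_{2k+1} through the auxiliary sequence (4.112), the DFT (4.113), and the recombination (4.116)-(4.118), and compares them with a direct FFT:

import numpy as np

def rader_brenner_odd(x):
    # Odd-indexed DFT outputs via the Rader-Brenner auxiliary sequence; all
    # multiplications involve the real factors 1/[2 cos(2*pi*m/N)].
    N = len(x)
    m = np.arange(N // 2)
    b = x[:N // 2] - x[N // 2:]                          # x_m - x_{m+N/2}
    a = np.zeros(N // 2, dtype=complex)
    nz = (m != 0) & (m != N // 4)
    a[nz] = b[nz] / (2 * np.cos(2 * np.pi * m[nz] / N))  # (4.112)
    A = np.fft.fft(a)                                     # (N/2)-point DFT, (4.113)
    k = np.arange(N // 2)
    V0 = b[0] - 1j * b[N // 4]                            # (4.117), k even
    V1 = b[0] + 1j * b[N // 4]                            # (4.118), k odd
    return A + np.roll(A, -1) + np.where(k % 2 == 0, V0, V1)   # (4.116)

x = np.random.rand(16) + 1j * np.random.rand(16)
assert np.allclose(rader_brenner_odd(x), np.fft.fft(x)[1::2])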
Since the multiplication of a complex number by a scalar value is implemented with two real multiplications, each stage is computed with N - 4 nontrivial real multiplications. We also need N complex additions for evaluating x_m + x_{m+N/2} and x_m - x_{m+N/2}, plus N + 2 complex additions for calculating (4.116-118). However, two complex additions are saved in the computation of A_k because a_0 = 0 and a_{N/4} = 0. Thus, for each stage, the numbers of real multiplications M and real additions A become

M = N - 4   (4.119)

A = 4N.   (4.120)

The two last stages of the decomposition correspond to transforms of dimensions 4 and 2 which are computed by the conventional FFT methods with trivial multiplications by ±1 and ±j. Moreover, the two preceding stages, corresponding, respectively, to DFTs of lengths 16 and 8, are also computed more efficiently by conventional methods such as a radix-4 algorithm (Sect. 4.2.2) or the Winograd algorithm [4.14]. Under these conditions, the number of real operations

Table 4.3. Number of nontrivial real operations for complex DFTs computed by the Rader-Brenner method

DFT size    Number of real      Number of real     Multiplications    Additions
N           multiplications     additions          per point          per point

8 4 52 0.50 6.50
16 20 148 1.25 9.25
32 68 424 2.12 13.25
64 196 1104 3.06 17.25
128 516 2720 4.03 21.25
256 1284 6464 5.02 25.25
512 3076 14976 6.01 29.25
1024 7172 34048 7.00 33.25
2048 16388 76288 8.00 37.25

for the DFTs of complex input sequences evaluated by the Rader-Brenner


method is given in Table 4.3. It can be seen, by comparison with Table 4.1, that
the Rader-Brenner technique reduces the number of multiplications over the
radix-2 and radix-4 FFT algorithms, while requiring about 10% more additions.
The same method may also be implemented in a decimation in time arrangement [4.13]. In this case, the premultiplications by 1/[2 cos(2πm/N)] are replaced by postmultiplications by 1/[2 cos(2πm/N)] and the computational complexity is the same as with the decimation in frequency approach. It should be noted that, for large transforms, cos(2πm/N) becomes very small for some values of m. Then, the multiplications by 1/[2 cos(2πm/N)] for these values introduce large errors. Cho and Temes [4.15], however, have proposed a modification of the basic Rader-Brenner algorithm to overcome this limitation.
In many instances, one needs only the odd output terms of a DFT. These
terms are generated by (4.111) and can be viewed as the modified DFT

Table 4.4. Number of nontrivial real operations for complex reduced DFTs computed by
the Rader-Brenner algorithm

Reduced DFT size     Number of real multiplications     Number of real additions

8 16 64
16 48 212
32 128 552
64 320 1360
128 768 3232
256 1792 7488
512 4096 17024
1024 9216 38144
Y_k = \sum_{m=0}^{N/2-1} y_m W^m W^{2mk},   k = 0, ..., N/2 - 1.   (4.121)

Such a modified DFT, which occurs naturally in the first stage of a decimation
in frequency FFT algorithm, is often called a reduced DFT [4.16] or an odd
DFT [4.17] and is used, for instance, in the computation of multidimensional
DFTs by polynomial transforms (Chap. 7). The Rader-Brenner algorithm
applies directly to the calculation of such reduced DFTs and we give, in Table
4.4, the number of nontrivial real operations for reduced DFTs computed via
this method.

4.4 Multidimensional FFTs

We consider first a two-dimensional DFT of size N_1 × N_2, with

X_{k_1,k_2} = \sum_{m_1=0}^{N_1-1} \sum_{m_2=0}^{N_2-1} x_{m_1,m_2} W_1^{m_1 k_1} W_2^{m_2 k_2},   k_1 = 0, ..., N_1 - 1,   k_2 = 0, ..., N_2 - 1,   (4.122)

with W_1 = exp(-2πj/N_1) and W_2 = exp(-2πj/N_2). In order to evaluate this DFT, we first rewrite (4.122) as

X_{k_1,k_2} = \sum_{m_1=0}^{N_1-1} W_1^{m_1 k_1} \sum_{m_2=0}^{N_2-1} x_{m_1,m_2} W_2^{m_2 k_2}.   (4.123)

As a first step, we evaluate the N_1 DFTs Y_{m_1,k_2} of length N_2 which correspond to the N_1 distinct values of m_1

Y_{m_1,k_2} = \sum_{m_2=0}^{N_2-1} x_{m_1,m_2} W_2^{m_2 k_2}.   (4.124)

X_{k_1,k_2} is then obtained by calculating N_2 DFTs of length N_1 on the N_2 sequences Y_{m_1,k_2} corresponding to the N_2 distinct values of k_2

X_{k_1,k_2} = \sum_{m_1=0}^{N_1-1} Y_{m_1,k_2} W_1^{m_1 k_1}.   (4.125)
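
A brief Python sketch of this two-step evaluation (the row-column method discussed next) is given below; numpy and the name dft_2d_row_column are assumptions of the example:

import numpy as np

def dft_2d_row_column(x):
    # Two-step evaluation of (4.124) and (4.125): DFTs of length N2 along one
    # axis, followed by DFTs of length N1 along the other.
    y = np.fft.fft(x, axis=1)        # N1 DFTs of N2 terms, (4.124)
    return np.fft.fft(y, axis=0)     # N2 DFTs of N1 terms, (4.125)

x = np.random.rand(4, 8) + 1j * np.random.rand(4, 8)
assert np.allclose(dft_2d_row_column(x), np.fft.fft2(x))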

This approach is often called the row-column method because it can be viewed as equivalent to organizing the input data into sets of row and column vectors in an array of size N_1 × N_2 and computing, in sequence, first the DFTs of the columns and then the DFTs of the rows. With this technique, the two-dimensional DFT is mapped, respectively, into N_1 DFTs of N_2 terms plus N_2 DFTs of N_1 terms. If N_1 and N_2 are powers of two, the one-dimensional DFTs of lengths N_1 and N_2 can be evaluated by an FFT-type algorithm. In the case of a simple radix-2 decomposition, the number of complex multiplications M becomes

M = (N_1 N_2 / 2) log_2 (N_1 N_2)   (4.126)

and, for a DFT of size N × N,

M = 2N M_1,   (4.127)

where M_1 is the number of multiplications required to compute a DFT of length N. The same method also applies to more than two dimensions, and a d-dimensional DFT of size N × N × N × ... is calculated with dN^{d-1} DFTs of length N so that the number of multiplications becomes

M = dN^{d-1} M_1   (4.128)

and, in particular,

M = (dN^d log_2 N)/2   (4.129)

when the DFTs of N terms are evaluated with a simple radix-2 FFT-type algorithm. We shall see in Chap. 7 that the multidimensional to one-dimensional DFT mapping obtained with the row-column method is suboptimal and that better methods can be devised by using polynomial transforms. In order to support a quantitative comparison of the computational complexities for the two methods, we present in Table 4.5 the number of real operations for various complex two-dimensional DFTs calculated by the row-column method and the Rader-Brenner algorithm.

Table 4.5. Number of nontrivial real operations for complex DFTs of size N x N computed
by the row-column method and the Rader-Brenner algorithm

N        Number of real      Number of real     Multiplications    Additions
         multiplications     additions          per point          per point

8 64 832 1.00 13.00


16 640 4736 2.50 18.50
32 4352 27136 4.25 26.50
64 25088 141312 6.12 34.50
128 132096 696320 8.06 42.50
256 657408 3309568 10.03 50.50
512 3149824 15335424 12.02 58.50
1024 14688256 69730304 14.01 66.50

4.5 The Bruun Algorithm

We shall now discuss an algorithm introduced by Bruun [4.18] which has both
theoretical and practical significance. The practical value of this algorithm
relates to the fact that the DFT of real data can be computed almost entirely
with real arithmetic, thereby simplifying the implementation of DFTs for real
data. We shall present here a modified version of the original algorithm which
will allow us to introduce a polynomial definition of the DFTs that will be used
in later parts of this book.
We consider again an N-point DFT, with N = 2^t

X_k = \sum_{m=0}^{N-1} x_m W^{mk},   k = 0, ..., N - 1,   j = √-1.   (4.130)

In order to develop the algorithm, we replace (4.130) with a polynomial representation of the DFT defined by the two following equations:

X(z) ≡ \sum_{m=0}^{N-1} x_m z^m modulo (z^N - 1)   (4.131)

X_k ≡ X(z) modulo (z - W^k).   (4.132)

Equations (4.131) and (4.132) are equivalent to (4.130) because the definition of (4.132) modulo (z - W^k) means that we can replace z by W^k in (4.131). At this point, the definition of (4.131) modulo (z^N - 1) is unnecessary. However, this definition is valid because z^N ≡ W^{kN} = 1. We note that the N roots of z^N - 1 are given by W^k for k = 0, ..., N - 1, with

z^N - 1 = \prod_{k=0}^{N-1} (z - W^k).   (4.133)

Moreover, since N = 2^t, we can express z^N - 1 as the product of two polynomials of N/2 terms in z, with

z^N - 1 = (z^{N/2} - 1)(z^{N/2} + 1)   (4.134)

and

z^{N/2} - 1 = \prod_{k_1=0}^{N/2-1} (z - W^{2k_1})   (4.135)

z^{N/2} + 1 = \prod_{k_1=0}^{N/2-1} (z - W^{2k_1+1}),   k_1 = 0, ..., N/2 - 1.   (4.136)


Hence, for k even, all the values of W^k correspond to the polynomial z^{N/2} - 1 and we can replace (4.131) and (4.132) with

X_1(z) = \sum_{m=0}^{N/2-1} (x_m + x_{m+N/2}) z^m ≡ X(z) modulo (z^{N/2} - 1)   (4.137)

X_k ≡ X_1(z) modulo (z - W^k),   k even.   (4.138)

Similarly, for k odd, all the values of W^k correspond to the polynomial z^{N/2} + 1 and X_k is computed by

X_2(z) = \sum_{m=0}^{N/2-1} (x_m - x_{m+N/2}) z^m ≡ X(z) modulo (z^{N/2} + 1)   (4.139)

X_k ≡ X_2(z) modulo (z - W^k),   k odd.   (4.140)

The form (4.137-140) can be easily recognized as equivalent to the first stage of a decimation in frequency FFT decomposition, since (4.137, 138) represent a DFT of N/2 terms while (4.139, 140) represent an odd DFT of N/2 terms. At this stage, we depart from the conventional FFT decomposition by noting that any polynomial of the form z^{4q} + az^{2q} + 1 factors into two real polynomials,

z^{4q} + az^{2q} + 1 = (z^{2q} + √(2 - a) z^q + 1)(z^{2q} - √(2 - a) z^q + 1).   (4.141)

This implies, therefore,

z^{N/2} + 1 = (z^{N/4} + √2 z^{N/8} + 1)(z^{N/4} - √2 z^{N/8} + 1)   (4.142)

with

z^{N/4} + √2 z^{N/8} + 1 = \prod_{k_1 ∈ B_1} (z - W^{2k_1+1})   (4.143)

z^{N/4} - √2 z^{N/8} + 1 = \prod_{k_1 ∈ B_2} (z - W^{2k_1+1}),   (4.144)

where B_1 is the set of N/4 values of k_1 such that W^{2k_1+1} is a root of z^{N/4} + √2 z^{N/8} + 1 and B_2 is the set of the N/4 other values of k_1. Under these conditions, the odd DFT represented by (4.139, 140) can be replaced by

X_3(z) ≡ X_2(z) modulo (z^{N/4} + √2 z^{N/8} + 1)   (4.145)

X_k ≡ X_3(z) modulo (z - W^{2k_1+1}),   k_1 ∈ B_1   (4.146)

and

X_4(z) ≡ X_2(z) modulo (z^{N/4} - √2 z^{N/8} + 1)   (4.147)

X_k ≡ X_4(z) modulo (z - W^{2k_1+1}),   k_1 ∈ B_2.   (4.148)

The same decomposition process can then be repeated by systematically expressing the polynomials of the form z^{4q} + az^{2q} + 1 as the products of two real polynomials of degree 2q, until the polynomials are reduced to degree 2. At this point, further decomposition as the product of two real polynomials is no longer possible and two DFT output terms are obtained for each degree-2 polynomial by replacing z with the two complex roots of the polynomial. This process is summarized in Fig. 4.4 for a DFT of 8 terms. In this diagram, each box represents a reduction modulo the polynomial indicated in the box.

Fig. 4.4. Computation of an 8-point DFT by Bruun's algorithm

We note that the first stage, which corresponds to the reductions modulo (z^{N/2} - 1) and modulo (z^{N/2} + 1), is computed by (4.137) and (4.139) with N complex additions. In the second stage, the reductions modulo (z^{N/4} - 1) and modulo (z^{N/4} + 1) are computed with N/2 complex additions. For the reductions modulo (z^{N/4} + √2 z^{N/8} + 1), we have z^{N/4} ≡ -√2 z^{N/8} - 1 and z^{3N/8} ≡ z^{N/8} + √2, and for the reductions modulo (z^{N/4} - √2 z^{N/8} + 1), z^{N/4} ≡ √2 z^{N/8} - 1 and z^{3N/8} ≡ z^{N/8} - √2. Since √2 is real, the complex multiplications are implemented with two real multiplications and the two reductions are implemented with N/2 real multiplications and 3N/2 additions. The second stage corresponds to a = 0 in (4.141). In the following stages, a takes successively the values ±√2, ±√(2 ± √2), ..., and the reductions proceed similarly, with multiplications by the real factors a and √(2 - a), the only difference with the second stage being that the multiplications by a are no longer trivial.
In the last stage, we have two reductions with trivial multiplications by ±1, ±j and two reductions with multiplications by powers of W^{N/8} which require 2 real multiplications for each reduction. The N/2 - 4 other reductions correspond to multiplications by W^k, W^{-k} which are implemented with 4 real multiplications and 8 real additions. Hence, with the exception of the last stage, all multiplications are done with real factors. In practice, the original algorithm proposed by Bruun uses aperiodic convolutions instead of reductions, and this original approach requires slightly fewer arithmetic operations than the method described here.
The principal use of the Bruun algorithm is in the calculation of the DFT for
real data sequences. In this case, since the coefficients in the t - 1 first stages are
real, these stages are implemented in real arithmetic. Moreover, since the reduc-
tions in the last stage correspond to multiplications of a real data sample by the
complex conjugate coefficients Wk and W-k, the operations in the last stage can
also be viewed as implemented in real arithmetic, with 2 real multiplications and
1 real addition for each reduction modulo (z - Wk) and modulo (z - W-k).
Thus, the Bruun algorithm provides a convenient way of computing the DFT of
a real data vector using only real arithmetic.
It should also be noted that the Bruun algorithm is closely related to the
Rader-Brenner algorithm, since

(z - W^k)(z - W^{-k}) = z^2 - 2z cos(2πk/N) + 1.   (4.149)

Hence the various coefficients used in the Bruun algorithm are identical to cor-
responding coefficients in the Rader-Brenner algorithm, and the main difference
relates to multiplications by Wk and W-k in the last stage.

4.6 FFT Computation of Convolutions

We have seen in Sect. 4.1 that the DFT has the convolution property. This means that the circular convolution y_l of two sequences h_n and x_m can be computed by evaluating the DFTs H_k and X_k of h_n and x_m, by multiplying, term by term, H_k by X_k, and by computing the inverse DFT of H_k X_k. Hence we have

y_l = DFT^{-1} {[DFT (h_n)] [DFT (x_m)]}   (4.150)

for the convolution

y_l = \sum_{n=0}^{N-1} h_n x_{l-n},   l = 0, ..., N - 1.   (4.151)

Since a DFT can be computed by the FFT algorithm, this method requires a number of operations proportional to N log N and, therefore, requires considerably less computation than the direct method. More precisely, if the DFTs are calculated via a simple radix-2 algorithm with one of the input sequences fixed, the circular convolution of length N, with N = 2^t, requires the computation of two FFTs and N complex multiplications. Consequently, the number of complex multiplications M required to evaluate the convolution is

M = N(1 + log_2 N).   (4.152)

For large convolutions, this is considerably less than the N^2 multiplications required for the direct computation of (4.151).
Frequently, one may wish to evaluate a real convolution by the FFT method. This can be done by computing the convolutions of two successive blocks simultaneously. Assuming that h_n is fixed, we compute the convolution of h_n with the two real N-point sequences x_m and x_{m+N} by first constructing the auxiliary sequence x_m + jx_{m+N}. The complex convolution of h_n with x_m + jx_{m+N} is then computed by DFTs to yield the complex convolution y_l + jy_{l+N}. Thus, the convolution of the first block with h_n is defined by the real part of the complex convolution, and the convolution corresponding to the second block by the imaginary part. With this method, the number of operations required to compute a real convolution is half that of a complex convolution.
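
A small Python sketch of this packing trick, under the same assumptions as the earlier examples (numpy; illustrative function name), is:

import numpy as np

def two_real_circular_convolutions(h, x1, x2):
    # Convolve a fixed real sequence h with two real blocks x1 and x2 at the
    # cost of one complex convolution: pack the blocks as x1 + j x2, convolve,
    # and read the two results off the real and imaginary parts.
    H = np.fft.fft(h)
    y = np.fft.ifft(H * np.fft.fft(x1 + 1j * x2))
    return y.real, y.imag

h, x1, x2 = np.random.rand(8), np.random.rand(8), np.random.rand(8)
y1, y2 = two_real_circular_convolutions(h, x1, x2)
ref = [sum(h[n] * x1[(l - n) % 8] for n in range(8)) for l in range(8)]
assert np.allclose(y1, ref)          # first block checked against (4.151)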
Tables 4.6 and 4.7 list the number of real operations corresponding to the
calculation of real one-dimensional and two-dimensional convolutions by FFTs,
using the Rader-Brenner algorithm. In these tables, we have assumed that one
of the sequences is fixed, that two real convolutions are computed for each complex DFT, and that complex multiplication is implemented with 3 real multiplications and 3 real additions. It can be verified easily that the FFT approach
reduces drastically the number of operations: for example, a real circular con-
volution of 1024 points computed by FFTs requires only 8708 multiplications,
as opposed to 1048576 multiplications for the direct method, or about 100 times
fewer multiplications.
When real convolutions are calculated by dedicated special-purpose FFT hardware, it is often desirable to compute the DFT and the inverse DFT in a single transform step with the same hardware, rather than evaluating two real

Table 4.6. Number of real operations for real circular convolutions computed by the Rader-
Brenner algorithm (2 real convolutions per DFT; one input sequence fixed)

Convolution size    Number of real      Number of real     Multiplications    Additions
N                   multiplications     additions          per point          per point

8 16 64 2.00 8.00
16 44 172 2.75 10.75
32 116 472 3.62 14.75
64 292 1200 4.56 18.75
128 708 2912 5.53 22.75
256 1668 6848 6.52 26.75
512 3844 15744 7.51 30.75
1024 8708 35584 8.50 34.75
2048 19460 79360 9.50 38.75

Table 4.7. Number of real operations for real circular convolutions of size N x N computed
by the Rader-Brenner algorithm. (2 real convolutions per DFT; one input sequence fixed)

N        Number of real      Number of real     Multiplications    Additions
         multiplications     additions          per point          per point

8 160 928 2.50 14.50


16 1024 5120 4.00 20.00
32 5888 28672 5.75 28.00
64 31232 147456 7.62 36.00
128 156672 720896 9.56 44.00
256 755712 3407872 11.53 52.00
512 3543040 15728640 13.52 60.00
1024 16261120 71303168 15.51 68.00

convolutions simultaneously with separate hardware for the DFT and the inverse
DFT. This can be accommodated using an approach, proposed by McAuliffe
[4.19], which is based on the computation of the DFTs of two real sequences in
a single complex DFT step.
We have already seen in Sect. 4.1 that the DFTs X_k and X'_k of two real N-point sequences x_m and x'_m can be evaluated as a single complex DFT by computing the DFT Y_k of the auxiliary sequence x_m + jx'_m. The sequences X_k and X'_k are then deduced from Y_k by

X_k = (Y_k + Y^*_{-k})/2   (4.153)

X'_k = (Y_k - Y^*_{-k})/2j,   (4.154)

where Y^*_{-k} is the complex conjugate of Y_{-k}. Following this procedure, the convolution y_l of the two real sequences x_m and h_n is computed as shown in Fig. 4.5.
Fig. 4.5. Computation of a real convolution in a single FFT step

The transform X_k of x_m is derived from Y_k by (4.153). X_k is then multiplied with H_k/N, where H_k is the DFT of h_n, and the real part and imaginary part of H_k X_k/N are added, thus yielding the sequence x'_k

x'_k = \frac{1}{2N} \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} h_n x_m [W^{(m+n)k} + W^{-(m+n)k} - jW^{(m+n)k} + jW^{-(m+n)k}].   (4.155)

The sequence x'_k is then used as the imaginary input to the FFT Y_k, and the transform X'_l of x'_k is obtained by (4.154). Hence

X'_l = \frac{1}{2N} \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} h_n x_m \sum_{k=0}^{N-1} [W^{(m+n+l)k} + W^{-(m+n-l)k} - jW^{(m+n+l)k} + jW^{-(m+n-l)k}].   (4.156)

The terms in the summation over W are different from zero only for m + n + l ≡ 0 modulo N and m + n - l ≡ 0 modulo N. Thus, we have

X'_l = \sum_{n=0}^{N-1} (h_n x_{-n-l} + h_n x_{l-n} - jh_n x_{-n-l} + jh_n x_{l-n})/2,   (4.157)

where a summation of the real and imaginary parts of X'_l obviously yields the convolution y_l. Clearly, one must account for the FFT computation delay in the process and, in practice, the imaginary input to the FFT hardware usually corresponds to the block x_{m-N}, while the real input corresponds to the block x_m. Hence real convolutions of dimension N can be computed with a single N-point FFT hardware structure.
It should also be noted that some simplifications of the FFT process are
possible when used to compute convolutions. In particular, when a DFT is
computed by an FFT algorithm, since either the input sequence or the output

sequence must be in bit-reversed order, some amount of computation is required


to reorder the sequence. However, when the FFT method is used to compute
convolutions, this requirement can be ignored because it is always possible to
organize the two direct FFTs to provide outputs in the bit-reversed order which
is a compatible input to an inverse FFT that produces its output sequence in
natural order.
5. Linear Filtering Computation of Discrete Fourier
Transforms

The FFT algorithm reduces drastically the number of arithmetic operations required to compute discrete Fourier transforms and is easily implemented on most existing computers. Thus, it is usually advantageous to compute linear filtering processes via the circular convolution property of the DFT with the FFT algorithm. Under these conditions, it would seem paradoxical to develop linear filtering algorithms for the computation of the DFT. This may explain why some algorithms which were introduced in 1968 by Bluestein [5.1, 2] and Rader [5.3] have long been regarded as a curiosity.
However, a number of recent developments have given an increased importance to the use of such linear filtering algorithms for the computation of DFTs. For real time execution, new devices, such as charge coupled devices (CCD) or acoustic surface wave devices (ASW), have been developed to implement fairly complex filters on a single chip and can be used as basic building blocks in the computation of DFTs. Moreover, new results in complexity theory have shown that some convolutions can indeed be computed more efficiently with linear filtering methods. This point has been clearly demonstrated by Winograd [5.4], who has introduced a fast DFT algorithm based on the nesting of small DFTs computed as convolutions. This algorithm, which is fundamentally different from the FFT, is, for a variety of vector lengths, more efficient than the FFT. In this chapter, we shall first discuss several basic algorithms that can be used to convert DFTs into convolutions: namely, the chirp z-transform algorithm and Rader's algorithm. We shall then show how large DFTs can be computed from a set of small DFTs by the Good prime factor technique and the Winograd Fourier transform algorithm.

5.1 The Chirp z-Transform Algorithm

Consider the DFT of a sequence x_n

X_k = \sum_{n=0}^{N-1} x_n W^{nk},   k = 0, ..., N - 1.   (5.1)

We now rearrange the exponents n and k by noting that

nk = [n^2 + k^2 - (k - n)^2]/2.   (5.2)

Thus, (5.1) becomes

X_k = W^{k^2/2} \sum_{n=0}^{N-1} x_n W^{n^2/2} W^{-(k-n)^2/2},   (5.3)

which shows that X_k may be computed by convolving the sequence x_n W^{n^2/2} with the sequence W^{-n^2/2}, and postmultiplying by W^{k^2/2} as indicated in Fig. 5.1. With this method, the DFT is computed by N complex premultiplications, N complex postmultiplications, and one complex finite impulse response (FIR) filter. The impulse response of the FIR filter is that of a chirp filter, well known in radar signal processing; hence the name chirp z-transform given to this DFT computation technique [5.1-3, 5].

Fig. 5.1. DFT computation using chirp filtering

Since we are evaluating an N-point DFT, we only need to compute N output terms of the chirp filter. Moreover, the indices n and k are defined modulo N and, for N even, W^{-(n+N)^2/2} = W^{-n^2/2}. Thus, for N even, the chirp filtering process can be regarded as an N-point circular convolution of complex sequences. For N odd, W^{-(n+N)^2/2} = -W^{-n^2/2}, so that the chirp filtering process corresponds to a circular convolution of size 2N where the N-point input sequence x_n W^{n^2/2} is augmented by appending N zeros and where only the first N output samples are computed.
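
As an illustration, the Python sketch below evaluates an even-length DFT through the chirp filtering of Fig. 5.1, with the chirp convolution carried out as an N-point circular convolution; numpy and the name chirp_dft are assumptions of this sketch:

import numpy as np

def chirp_dft(x):
    # Chirp z-transform evaluation of (5.3) for even N: premultiply by W^{n^2/2},
    # circularly convolve with the chirp W^{-n^2/2}, postmultiply by W^{k^2/2}.
    N = len(x)                                    # assumed even here
    n = np.arange(N)
    w_half = np.exp(-1j * np.pi * n * n / N)      # W^{n^2/2}
    pre = x * w_half                              # x_n W^{n^2/2}
    chirp = np.conj(w_half)                       # W^{-n^2/2}, N-periodic for N even
    conv = np.fft.ifft(np.fft.fft(pre) * np.fft.fft(chirp))   # circular convolution
    return w_half * conv                          # postmultiplication by W^{k^2/2}

x = np.random.rand(8) + 1j * np.random.rand(8)
assert np.allclose(chirp_dft(x), np.fft.fft(x))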
One of the significant points demonstrated by the chirp z-transform algo-
rithm is that the DFT may always be computed with a number of operations
proportional to N log N, even if N is not highly composite. This can be seen
by considering, for instance, the case of N even. Here, the circular convolution
of N points can always be computed as a circular convolution of length d, with
d ~ 2N - I, by using the overlap-add technique. If d is chosen to be a power
of 2, this augmented circular convolution can in turn be evaluated by FFTs
with a number of operations proportional to N log N.

5.1.1 Real Time Computation of Convolutions and DFTs Using the Chirp z-Transform

Relatively complex FIR filters may be implemented very efficiently on a single


chip either with CCD or with ASW devices. In these filters, the filter coefficient

tap values are determined by the geometry of electrodes photoengraved on the


chip. Thus, these devices, which operate at very high speed on sampled analog
signals, are well adapted to the implementation of filters with fixed tap values.
When CCD or ASW devices are used to compute the DFT, the chirp-filter
structure shown in Fig. 5.1 is generally employed, with complex multiplications
and complex convolutions implemented with real multiplications and real con-
volutions arranged in the conventional butterfly configuration. The filters are
integrated on one or several chips with off-chip multipliers for the premultipli-
cations and postmultiplications [5.6].
For filtering applications, the direct implementation of filters by CCD or
ASW devices is often unattractive because it can require a new chip design for
each filter design and it is not readily applicable to time-variant filters. Thus, it
is generally preferable to build digital illters with CCD or ASW implemented
Fourier transform circuits. In this case, some of the premultiplication and
postmultiplication circuits can be eliminated by combining the postmultiplica-
tion in one of the direct transforms with the premultiplication in the inverse
transform. This may be seen more precisely as follows.
We want to evaluate the circular convolution y_l of two sequences x_n and h_m

y_l = \sum_{n=0}^{N-1} x_n h_{l-n}.   (5.4)

This is done via the chirp z-transform X_k of x_n by (5.3) and the chirp z-transform H_k of h_m by

H_k = W^{k^2/2} \sum_{m=0}^{N-1} h_m W^{m^2/2} W^{-(k-m)^2/2}.   (5.5)

The convolution product y_l is obtained by calculating Y_k = H_k X_k and computing the inverse chirp z-transform of Y_k

y_l = (1/N) W^{-l^2/2} \sum_{k=0}^{N-1} Y_k W^{-k^2/2} W^{(l-k)^2/2}.   (5.6)

Note that in (5.6), Y_k is multiplied by W^{-k^2/2} while the postmultiplication of X_k in (5.3) is equivalent to multiplying Y_k by W^{k^2/2}. Thus, the postmultiplication in X_k and the premultiplication in (5.6) cancel and can be dropped.

5.1.2 Recursive Computation of the Chirp z-Transform

When a DFT is evaluated by the chirp z-transform technique, most of the computation occurs in the chirp filtering process. The z-transform H(z) of the impulse response of the chirp filter is given by

H(z) = \sum_{n=0}^{2N-1} W^{-n^2/2} z^{-n}.   (5.7)

We assume now that N is a perfect square, with N = N_1^2, and we change the index n with

n = n_1 + N_1 n_2,   n_1 = 0, ..., N_1 - 1,   n_2 = 0, ..., 2N_1 - 1.   (5.8)

Hence,

H(z) = \sum_{n_1=0}^{N_1-1} W^{-n_1^2/2} z^{-n_1} \sum_{n_2=0}^{2N_1-1} W^{-N_1 n_1 n_2} (-1)^{n_2} z^{-N_1 n_2},   (5.9)

which, in turn, implies

H(z) = \sum_{n_1=0}^{N_1-1} W^{-n_1^2/2} z^{-n_1} [(1 - z^{-2N})/(1 + W^{-N_1 n_1} z^{-N_1})].   (5.10)

Thus, H(z) may be implemented with a bank of N_1 filters corresponding to the N_1 distinct values of n_1, each being implemented with a premultiplication by W^{-n_1^2/2}, a delay of n_1 samples, and the recursive filter (1 - z^{-2N})/(1 + W^{-N_1 n_1} z^{-N_1}).

5.1.3 Factorizations in the Chirp Filter

We return now to the transversal filter form of the chirp z-transform shown in
Fig. 5.1. The tap coefficients of this filter are given by W^{-n^2/2} for n = 0, ..., N - 1. These N tap values cannot all be distinct because -n^2/2 is defined modulo N, and the congruence -n^2/2 ≡ a modulo N has no solution for certain values of a. Thus, the chirp filter can be implemented with fewer than N distinct multipliers
by adding, prior to multiplication, the data samples which correspond to the
same tap value. We note that the number of distinct taps is given by the number
of distinct quadratic residues modulo N [5.2, 7]. It is therefore possible to use the
results of Sect. 2.1.4 to find the number of distinct multipliers required to im-
plement a given chirp filter.
Consider first the case corresponding to N an odd prime, with N = p. Then, we know by Theorem 2.9 that the number of distinct quadratic residues is given by

Q(p) = 1 + (p - 1)/2.   (5.11)

If we eliminate the trivial zero solution, which corresponds to multiplication by 1, the number of nontrivial multipliers M reduces to

M = (p - 1)/2.   (5.12)

For N composite, the two following theorems can be used to find the number of distinct quadratic residues.
Theorem 5.1: If N is composite, with N = N_1 N_2 ... N_k and N_i = p_i^{c_i}, the p_i being distinct primes, the number Q(N) of quadratic residues modulo N is given by

Q(N) = Q(N_1) Q(N_2) ... Q(N_k).   (5.13)

This theorem is proved by using the Chinese remainder theorem. If a is a quadratic residue modulo N, it must also be a quadratic residue modulo the mutually prime factors of N. Since the representation given by the Chinese remainder theorem is unique, two distinct quadratic residues a and b must necessarily differ in at least one of their residues a_i, b_i. If N = N_1 N_2, we have Q(N_1) distinct quadratic residues modulo N_1 and Q(N_2) distinct quadratic residues modulo N_2. Therefore, we have Q(N_1) Q(N_2) distinct quadratic residues modulo N_1 N_2, and (5.13) follows by induction.
Theorem 5.2: For N = pC, the number of distinct quadratic residues is given by
Q(PC) = 1 + (pc+\ - pd)/2(p + 1) (5.14)

ifP is an odd prime, and

(5.15)

if p = 2. In (5.14) and (5.15), we have d = 0 if c is odd and d = 1 if c is even.


Proof of this theorem can be found in [5.8].
Thus, combining data samples prior to multiplying them can significantly
reduce the number of multiplications required to process the chirp filter. In the
case ofa OFT of 16 points, for instance, Q(16) = 4 so that the number of multi-
plications is reduced by a factor of 4 when direct computation is replaced by a
factorization.
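As a quick illustration (ours, not the book's), the tap count can be checked by enumerating the distinct values of n^2 modulo N:

```python
# Small check of the quadratic-residue counts quoted above (illustrative only).
def count_quadratic_residues(N):
    # number of distinct values of n^2 mod N, zero included
    return len({(n * n) % N for n in range(N)})

print(count_quadratic_residues(16))                      # 4, the 16-point DFT example
print(count_quadratic_residues(7), 1 + (7 - 1) // 2)     # both 4, matching (5.11)
```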

5.2 Rader's Algorithm

We have seen that any DFT can be converted into a convolution by the chirp z-transform algorithm at the cost of 2N complex multiplications performed on the input and output data samples. We shall see now that DFTs can also be converted into circular convolutions by an entirely different method initially introduced by Rader [5.3]. This method is, in some cases, computationally more efficient than the chirp z-transform algorithm because it replaces the premultiplications and postmultiplications in the chirp z-transform algorithm by a simple rearrangement of input and output data samples.
We consider first the simple case of a DFT of size N = p, p being an odd prime

\bar{X}_k = \sum_{n=0}^{p-1} x_n W^{nk} ,    k = 0, ..., p - 1 .    (5.16)

For k = 0, \bar{X}_k is computed by a simple summation

\bar{X}_0 = \sum_{n=0}^{p-1} x_n .    (5.17)

For k \neq 0, we have

\bar{X}_k = x_0 + \sum_{n=1}^{p-1} x_n W^{nk} .    (5.18)

The indices n and k are defined modulo p. We have seen in Sect. 2.1.3 that, if u is the set of integers 0, 1, ..., p - 2, there are always primitive roots g defined modulo p such that g^u modulo p takes once and only once all the values 1, 2, ..., p - 1 when u takes successively the values 0, 1, ..., p - 2. Thus, for n, k \neq 0, we can replace n and k by u and v defined by

n \equiv g^u modulo p
k \equiv g^v modulo p ,    u, v = 0, ..., p - 2 .    (5.19)

Under these conditions, (5.18) becomes

\bar{X}_{g^v} = x_0 + \sum_{u=0}^{p-2} x_{g^u} W^{g^{u+v}} ,    (5.20)

which shows that \bar{X}_{g^v} - x_0 is computed as a circular correlation of the permuted data sequence x_{g^u} with W^{g^u}, or equivalently as a (p - 1)-point circular convolution of the data sequence x_{g^{-u}} with W^{g^u}. Thus, for N an odd prime, most of the computation required to evaluate a DFT of N points reduces to a circular convolution of N - 1 points. The process is shown schematically in Fig. 5.2.
An obvious implication of Rader's algorithm is that a DFT of size p, where p is an odd prime, can be computed with a number of operations proportional to p log p if the circular convolution is calculated by an FFT algorithm. We shall see, however, in Sects. 5.3 and 5.4 that the major significance of Rader's algorithm is that it allows one to compute large DFTs very efficiently when it is combined with other techniques.
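A compact numerical sketch of this permutation is given below; it is illustrative only, the helper names primitive_root and rader_dft are not from the book, and the (p - 1)-point correlation is evaluated directly rather than by an FFT.

```python
import numpy as np

def primitive_root(p):
    # smallest g whose powers g^u mod p generate all of 1, ..., p-1
    for g in range(2, p):
        if len({pow(g, u, p) for u in range(p - 1)}) == p - 1:
            return g

def rader_dft(x):
    p = len(x)
    g = primitive_root(p)
    W = np.exp(-2j * np.pi / p)
    u = np.array([pow(g, e, p) for e in range(p - 1)])        # n = g^u mod p
    a = x[u]                                                  # permuted input
    w = W ** u                                                # permuted coefficients W^(g^u)
    X = np.empty(p, dtype=complex)
    X[0] = x.sum()                                            # (5.17)
    for v in range(p - 1):
        # circular correlation of the permuted data with the permuted coefficients, (5.20)
        X[pow(g, v, p)] = x[0] + np.sum(a * np.roll(w, -v))
    return X

x = np.arange(7, dtype=complex)
print(np.allclose(rader_dft(x), np.fft.fft(x)))               # expected: True
```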
Fig. 5.2. Computation of a p-point DFT by Rader's algorithm. p is an odd prime

We shall now extend Rader's algorithm to accommodate composite dimensions [5.4, 9].

5.2.1 Composite Algorithms

Let us now consider DFTs of size N = p^c, where p is an odd prime. We have seen in Sect. 2.1.3 that primitive roots g modulo p^c always exist and that these primitive roots are of order p^{c-1}(p - 1). Thus, we can expect to convert a DFT of dimension p^c into a circular convolution of length p^{c-1}(p - 1) plus some additional terms. To demonstrate this point, we first define a change of index

k = p k_1 + k_2 ,    k_1 = 0, ..., p^{c-1} - 1
                     k_2 = 0, ..., p - 1 .    (5.21)

Subsequently, for k_2 = 0, we have k \equiv 0 modulo p and \bar{X}_k becomes

\bar{X}_{p k_1} = \sum_{n=0}^{p^c-1} x_n W^{p n k_1} .    (5.22)

Since W^{p n k_1} defines n modulo p^{c-1}, we change index n to

n = p^{c-1} n_1 + n_2 ,    n_1 = 0, ..., p - 1
                           n_2 = 0, ..., p^{c-1} - 1 .    (5.23)

Thus, for k_2 = 0, \bar{X}_k becomes a DFT of p^{c-1} points

\bar{X}_{p k_1} = \sum_{n_2=0}^{p^{c-1}-1} \Big( \sum_{n_1=0}^{p-1} x_{p^{c-1} n_1 + n_2} \Big) W^{p n_2 k_1} .    (5.24)

Next, for k \not\equiv 0 modulo p, we compute separately the terms corresponding to n \equiv 0 modulo p and to n \not\equiv 0 modulo p,

\bar{X}_k = A_k + B_k ,    k \not\equiv 0 modulo p    (5.25)

A_k = \sum_{n \equiv 0 modulo p} x_n W^{nk} ,    k \not\equiv 0 modulo p    (5.26)

B_k = \sum_{n \not\equiv 0 modulo p} x_n W^{nk} ,    k \not\equiv 0 modulo p .    (5.27)

For n \equiv 0 modulo p,

n = p n_1 ,    n_1 = 0, ..., p^{c-1} - 1 .    (5.28)

Hence, by reordering index k, we have

k = p^{c-1} k_1 + k_2 ,    k_1 = 0, ..., p - 1
k_2 \not\equiv 0 modulo p ,    k_2 = 1, ..., p^{c-1} - 1    (5.29)

and

A_k = \sum_{n_1=0}^{p^{c-1}-1} x_{p n_1} W^{p n_1 k_2} .    (5.30)

Note that the right-hand side of (5.30) is independent of k_1. Thus, A_k is a DFT of size p^{c-1} in which the output terms corresponding to k_2 \equiv 0 are not computed.
We turn now our attention to B_k. Since n, k \not\equiv 0 modulo p, B_k is of length p^{c-1}(p - 1) and nk \not\equiv 0 modulo p. Thus, the indices n and k can be generated by a primitive root g defined modulo p^c with

n \equiv g^u modulo p^c
k \equiv g^v modulo p^c ,    u, v = 0, ..., [p^{c-1}(p - 1) - 1]    (5.31)

and, by substituting the indices defined by (5.31) into (5.27), we obtain the correlation of dimension p^{c-1}(p - 1)

B_{g^v} = \sum_{u=0}^{p^{c-1}(p-1)-1} x_{g^u} W^{g^{u+v}} .    (5.32)

Thus, the DFT of size p^c has been partitioned into two DFTs of size p^{c-1} and one correlation of length p^{c-1}(p - 1). The same method can be used recursively to convert the DFTs of size p^{c-1} into correlations. With this approach, a 9-point DFT is evaluated with a 3-point DFT and a 6-point convolution, plus a 3-point DFT where the first output term is not computed. When the 3-point DFTs are also reduced to correlations, the 9-point DFT is computed with 1 multiplication by W^0, 2 convolutions of 2 terms and one convolution of 6 terms.
When N is a power of two, the N-point DFT is partitioned into DFTs of size N/2 by the same method, and the DFT terms corresponding to n and k odd are computed as a correlation. However, there are no primitive roots for N > 4. Thus, for N > 4, one uses a product of roots (-1)^{n_1} 3^{n_2}, with n_1 = 0, 1 and n_2 = 0, ..., (N/4 - 1). These roots generate a two-dimensional correlation of size 2 x (N/4).

5.2.2 Polynomial Formulation of Rader's Algorithm

Reducing a DFT into a set of convolutions may become very complex when N is composite. We shall now introduce a polynomial representation of the DFT [5.10] which greatly simplifies the formulation of Rader's algorithm. We begin once again with the N-point DFT

\bar{X}_k = \sum_{n=0}^{N-1} x_n W^{nk} ,    k = 0, ..., N - 1,    W = e^{-2\pi j/N},    j = \sqrt{-1} .    (5.33)

In order to introduce a polynomial notation, we organize the N-point input sequence x_n as a polynomial X(z) of N terms, defined modulo (z^N - 1),

X(z) \equiv \sum_{n=0}^{N-1} x_n z^n modulo (z^N - 1) .    (5.34)

Then, (5.33) is replaced by

\bar{X}_k \equiv X(z) modulo (z - W^k) .    (5.35)

Note that (5.34) does not need to be defined modulo (z^N - 1), so that this reduction is superfluous at this stage. It is valid, however, since n is defined modulo N. Equation (5.35) implies that \bar{X}_k is obtained by substituting W^k for z in X(z). A simple inspection shows that (5.34, 35) are a valid alternate representation of (5.33).
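As a small check (ours, not the book's), reducing X(z) modulo (z - W^k) amounts to evaluating the polynomial at z = W^k, which reproduces the DFT:

```python
import numpy as np

# np.polyval expects the highest-degree coefficient first, hence the reversal of x.
N = 6
x = np.arange(N, dtype=complex)
W = np.exp(-2j * np.pi / N)
X_of_z = x[::-1]                        # coefficients of X(z) = sum_n x_n z^n
Xk = np.array([np.polyval(X_of_z, W**k) for k in range(N)])
print(np.allclose(Xk, np.fft.fft(x)))   # expected: True
```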
We suppose now that N is an odd prime, with N = p. Since the only divisors of p are 1 and p, z^p - 1 factors into two cyclotomic polynomials, with

z^p - 1 = (z - 1) P(z)    (5.36)

P(z) = z^{p-1} + z^{p-2} + ... + 1 .    (5.37)

Thus, the polynomial X(z) is completely determined by its residues modulo (z - 1) and modulo P(z)

X_1(z) \equiv X(z) modulo (z - 1)    (5.38)

X_2(z) \equiv X(z) modulo P(z) .    (5.39)

Note also that the roots of z^p - 1 are given by z = W^k for k = 0, ..., p - 1. Moreover, z = W^0 = 1 is the root of z - 1 and the p - 1 roots of P(z) are given by z = W^k for k \neq 0. Thus, we can compute \bar{X}_k by

\bar{X}_0 = X_1(z) \equiv X(z) modulo (z - 1)    (5.40)

\bar{X}_k \equiv X_2(z) modulo (z - W^k) ,    k \neq 0 .    (5.41)

X_2(z) is a polynomial of degree p - 2, since it is defined modulo P(z). Hence, X_2(z) can be expressed as

X_2(z) = \sum_{n=0}^{p-2} a_n z^n    (5.42)

with

a_n = x_n - x_{p-1} .    (5.43)

Thus, for k \neq 0, \bar{X}_k is a reduced DFT of p terms in which the last input sample is zero and the first output sample is not computed

\bar{X}_k = \sum_{n=0}^{p-2} a_n W^{nk} ,    k \neq 0 .    (5.44)

The final result will not be changed if X(z) is multiplied by z^{p-1} modulo (z^p - 1) and X_2(z) is multiplied by z modulo P(z). In this case, X_2(z) becomes

z X_2(z) \equiv \sum_{n=1}^{p-1} b_{n-1} z^n modulo P(z)    (5.45)

with

b_n = x_{n+1} - x_0    (5.46)

and \bar{X}_k reduces to

\bar{X}_k \equiv z X_2(z) modulo (z - W^k) = \sum_{n=1}^{p-1} b_{n-1} W^{nk} .    (5.47)

Since n, k \neq 0 in (5.47), this expression defines a (p - 1)-point convolution if the indices n and k are expressed as powers of a primitive root. Thus, for N a prime, Rader's algorithm is represented in polynomial notation as shown in Fig. 5.3. Note that the boxes shown in this figure for polynomial ordering and multiplications by z^{p-1} and z are given for illustrative purposes, but do not usually correspond to any processing since they merely indicate the origin of the data index. Thus, we shall usually delete such boxes in subsequent polynomial representations of DFTs.

Fig. 5.3. Polynomial representation of Rader's algorithm for a p-point DFT, p odd prime

The main contribution of the polynomial representation is that it greatly

simplifies the decomposition of composite DFTs into convolutions. If we consider, for instance, a 9-point DFT, we know that z^9 - 1 factors into 3 cyclotomic polynomials, since the only divisors of 9 are 1, 3, and 9. These polynomials are given by P_1(z) = z - 1, P_2(z) = (z^3 - 1)/(z - 1) = z^2 + z + 1, and P_3(z) = (z^9 - 1)/(z^3 - 1) = z^6 + z^3 + 1. Thus, the 9-point DFT can be computed as shown in Fig. 5.4 by successive reductions modulo P_1(z), P_2(z), and P_3(z). The reduction modulo (z^3 - 1) yields a 3-point DFT which can in turn be calculated as a 2-point convolution plus one multiplication by the approach of Fig. 5.3. The reduction modulo (z^6 + z^3 + 1) yields a reduced DFT which computes \bar{X}_k for k \not\equiv 0 modulo 3. This reduced DFT is evaluated with one 2-point convolution plus one 6-point convolution by (5.30) and (5.32), respectively.
Fig. 5.4. Computation of a 9-point DFT by Rader's algorithm. Polynomial representation

5.2.3 Short DFf Algorithms

We have seen in Chap. 3 that short convolutions can be computed very efficiently
by interpolation techniques. Thus, Rader's algorithm yields efficient implementa-
tions for small DFTs. In practice, we shall not have to use Rader's algorithm for
large DFTs because there are several other methods, to be discussed in the
following sections, which allow one to construct a large DFT from a limited set
of small DFTs. Thus, we shall be concerned here only with the efficient imple-
mentation of Rader's algorithm for small DFTs.
In practice, the convolutions derived from Rader's method are computed by
using the same techniques as those described in Chap. 3. However, some ad-
ditional simplification is possible because here the sequence of coefficients, W^{g^u}, has special properties.
Consider first the case of a p-point DFT, where p is an odd prime. Then, the convolution is of length d = p - 1, with d even. Since g^u modulo p generates a cyclic group of order p - 1, we have g^{p-1} \equiv 1 modulo p. Therefore, we have g^{(p-1)/2} \equiv -1 modulo p and

W^{g^{u+(p-1)/2}} = W^{-g^u} .    (5.48)

Moreover, since

W^{g^u} = \cos(2\pi g^u/p) - j \sin(2\pi g^u/p) ,    (5.49)

we have

W^{g^{u+(p-1)/2}} = \cos(2\pi g^u/p) + j \sin(2\pi g^u/p) ,    (5.50)

which corresponds to an even symmetry about the midpoint for the real parts of the coefficients and an odd symmetry for the imaginary parts. Thus, when the coefficient polynomial \sum_{u=0}^{d-1} W^{g^u} z^u is reduced modulo (z^{d/2} - 1) and modulo (z^{d/2} + 1), all the coefficients in the reduced polynomials become pure real numbers and pure imaginary numbers, respectively. This means that all complex multiplications reduce to the multiplication of a complex number by either a pure real or a pure imaginary number and are therefore implemented with only two real multiplications. This feature is common to all convolutions derived by partitioning a DFT via Rader's algorithm.
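This symmetry is easy to verify numerically; the short check below (ours, for p = 7 with primitive root 3) reduces the coefficient sequence modulo z^{d/2} - 1 and z^{d/2} + 1:

```python
import numpy as np

p, g = 7, 3
d = p - 1
W = np.exp(-2j * np.pi / p)
c = np.array([W ** pow(g, u, p) for u in range(d)])   # coefficients W^(g^u)
low, high = c[:d // 2], c[d // 2:]
print(np.allclose((low + high).imag, 0))   # reduction mod z^(d/2) - 1: pure real
print(np.allclose((low - high).real, 0))   # reduction mod z^(d/2) + 1: pure imaginary
```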
When N = p^c, where p is an odd prime, some additional simplification is possible [5.10]. We give in Sect. 5.5 the most frequently used small DFT algorithms and, in Table 5.1, the corresponding number of complex arithmetic operations.

Table 5.1. Number of complex operations for short DFTs computed by Rader's algorithm.
Trivial multiplications by ±1, ±j are given between parentheses. The number of real opera-
tions is twice the number of operations given in this table

DFT size N    Number of multiplications    Number of additions

2 2 (2) 2
3 3 (1) 6
4 4 (4) 8
5 6 (1) 17
7 9 (1) 36
8 8 (6) 26
9 11 (1) 44
16 18 (8) 74

In many applications, it is acceptable to have the DFT output multiplied by a constant integer. In particular, when the DFT method is used to compute circular convolutions of a fixed sequence with many data sequences, the transform of the fixed sequence is usually precomputed and can be premultiplied by the inverse of this constant. In such cases, it is possible to design improved short DFT algorithms in which the number of nontrivial multiplications is minimized. For N = p, with p an odd prime, this is done by using the property \sum_{u=0}^{p-2} W^{g^u} = -1, and this gives an algorithm with a scaling factor equal to p - 1 and with two trivial multiplications by \pm 1 instead of one. The corresponding numbers of operations are given in Sect. 5.5.

5.3 The Prime Factor FFT

For large DFTs, the derivation of Rader's algorithm becomes cumbersome and computationally inefficient. In this section, we shall discuss an alternative computation technique which allows one to compute a large DFT of size N by combining several small DFTs of sizes N_1, N_2, ..., N_d which are mutually prime factors of N. This technique, which is known today as the prime factor FFT, was proposed by Good [5.11, 12] prior to the introduction of the FFT and has both theoretical and practical significance. Its main theoretical contribution is in showing how a one-dimensional DFT can be mapped by simple permutations into a multidimensional DFT. This approach has also been shown recently [5.13, 14] to be of practical interest when it is combined with Rader's algorithm. Furthermore, Good's algorithm provides one of the foundations on which the very efficient Winograd Fourier transform algorithm is based [5.4].

5.3.1 Multidimensional Mapping of One-Dimensional DFI's

We first consider the simple case of a DFT \bar{X}_k of size N, where N is the product of two mutually prime factors N_1 and N_2

\bar{X}_k = \sum_{n=0}^{N-1} x_n W^{nk} ,    k = 0, ..., N - 1    (5.51)

W = e^{-2\pi j/N} ,    N = N_1 N_2 .    (5.52)

Our objective is to convert this one-dimensional DFT into a two-dimensional DFT of size N_1 x N_2. In order to do this, one must convert each of the indices n and k, defined modulo N, into two sets of indices n_1, k_1 and n_2, k_2, defined, respectively, modulo N_1 and modulo N_2. We have seen already, in Sect. 2.1.2, two different methods of doing this, one based on the Chinese remainder theorem, and the other on the use of simple permutations. We shall initially employ this simpler method by defining the index transformation

n \equiv N_1 n_2 + N_2 n_1 modulo N ,    n_1, k_1 = 0, ..., N_1 - 1
k \equiv N_1 k_2 + N_2 k_1 modulo N ,    n_2, k_2 = 0, ..., N_2 - 1 .    (5.53)

Note that this definition is valid only for (N_1, N_2) = 1. Now, since N_1 N_2 \equiv 0 modulo N, substituting n and k defined by (5.53) into (5.51) yields

\bar{X}_{N_1 k_2 + N_2 k_1} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + N_2 n_1} W_1^{N_2 n_1 k_1} W_2^{N_1 n_2 k_2}    (5.54)

with

W_1 = e^{-2\pi j/N_1} ,    W_2 = e^{-2\pi j/N_2} .    (5.55)

We note that (5.54) is a two-dimensional DFT of size N_1 x N_2, but with the exponents n_1 k_1 and n_2 k_2 permuted, respectively, by N_2 and N_1. Thus, in order to obtain the two-dimensional DFT in the conventional lexicographic order, it is convenient to replace k_1 and k_2 by their permuted values t_2 k_1 and t_1 k_2 such that N_2 t_2 \equiv 1 modulo N_1 and N_1 t_1 \equiv 1 modulo N_2. This is equivalent to replacing the mapping of k given by (5.53) with its Chinese remainder equivalent

k \equiv N_1 t_1 k_2 + N_2 t_2 k_1 modulo N .    (5.56)

Then, \bar{X}_k reduces to

\bar{X}_{N_1 t_1 k_2 + N_2 t_2 k_1} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + N_2 n_1} W_1^{n_1 k_1} W_2^{n_2 k_2} ,    (5.57)

which is the usual representation of a DFT of size N_1 x N_2. Thus, by using for n the permutation defined by (5.53), and for k the Chinese remainder correspondence defined by (5.56) (or vice versa), we are able to map a one-dimensional DFT of length N_1 N_2 into a two-dimensional DFT of size N_1 x N_2.
The same method can be used recursively to define a one-to-many multidimensional mapping. More precisely, if N is the product of d mutually prime factors N_i, with

N = N_1 N_2 ... N_d ,    (5.58)

then the one-dimensional DFT of length N is converted into a d-dimensional DFT of size N_1 x N_2 ... x N_d by the change of indices

n \equiv \sum_{i=1}^{d} (N/N_i) n_i modulo N ,    n_i = 0, ..., N_i - 1    (5.59)

k \equiv \sum_{i=1}^{d} (N/N_i) t_i k_i modulo N ,    k_i = 0, ..., N_i - 1 ,    (5.60)

where t_i is given by

(N/N_i) t_i \equiv 1 modulo N_i .    (5.61)

It can be verified easily that, in the product nk modulo N, with n and k defined by (5.59) and (5.60), all cross-products n_i k_u for i \neq u cancel, so that

nk \equiv \sum_{i=1}^{d} (N/N_i) n_i k_i modulo N ,    (5.62)

which demonstrates that the multidimensional representation indeed has the desired format.
Once the one-dimensional DFT has been converted into a multidimensional DFT, several different strategies can be used to compute the multidimensional DFT. We shall see, in Sect. 5.4, that one possible approach consists of nesting various N_i-point DFT algorithms. However, we shall first present here the original method described by Good, which is based on the conventional row-column approach to multidimensional DFT computation (Sect. 4.4) [5.11-13].
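A minimal sketch of the index mappings (5.59)-(5.61) is given below; it is ours, not the book's, the helper name good_maps is hypothetical, and it assumes Python 3.8 or later for the modular inverse pow(a, -1, m).

```python
from math import prod

def good_maps(factors):
    # factors must be mutually prime; returns the permutation (5.59) and CRT map (5.60)
    N = prod(factors)
    t = [pow(N // Ni, -1, Ni) for Ni in factors]          # (N/N_i) t_i = 1 mod N_i  (5.61)
    n_map, k_map = {}, {}
    def recurse(idx):
        if len(idx) == len(factors):
            n_map[idx] = sum(N // Ni * ni for Ni, ni in zip(factors, idx)) % N
            k_map[idx] = sum(N // Ni * ti * ki for Ni, ti, ki in zip(factors, t, idx)) % N
            return
        for i in range(factors[len(idx)]):
            recurse(idx + (i,))
    recurse(())
    return n_map, k_map

n_map, k_map = good_maps((3, 4))
# n*k mod 12 splits into independent products along each factor, cf. (5.62)
print(all((n_map[i] * k_map[j]) % 12 ==
          (12 // 3 * i[0] * j[0] + 12 // 4 * i[1] * j[1]) % 12
          for i in n_map for j in k_map))                 # expected: True
```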

5.3.2 The Prime Factor Algorithm

We now consider a two-dimensional DFT \bar{X}_{k_1,k_2} of size N_1 x N_2, with (N_1, N_2) = 1

\bar{X}_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1,n_2} W_1^{n_1 k_1} W_2^{n_2 k_2} ,    k_1 = 0, ..., N_1 - 1
                                                                                                         k_2 = 0, ..., N_2 - 1 .    (5.63)

This DFT is either a genuine two-dimensional DFT, or is derived from a one-dimensional DFT by the mapping defined by (5.53, 56, 57). Equation (5.63) can be rewritten as

\bar{X}_{k_1,k_2} = \sum_{n_2=0}^{N_2-1} \Big[ \sum_{n_1=0}^{N_1-1} x_{n_1,n_2} W_1^{n_1 k_1} \Big] W_2^{n_2 k_2} .    (5.64)

This illustrates that \bar{X}_{k_1,k_2} can be evaluated by first computing one DFT of N_1 terms for each value of n_2. This gives N_1 sets of N_2 points \bar{X}_{n_2,k_1} which are the input sequences to N_1 DFTs of N_2 points. Thus, with this method, \bar{X}_{k_1,k_2} is calculated with N_2 DFTs of length N_1 plus N_1 DFTs of length N_2. A detailed representation of the computation process is shown in Fig. 5.5 for a 12-point DFT using the 3-point and 4-point DFT algorithms of Sects. 5.5.2 and 5.5.3.
In order to evaluate the number of multiplications, M, and additions, A, which are necessary to compute a DFT by the prime factor algorithm, we assume that M_1, M_2 and A_1, A_2 are the numbers of multiplications and additions required to calculate the DFTs of lengths N_1 and N_2, respectively. Then, we have obviously

M = N_2 M_1 + N_1 M_2    (5.65)

A = N_2 A_1 + N_1 A_2 .    (5.66)

Fig. 5.5. Flow graph of a 12-point DFT computed by the prime factor algorithm

The same method can be extended recursively to cover the case of more than two factors. Thus, for a d-dimensional DFT, we have

N = N_1 N_2 ... N_d    (5.67)

and

M = \sum_{i=1}^{d} N M_i / N_i    (5.68)

A = \sum_{i=1}^{d} N A_i / N_i ,    (5.69)

where M_i and A_i are, respectively, the number of multiplications and additions for a DFT of size N_i.
Thus, the computation of a large DFT is reduced to that of a set of small DFTs of lengths N_1, N_2, ..., N_d. These small DFTs can be computed with any algorithm, but it is extremely attractive to use Rader's algorithm for this application because this particular algorithm is very efficient for small DFTs. Table 5.2 lists the number of nontrivial real arithmetic operations for various DFTs computed by the prime factor and Rader algorithms (Table 5.1). It can be seen, by comparison with Table 4.3, that this approach compares favorably with the FFT method.

Table 5.2. Number of nontrivial real operations for DFTs computed by the prime factor and
Rader algorithms

DFT size    Number of real       Number of real    Multiplications    Additions
N           multiplications M    additions A       per point M/N      per point A/N

30 100 384 3.33 12.80


48 124 636 2.58 13.25
60 200 888 3.33 14.80
120 460 2076 3.83 17.30
168 692 3492 4.12 20.79
240 1100 4812 4.58 20.05
504 2524 13388 5.01 26.56
840 5140 23172 6.12 27.59
1008 5804 29548 5.76 29.31
2520 17660 84076 7.01 33.36

Note that the data given in Table 5.2 apply to multidimensional DFTs as well as to one-dimensional DFTs. For example, it may be seen that 100 nontrivial multiplications are required to compute a DFT of size 30. The number of multiplications would be the same for DFTs of sizes 2 x 3 x 5, 6 x 5, 10 x 3, or 2 x 15, since the only difference between the members of this group is the index mapping. We shall see, however, in Chap. 7, that it is possible to devise even more efficient computation techniques for multidimensional DFTs. Therefore, the main utility of the prime factor algorithm resides in the calculation of one-dimensional DFTs.
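The following sketch (ours) checks the prime factor evaluation for N = 12 = 3 x 4, using the index maps (5.53) and (5.56); np.fft is used for the small DFTs purely for brevity, in place of the short Rader-type algorithms of Sect. 5.5.

```python
import numpy as np

N1, N2, N = 3, 4, 12
x = np.arange(N, dtype=complex)
t2 = pow(N2, -1, N1)                      # N2*t2 = 1 mod N1
t1 = pow(N1, -1, N2)                      # N1*t1 = 1 mod N2

# load the 1-D input into a 2-D array via the permutation (5.53)
xx = np.empty((N1, N2), dtype=complex)
for n1 in range(N1):
    for n2 in range(N2):
        xx[n1, n2] = x[(N1 * n2 + N2 * n1) % N]

# N2 DFTs of N1 points, then N1 DFTs of N2 points (row-column method, (5.64))
XX = np.fft.fft(np.fft.fft(xx, axis=0), axis=1)

# read the result out with the Chinese remainder mapping (5.56)
X = np.empty(N, dtype=complex)
for k1 in range(N1):
    for k2 in range(N2):
        X[(N1 * t1 * k2 + N2 * t2 * k1) % N] = XX[k1, k2]

print(np.allclose(X, np.fft.fft(x)))      # expected: True
```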

5.3.3 The Split Prime Factor Algorithm

We shall now show that the efficiency of the prime factor algorithm can be improved by splitting the calculations [5.10]. This can be seen by considering again a two-dimensional DFT of size N_1 x N_2

\bar{X}_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1,n_2} W_1^{n_1 k_1} W_2^{n_2 k_2} ,    k_1 = 0, ..., N_1 - 1
                                                                                                         k_2 = 0, ..., N_2 - 1 .    (5.70)

In order to simplify the discussion, we shall assume that N_1 and N_2 are both odd primes. In this case, Rader's algorithm reduces each of the DFTs of size N_1 or N_2 to one multiplication plus one correlation of size N_1 - 1 or N_2 - 1. Therefore, \bar{X}_{k_1,k_2} is evaluated via the prime factor algorithm as one DFT of N_1 points, one correlation of N_2 - 1 points and one correlation of (N_2 - 1) x (N_1 - 1) points, with

(5.71)

(5.72)

(5.73)

where h and g are primitive roots modulo N_1 and N_2 and

k_1 \equiv h^{v_1} modulo N_1 ,    n_1 \equiv h^{u_1} modulo N_1
k_2 \equiv g^{v_2} modulo N_2 ,    n_2 \equiv g^{u_2} modulo N_2
u_1, v_1 = 0, ..., N_1 - 2
u_2, v_2 = 0, ..., N_2 - 2 .    (5.74)

We note that the two-dimensional correlation defined by (5.73) is half separable. Hence we can compute this correlation by the row-column method as N_2 - 1 correlations of N_1 - 1 points plus N_1 - 1 correlations of N_2 - 1 points. If M_1 and M_2 are the numbers of complex multiplications required to compute the DFTs of lengths N_1 and N_2, the correlations of lengths N_1 - 1 and N_2 - 1 are computed, respectively, with M_1 - 1 and M_2 - 1 complex multiplications, because, for N prime, Rader's algorithm reduces a DFT of N points into one multiplication plus one correlation of N - 1 points. Under these conditions, the total number of complex multiplications required to compute \bar{X}_{k_1,k_2} reduces to

M = N_1 M_2 + N_2 M_1 - (N_1 + N_2 - 1) .    (5.75)

Since the conventional prime factor algorithm would have required N_1 M_2 + N_2 M_1 multiplications, splitting the computation eliminates N_1 + N_2 - 1 complex multiplications. When the two-dimensional convolution is reduced modulo cyclotomic polynomials, the various terms remain half separable and additional
savings can be realized. This can be seen more precisely by representing the DFT \bar{X}_{k_1,k_2} defined by (5.70) in polynomial notation, and employing an approach similar to that described in Sect. 5.2.2

X(z_1, z_2) \equiv \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1,n_2} z_1^{n_1} z_2^{n_2} modulo (z_1^{N_1} - 1), (z_2^{N_2} - 1)    (5.76)

\bar{X}_{k_1,k_2} \equiv X(z_1, z_2) modulo (z_1 - W_1^{k_1}), (z_2 - W_2^{k_2}) .    (5.77)

Ignoring the permutations and the multiplications by z_1, z_1^{N_1-1}, z_2, and z_2^{N_2-1}, we can use this polynomial formulation to represent the split prime factor algorithm very simply, as indicated by the diagram in Fig. 5.6, which corresponds to a DFT of size 5 x 7. With this method, the main part of the computation corresponds to the evaluation of a correlation of dimension 4 x 6 which can be regarded as a polynomial product modulo (z_1^4 - 1), (z_2^6 - 1). Since both z_1^4 - 1 and z_2^6 - 1 are composite, the computation of the correlation of dimension 4 x 6 can be split into that of the cyclotomic polynomials which are factors of z_1^4 - 1 and z_2^6 - 1 and given by

Fig. 5.6. Computation of a DFT of size 5 x 7 by the split prime factor algorithm

z_1^4 - 1 = (z_1 - 1)(z_1 + 1)(z_1^2 + 1)    (5.78)

z_2^6 - 1 = (z_2 - 1)(z_2 + 1)(z_2^2 + z_2 + 1)(z_2^2 - z_2 + 1) .    (5.79)


....
-'"
~
t;

~
::!1
[
::l
QQ

I
g"
g,
~

!61
POLYNOMIAL POLYNOMIAL POLYNOMIAL POLYNOMIAL POLYNOMIAL POLYNOMIAL 5.
~
PRODUCT PRODUCT PRODUCT PRODUCT PRODUCT PRODUCT
2
(~+l) , (~-Z2+}) (~-Z2+l) (~+}) Z2+Z2+} ~+Z2+}
(~-Z2+l) (~+Z2+})
l
0-
3on

Fig. 5.7. Calculation of the correlation of 4 x 6 points in the complete split prime factor evaluation of a
DFf of size 5 x 7
5.4 The Winograd Fourier Transform Algorithm (WFTA) 133

The complete method is illustrated in Fig. 5.7 with the various reductions modulo cyclotomic polynomials. Since all the expressions remain half separable throughout the decomposition, the two-dimensional polynomial products are computed by the row-column method. Thus, for instance, the two-dimensional polynomial product modulo (z_1^2 + 1), (z_2^2 + z_2 + 1) is calculated as 2 polynomial products modulo (z_2^2 + z_2 + 1) plus 2 polynomial products modulo (z_1^2 + 1).
With split prime factorization, a DFT of size 5 x 7 is evaluated with 76 complex multiplications and 381 additions if the correlation of size 4 x 6 is computed directly by the row-column method. If the computation of the correlation of size 4 x 6 is reduced to that of polynomial products, as shown in Fig. 5.7, this correlation is calculated with only 46 multiplications and 150 additions instead of 62 multiplications and 226 additions with the row-column method. Thus, the complete split prime factor computation reduces the total number of operations to 60 complex multiplications and 305 complex additions. By comparison, the conventional prime factor algorithm requires 87 complex multiplications and 299 additions. Thus, splitting the computations saves, in this case, about 30% of the multiplications.
The same split computation technique can also be applied to sequence lengths with more than two factors as well as to those with composite factors. It should also be noted that the computational savings provided by the method increase as a function of the DFT size. Thus, for large DFTs, the split prime factor method significantly reduces the number of arithmetic operations at the expense of requiring a more complex implementation.

5.4 The Winograd Fourier Transform Algorithm (WFTA)

We have seen in the preceding section that a composite DFT of size N, where N is the product of d mutually prime factors N_1, N_2, ..., N_d, can be mapped, by simple index permutations, into a multidimensional DFT of size N_1 x N_2 x ... x N_d. When this multidimensional DFT is evaluated by the conventional row-column method, the algorithm becomes the prime factor algorithm. In the following, we shall discuss another way of evaluating the multidimensional DFT which is based on a nesting algorithm introduced by Winograd [5.4, 15, 16]. This method is particularly effective in reducing the number of multiplications when it is combined with Rader's algorithm.

5.4.1 Derivation of the Algorithm

We commence with a DFT of size N_1 x N_2

\bar{X}_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1,n_2} W_1^{n_1 k_1} W_2^{n_2 k_2} ,    k_1 = 0, ..., N_1 - 1,    k_2 = 0, ..., N_2 - 1 .    (5.80)

This DFT may either be a genuine two-dimensional DFT or a one-dimensional DFT of length N = N_1 N_2, with (N_1, N_2) = 1, which has been mapped into a two-dimensional form using the index mapping scheme of Good's algorithm. We note that the two-dimensional DFT defined by (5.80) can be regarded as a one-dimensional DFT of length N_2 where each scalar is replaced by a vector of N_1 terms, and each multiplication is replaced by a DFT of length N_1. More precisely, the DFT defined by (5.80) can be expressed as

\bar{X}_{k_2} = \sum_{n_2=0}^{N_2-1} A X_{n_2} W_2^{n_2 k_2} ,    (5.81)

where X_{n_2} is an N_1-element column vector of the input data x_{n_1,n_2} and A is an N_1 x N_1 matrix of the complex exponentials W_1^{n_1 k_1}

X_{n_2} = [x_{0,n_2}, x_{1,n_2}, ..., x_{N_1-1,n_2}]^T    (5.82)

A = \{W_1^{n_1 k_1}\} ,    n_1, k_1 = 0, ..., N_1 - 1 .    (5.83)

Thus, the DFT defined by (5.81) is a DFT of length N_2 where each multiplication by W_2^{n_2 k_2} is replaced with a multiplication by W_2^{n_2 k_2} A. This last operation itself is equivalent to a DFT of length N_1 in which each multiplication by W_1^{n_1 k_1} is replaced with a multiplication by W_2^{n_2 k_2} W_1^{n_1 k_1}.
It can be seen that the Winograd algorithm breaks the computation of a DFT of size N_1 N_2 or N_1 x N_2 into the evaluation of small DFTs of lengths N_1 and N_2 in a manner which is fundamentally different from that corresponding to the prime factor algorithm. In fact, the method used here is essentially similar to the nesting method described by Agarwal and Cooley for convolutions (Sect. 3.3.1). The Winograd method is particularly interesting when the small DFTs are evaluated via Rader's algorithm. In this case, the small DFTs are calculated with A^1 input additions, M complex multiplications, and A^2 output additions. Thus, the Winograd algorithm for a DFT of size N_1 x N_2 can be represented as shown in Fig. 5.8. In this case, if M_2, A_2^1 and A_2^2 are the numbers of complex multiplications and input and output additions for the DFT of length N_2, the total number of multiplications M and additions A for the DFT of size N_1 x N_2 becomes
Fig. 5.8. Two-factor Winograd Fourier transform algorithm

M = M_1 M_2    (5.84)

A = N_1 A_2^1 + M_2 A_1 + N_1 A_2^2 ,    (5.85)

where M_1 and A_1 are, respectively, the number of multiplications and additions for the N_1-point DFT. Since the total number of additions A_2 for the N_2-point DFT is given by A_2 = A_2^1 + A_2^2, (5.85) reduces to

A = N_1 A_2 + M_2 A_1 .    (5.86)

The same method can be applied recursively to cover the case of more than two factors. Hence a multidimensional DFT of size N_1 x N_2 ... x N_d or a one-dimensional DFT of length N_1 N_2 ... N_d, where (N_i, N_k) = 1 for i \neq k, is computed with M multiplications and A additions, where M and A are given by

M = M_1 M_2 ... M_d    (5.87)

A = N_1 N_2 ... N_{d-1} A_d + M_d N_1 N_2 ... N_{d-2} A_{d-1} + ... + M_2 M_3 ... M_d A_1 ,    (5.88)

where M_i and A_i are the number of complex multiplications and additions for an N_i-point DFT.
Note that the number of additions depends upon the order in which the operations are executed. If we take, for instance, the two-factor algorithm, computing the DFT of N_1 x N_2 points as a DFT of N_1 points in which all multiplications are replaced by a DFT of size N_2 would give a number of additions

A = N_2 A_1 + M_1 A_2 .    (5.89)

The first nesting method will require fewer additions than the second nesting method if N_1 A_2 + M_2 A_1 < N_2 A_1 + M_1 A_2, or

(M_2 - N_2)/A_2 < (M_1 - N_1)/A_1 .    (5.90)

Thus, the values (M_i - N_i)/A_i characterize the order in which the various short algorithms must be nested in order to minimize the number of additions.
It should be noted that, in (5.84, 86-88), M and A are, respectively, the total numbers of complex multiplications and additions. However, when the small DFTs are computed by Rader's algorithm, all complex multiplications reduce to multiplications of a complex number by a pure real or a pure imaginary number and are implemented with only two real multiplications. Moreover, some of the multiplications in the small DFT algorithms are trivial multiplications by \pm 1, \pm j. Thus, if the number of such complex trivial multiplications is L_i for an N_i-point DFT, the number of nontrivial real multiplications becomes

M = 2 (M_1 M_2 ... M_d - L_1 L_2 ... L_d) .    (5.91)
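These counting rules are easy to mechanize. The short function below is ours (the name wfta_counts and the ordering convention are not from the book); it builds the nested operation counts from the Table 5.1 figures and reproduces the 1008-point entry of Table 5.3.

```python
# small[N] = (complex multiplications M, complex additions A, trivial multiplications L)
small = {2: (2, 2, 2), 3: (3, 6, 1), 4: (4, 8, 4), 5: (6, 17, 1),
         7: (9, 36, 1), 8: (8, 26, 6), 9: (11, 44, 1), 16: (18, 74, 8)}

def wfta_counts(factors):
    # factors[0] is the innermost nested algorithm; each further factor N_i wraps the
    # accumulated block following M = M_i * M_block and A = N_block * A_i + M_i * A_block,
    # cf. (5.84), (5.86); nontrivial real multiplications follow (5.91).
    N = factors[0]
    M, A, L = small[N]
    for Ni in factors[1:]:
        Mi, Ai, Li = small[Ni]
        A = N * Ai + Mi * A
        M, L, N = M * Mi, L * Li, N * Ni
    return 2 * (M - L), 2 * A            # real multiplications, real additions

print(wfta_counts([7, 9, 16]))           # (3548, 34668) for a 1008-point DFT, cf. Table 5.3
```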

We illustrate the Winograd method in Fig. 5.9 by giving the flow diagram cor-
responding to a 12-point DFT using the 3-point and 4-point DFT algorithms
of Sects. 5.5.2 and 5.5.3.

Table 5.3. Number of nontrivial real operations for one-dimensional DFTs computed by the
Winograd Fourier transform algorithm

DFT size    Number of real       Number of real    Multiplications    Additions
N           multiplications M    additions A       per point M/N      per point A/N

30 68 384 2.27 12.80


48 92 636 1.92 13.25
60 136 888 2.27 14.80
120 276 2076 2.30 17.30
168 420 3492 2.50 20.79
240 632 5016 2.63 20.90
420 1288 11352 3.07 27.03
504 1572 14540 3.12 28.85
840 2580 24804 3.07 29.53
1008 3548 34668 3.52 34.39
2520 9492 99628 3.77 39.53
Fig. 5.9. Flow graph of a 12-point DFT computed by the Winograd Fourier transform algorithm

Table 5.3 lists the number of nontrivial real operations for various DFTs
computed by the Winograd Fourier transform algorithm, with the small DFTs
evaluated by the algorithms given in Sect. 5.5 and calculated with the number of
operations summarized in Table 5.1. It can be seen, by comparison with the
prime factor technique (Table 5.2), that the Winograd Fourier transform algo-
rithm reduces the number of multiplications by about a factor of two for DFTs
of length 840 to 2520, while requiring a slightly larger number of additions. If
we now compare with the conventional FFT method, using, for instance, the
Rader-Brenner algorithm (Table 4.3), we see that the Winograd Fourier trans-
form algorithms reduce the number of multiplications by a factor of 2 to 3, with
a number of additions which is only slightly larger. These results show that the
principal contribution of the Winograd Fourier transform algorithm concerns a
reduction in number of multiplications. It should be noted, however, that the
short DFT algorithms can also be redesigned in order to minimize the number of
additions at the expense of a larger number of multiplications. Thus, the
Winograd Fourier transform approach is very flexible and allows one to adjust

the number of additions and multiplications in order to fit the requirement of a


particular implementation.
The WFTA is particularly well suited to computing the DFT of real data
sequences. In this case, all input additions and all multiplications are real, while
only some of the output additions are complex. Thus, contrary to other fast DFT
algorithms, the WFTA computes a DFT of real data with nearly half the number
of operations required for complex data, without any need for processing two
sets of real data simultaneously. Thus, the WFTA is an attractive approach for
processing of real data when storage must be conserved and in real time pro-
cessing applications where the delay required to process simultaneously two
consecutive blocks of real data cannot be tolerated.

5.4.2 Hybrid Algorithms

We have seen that a DFT of size N_1 x N_2 ... x N_d can be computed by either the prime factor algorithm or the Winograd Fourier transform algorithm. In order to compare these two methods more explicitly, we consider here a simple DFT of size N_1 x N_2. Using the prime factor technique, the number of multiplications is N_1 M_2 + N_2 M_1, while for the Winograd method it is equal to M_1 M_2. This means that the Winograd method requires a smaller number of multiplications than the prime factor technique if

N_1/M_1 + N_2/M_2 > 1 .    (5.92)

In this formula, M_1 and M_2 are the numbers of complex multiplications corresponding to the DFTs of N_1 points and N_2 points, respectively. Thus, M_1 \geq N_1 and M_2 \geq N_2. However, since the Rader algorithms are very efficient, M_1 and M_2 are only slightly larger than N_1 and N_2. Moreover, N_i/M_i decreases only slowly with N_i, so that the condition defined by (5.92) is almost always met, except for very large DFTs. Thus, the Winograd algorithm generally yields a smaller number of multiplications than the prime factor method. This can be seen more clearly by considering a DFT of length 2520. In this case, N/M = 0.53, which implies N_1/M_1 + N_2/M_2 = 1.06 for a DFT of size 2520 x 2520. Thus, even for such a large DFT, replacing the Winograd method by the split prime factor technique would change the number of multiplications only marginally.
If we now consider the number of additions, the situation is completely reversed. This is due to the fact that the prime factor algorithm computes a DFT of N_1 x N_2 points with N_1 A_2 + N_2 A_1 additions, while the Winograd method requires N_1 A_2 + M_2 A_1 additions. Since M_2 \geq N_2, the Winograd algorithm always requires more additions than the prime factor technique, except for M_2 = N_2.
A quick comparison of Tables 5.2 and 5.3 indicates that, for N \leq 168, the Winograd method requires fewer multiplications than the prime factor technique, and exactly the same number of additions. For larger DFTs, however, the prime factor method is better than the Winograd method from the standpoint of the number of additions. Thus, for large DFTs, it may be advantageous to combine the two methods when the relative cost of multiplications and additions is about the same [5.13]. For example, a 1008-point DFT could be computed by calculating a DFT of size 16 x 63 via the prime factor technique and calculating the DFTs of 63 terms by the Winograd algorithm. In this case, the DFT would be computed with 4396 real multiplications and 31852 real additions, as opposed to 3548 real multiplications and 34668 additions for the Winograd method and 5804 multiplications and 29548 additions for the prime factor technique. Thus, a combination of the two methods allows one to achieve a better balance between the number of additions and the number of multiplications.

5.4.3 Split Nesting Algorithms

We have seen in Sect. 5.3.3 that a multidimensional DFT can be converted into a set of one-dimensional and multidimensional convolutions by a sequence of reductions if the small DFTs are computed by Rader's algorithm. In particular, if N_1 and N_2 are odd primes, a DFT of size N_1 x N_2 can be partitioned into a DFT of length N_1 plus one convolution of length N_2 - 1 and another of size (N_1 - 1) x (N_2 - 1). This is shown in Fig. 5.6 for a DFT of size 5 x 7. Thus, the Winograd algorithm can be regarded as equivalent to converting a DFT into a set of one-dimensional and multidimensional convolutions, and computing the multidimensional convolutions by a nesting algorithm (Sect. 3.3). Consequently, it can be inferred from Sect. 3.3.2 that a further reduction in the number of additions could be obtained by replacing the conventional nesting of convolutions by a split nesting technique. With such an approach, the short DFTs are reduced to convolutions by the Rader algorithm discussed in Sect. 5.2 and the convolutions are in turn reduced into polynomial products, defined modulo cyclotomic polynomials.
In practice, however, this method cannot be directly applied without minor modifications. Alert readers will notice that the number of additions corresponding to some of the short DFT algorithms in Sect. 5.5 does not tally with the number of operations derived directly from the reduction into convolutions discussed in Sect. 5.2. A 7-point DFT, for instance, is computed by the algorithm described in Sect. 5.5.5 with 9 multiplications and 36 additions. The same DFT, evaluated by Rader's algorithm according to Fig. 5.3, however, requires 12 additions for the reductions modulo (z - 1) and (z^7 - 1)/(z - 1) and 34 additions for the 6-point convolution, a total of 46 additions. This difference is due to the fact that the reductions can be partly embedded in the calculation of the convolutions. In the case of an N-point DFT, with N an odd prime, this procedure reduces the number of operations to one convolution of length N - 1 plus 2 additions instead of one (N - 1)-point convolution plus 2(N - 1) additions.
Thus, direct application of the split nesting algorithm is not attractive because it reduces the number of additions in algorithms which already have an inflated number of additions.
In order to overcome this difficulty, one approach consists in expressing the polynomial products modulo irreducible cyclotomic polynomials of degree higher than 1 in the optimum short DFT algorithms of Sect. 5.5. This is done in Sects. 5.5.4, 5, 7, and 8, respectively, for DFTs of lengths 5, 7, 9, and 16. With this procedure, a 5-point DFT breaks down (Fig. 5.10) into 14 input and output additions, 3 multiplications, and one polynomial product modulo (z^2 + 1), while the 7-point DFT reduces into 30 input and output additions, 3 multiplications, and the polynomial products modulo (z^2 + z + 1) and modulo (z^2 - z + 1). Therefore, nesting these two DFTs to compute a 35-point DFT requires 9 multiplications, 3 polynomial products modulo (z_1^2 + 1), 3 polynomial products modulo (z_2^2 + z_2 + 1) and modulo (z_2^2 - z_2 + 1), and the polynomial products modulo (z_1^2 + 1), (z_2^2 + z_2 + 1) and modulo (z_1^2 + 1), (z_2^2 - z_2 + 1). This defines an algorithm with 54 complex multiplications and 305 additions, as opposed to 54 multiplications and 333 additions for conventional nesting.

Fig. 5.10. 5-point DFT algorithm

In Table 5.4, we give the number of real additions for various DFTs computed by split nesting. The nesting and split nesting techniques require the same number of multiplications, and it can be seen, by comparing Tables 5.3 and 5.4,

Table 5.4. Number of real additions for one-dimensional DFTs computed by the Winograd Fourier transform algorithm and split nesting

DFT size    Number of real    Additions
N           additions A       per point A/N

240 4848 20.20


420 10680 25.43
504 13580 26.94
840 23460 27.93
1008 30364 30.12
2520 86764 34.43

that, for large DFTs, the split nesting method eliminates 10 to 15% of the additions required by the conventional nesting approach. Additional reduction is obtained when the split nesting method is used to compute large multidimensional DFTs. The implementation of the split nesting technique can be greatly simplified by storing composite split nested DFT algorithms, thus avoiding the complex data manipulations required by split nesting. With this approach, a 504-point DFT, for instance, can be computed by conventionally nesting an 8-point DFT with a 63-point DFT algorithm that has been optimized by split nesting.

5.4.4 Multidimensional DFTs

Until now, we have considered the use of the Winograd nesting method only for the computation of DFTs of size N_1 N_2 ... N_d or N_1 x N_2 x ... x N_d, where the various factors N_i are mutually prime in pairs. Note, however, that the condition (N_i, N_u) = 1 for i \neq u is necessary only to convert one-dimensional DFTs into multidimensional DFTs by Good's algorithm. Thus, the Winograd nesting algorithm can also be employed to compute any multidimensional DFT of size N_1 x N_2 x ... x N_d where the factors N_i are not necessarily mutually prime in pairs. If each dimension N_i is composite, with N_i = N_{i,1} N_{i,2} ... N_{i,e_i} and (N_{i,u}, N_{i,v}) = 1 for u \neq v, the index change in Good's algorithm maps the d-dimensional DFT into a multidimensional DFT of dimension e_1 e_2 ... e_d.
In order to illustrate the impact of this approach, we give in Table 5.5 the number of real operations for various multidimensional DFTs computed by the Winograd algorithm. It can be seen that the Winograd method is particularly effective for this application, since a DFT of size 1008 x 1008 is calculated with only 6.25 real multiplications per point, or about 2 complex multiplications per point. Moreover, for large multidimensional DFTs, the split nesting technique gives significant additional reduction in the number of additions. For a DFT of size 1008 x 1008, for instance, split nesting reduces the number of real additions

Table 5.5. Number of nontrivial real operations for multidimensional DFTs computed by the
Winograd Fourier transform algorithm

DFT size         Number of real       Number of real    Multiplications    Additions
                 multiplications      additions         per point          per point

24 x 24 1080 12096 1.87 21.00


30 x 30 2584 24264 2.87 26.96
40 x 40 4536 44736 2.83 27.96
48 x 48 5704 63720 2.48 27.66
72x72 15416 180032 2.97 34.73
120 x 120 41400 517824 2.87 35.96
240 x 240 209824 2683584 3.64 46.59
504 x 504 1254456 17742656 4.94 69.85
1008 x 1008 6350920 93076776 6.25 91.61
120 x 120 x 120 5971536 97203456 3.46 56.25
240 x 240 x 240 68023424 1086647616 4.92 78.61

from 93076776 (conventional nesting) to 64808280, a saving of about 30% in the number of additions.

5.4.5 Programming and Quantization Noise Issues

The Winograd Fourier transform algorithm requires substantially fewer multiplications than the FFT method. This reduction is achieved without any significant increase in the number of additions and, in some favorable circumstances, the number of additions for the Winograd technique may also be fewer than for the FFT method. This result is quite remarkable in a theoretical sense because the FFT had long been thought to be the optimum computation technique for DFTs. It is a major achievement in computational complexity theory to have shown that a method radically different from the FFT could be computationally more efficient.
A key issue in the application of the Winograd method concerns its ability to be translated into computer programs that would be more effective than conventional FFT programs. Clearly, the number of arithmetic operations is in no way the only measure of computational complexity and, at the time of this writing (1979), there have been conflicting reports on the relative efficiencies of the WFTA and FFT algorithms, with actual WFTA computer execution times reported to be within about ±30 percent of those for the FFT for DFTs of about 500 to 1000 points [5.17, 19]. We shall not attempt to resolve these differences here, but we shall instead try to compare FFT programming with WFTA programming qualitatively. A practical WFTA program can be found in [5.9].
We can first note that, since the WFTA is most effective in reducing the number of multiplications, WFTA programs can be expected to be relatively more efficient when run on machines in which the execution times for multiplication

are significantly longer than for addition. Another important factor concerns the relative size of FFT and WFTA programs. When the FFT programs are built around a single radix FFT algorithm (usually a radix-2 or radix-4 FFT algorithm), the computation proceeds by repetitive use of a subroutine implementing the FFT butterfly operation. Thus, the FFT programs can be very compact and essentially independent of the DFT size, provided that the DFT size is a power of the radix. By contrast, the WFTA uses different computation kernels for each DFT size and each of these is an explicit description of a particular small DFT algorithm, as opposed to the recursive, algorithmic structure used in the FFT. Thus, WFTA programs usually require more instructions than FFT programs for DFTs of comparable size and they must incorporate a subroutine which selects the proper computation kernels as a function of the DFT size. This feature prompts one to organize the program structure in two steps: a generation step and an execution step. The program can then be designed in such a way that most bookkeeping operations such as data routing and kernel selection, as well as precomputation of the multipliers, are done within the generation step and therefore do not significantly impact the execution time.
The WFTA program is divided into five main parts: input data reordering, input additions, multiplications, output additions, and output data reordering. The input and output data reordering requires a number of modular multiplications and additions which can be eliminated by precomputing reordering vectors during the generation step. These stored reordering vectors may then be used to rearrange the input and output data during the execution step. The input additions, except for the innermost factor, correspond to a set of additions that is executed for each factor N_i, and operates on N_i input arrays to produce M_i output arrays. Since M_i is generally larger than N_i, the calculations cannot generally be done "in-place". Thus, the generated result of each stage cannot be stored over the input data sequence to the stage. However, it is always possible to assign N_i input storage locations from the M_i output storage locations and, since M_i is not much larger than N_i, this results in an algorithm that is not significantly less efficient than an in-place algorithm, as far as memory utilization is concerned. The calculations corresponding to the innermost factor N_d are executed on scalar data and include all the multiplications required by the algorithm to compute the N-point DFT. If M is the total number of multiplications corresponding to the N-point DFT and M_d is the number of multiplications corresponding to the N_d-point small DFT algorithm, the calculations for the innermost factor N_d reduce to M/M_d DFTs of N_d points. The M_d coefficients here are those of the N_d-point DFT, multiplied by the coefficients of the other small DFT algorithms. In order to avoid recalculating this set of M_d coefficients for each of the M/M_d DFTs of N_d points, one is generally led to precompute, at generation time, a vector of M coefficients divided into M/M_d sets of M_d coefficients. Since these coefficients are simply real or imaginary, a total of M real memory locations are required, or significantly less than for an FFT algorithm in which the coefficients are precomputed.
From this, we can conclude that, although the WFTA is not an in-place algorithm, the total memory requirement for storing data and coefficients can be about the same as that of the FFT algorithm. The program sizes will generally be larger for the WFTA than for the FFT, but remain reasonably small, because the number of instructions grows approximately as the sum of the numbers of additions corresponding to each small algorithm. Thus, if N = N_1 N_2 ... N_i ... N_d and if A_i is the number of additions corresponding to an N_i-point DFT, \sum_i A_i is a rough measure of program size. \sum_i A_i grows very slowly with N, as can be verified by noting that \sum_i A_i = 25 for N = 30 and \sum_i A_i = 154 for N = 1008. Thus WFTA program size and memory requirements can remain reasonably small, even for large DFTs, provided that the programs are properly designed to work on array organized data. Hence, the WFTA seems to be particularly well suited for systems, such as APL, which have been designed to process array data efficiently.
Another important issue concerns the computational noise of the Winograd algorithms, and only scant information is currently available on this topic. The preliminary results given in [5.18] tend to indicate that proper scaling at each stage is more difficult than for the FFT because all moduli are different and not powers of 2. In the case of fixed point data, this significantly impacts the signal-to-noise ratio of the WFTA, and thus, the WFTA generally requires about one or two more bits for representing the data to give an error similar to the FFT.

5.5 Short DFT Algorithms

This section lists the short DFT algorithms that are most frequently used with the prime factor method or the WFTA. These algorithms compute short N-point one-dimensional DFTs of a complex input sequence x_n

\bar{X}_k = \sum_{n=0}^{N-1} x_n W^{nk} ,    k = 0, ..., N - 1,    W = e^{-2\pi j/N},    j = \sqrt{-1}    (5.93)

for N = 2, 3, 4, 5, 7, 8, 9, 16.

These algorithms are derived from Rader's algorithm and arranged as follows:
input data x_0, x_1, ..., x_{N-1} and output data X_0, X_1, ..., X_{N-1} in natural order.
m_0, m_1, ..., m_{M-1} are the results of the M multiplications corresponding to length N.
t_1, t_2, ... and s_1, s_2, ... are temporary storage for input data and output data, respectively.
The operations are executed in the order t_i, m_i, s_i, X_i, with indices in natural order. For DFTs of lengths 5, 7, 9, 16, the operations can also be executed using the form shown in Sects. 5.5.4, 5, 7, 8, which embeds the various polynomial products.
The figures between parentheses indicate trivial multiplications by ±1, ±j.
At the end of each algorithm description for N = 3, 5, 7, 9, we give the number of operations for the corresponding algorithm in which the number of nontrivial multiplications is minimized and the output is scaled by a constant factor.

5.5.1 2-Point DFT

2 multiplications (2), 2 additions

m_0 = 1·(x_0 + x_1)
m_1 = 1·(x_0 - x_1)
X_0 = m_0
X_1 = m_1

5.5.2 3-Point DFT

u = 2π/3, 3 multiplications (1), 6 additions

t_1 = x_1 + x_2
m_0 = 1·(x_0 + t_1)
m_1 = (cos u - 1)·t_1
m_2 = j sin u·(x_2 - x_1)
s_1 = m_0 + m_1
X_0 = m_0
X_1 = s_1 + m_2
X_2 = s_1 - m_2

Corresponding algorithm with scaled output:
3 multiplications (2), 8 additions, scaling factor: 2
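A quick numerical check of the 3-point algorithm above (the transcription into NumPy is ours, not the book's):

```python
import numpy as np

u = 2 * np.pi / 3
x = np.array([1.0 + 2j, -0.5 + 1j, 3.0 - 1j])
t1 = x[1] + x[2]
m0 = 1 * (x[0] + t1)
m1 = (np.cos(u) - 1) * t1
m2 = 1j * np.sin(u) * (x[2] - x[1])
s1 = m0 + m1
X = np.array([m0, s1 + m2, s1 - m2])
print(np.allclose(X, np.fft.fft(x)))   # expected: True
```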

5.5.3 4-Point DFT

4 multiplications (4), 8 additions

t_1 = x_0 + x_2        t_2 = x_1 + x_3
m_0 = 1·(t_1 + t_2)    m_1 = 1·(t_1 - t_2)
m_2 = 1·(x_0 - x_2)    m_3 = j·(x_3 - x_1)
X_0 = m_0
X_1 = m_2 + m_3
X_2 = m_1
X_3 = m_2 - m_3

5.5.4 5-Point DFT

u = 2π/5, 6 multiplications (1), 17 additions

t_1 = x_1 + x_4        t_2 = x_2 + x_3        t_3 = x_1 - x_4        t_4 = x_3 - x_2
t_5 = t_1 + t_2
m_0 = 1·(x_0 + t_5)
m_1 = [(cos u + cos 2u)/2 - 1]·t_5
m_2 = [(cos u - cos 2u)/2]·(t_1 - t_2)

Polynomial product modulo (z^2 + 1)
m_3 = -j(sin u)·(t_3 + t_4)
m_4 = -j(sin u + sin 2u)·t_4
m_5 = j(sin u - sin 2u)·t_3

s_1 = m_0 + m_1        s_2 = s_1 + m_2        s_3 = m_3 - m_4
s_4 = s_1 - m_2        s_5 = m_3 + m_5

X_0 = m_0              X_3 = s_4 - s_5
X_1 = s_2 + s_3        X_4 = s_2 - s_3
X_2 = s_4 + s_5

Corresponding algorithm with scaled output:
6 multiplications (2), 21 additions, scaling factor: 4

5.5.5 7-Point DFT

u = 2π/7, 9 multiplications (1), 36 additions

t_1 = x_1 + x_6        t_2 = x_2 + x_5        t_3 = x_3 + x_4
t_4 = t_1 + t_2 + t_3  t_5 = x_1 - x_6        t_6 = x_2 - x_5
t_8 = t_1 - t_3        t_9 = t_3 - t_2
t_11 = t_7 - t_5       t_12 = t_6 - t_7

m_0 = 1·(x_0 + t_4)
m_1 = [(cos u + cos 2u + cos 3u)/3 - 1]·t_4

t_13 = -t_8 - t_9
m_2 = [(2 cos u - cos 2u - cos 3u)/3]·t_8
m_3 = [(cos u - 2 cos 2u + cos 3u)/3]·t_9
m_4 = [(cos u + cos 2u - 2 cos 3u)/3]·t_13

Polynomial product
modulo (z^2 + z + 1)

m_5 = -j[(sin u + sin 2u - sin 3u)/3]·t_10

t_14 = -t_11 - t_12

m_6 = j[(2 sin u - sin 2u + sin 3u)/3]·t_11
m_7 = j[(sin u - 2 sin 2u - sin 3u)/3]·t_12
m_8 = j[(sin u + sin 2u + 2 sin 3u)/3]·t_14
Sz = -m6 - m, Polynomial product
SJ = m6 + ms modulo (ZZ - z + 1)
S4 = + ml
mo
S, = S4 + So - SI Ss = m, - Sz
SIO = m, + Sz + S3

Xo = mo XI = S, + Ss Xz = S6 + S9
X4 = S, + SIO Jl, = S6 - S9 X6 = S, - Ss

Corresponding algorithm with scaled output:


9 multiplications (2), 43 additions, scaling factor: 6

5.5.6 8-Point DFT

u = 2π/8, 8 multiplications (6), 26 additions

t_1 = x_0 + x_4        t_2 = x_2 + x_6        t_3 = x_1 + x_5
t_4 = x_1 - x_5        t_5 = x_3 + x_7        t_6 = x_3 - x_7
t_7 = t_1 + t_2        t_8 = t_3 + t_5

m_0 = 1·(t_7 + t_8)    m_1 = 1·(t_7 - t_8)
m_2 = 1·(t_1 - t_2)    m_3 = 1·(x_0 - x_4)
m_4 = cos u·(t_4 - t_6)    m_5 = j·(t_5 - t_3)
m_6 = j·(x_6 - x_2)    m_7 = -j sin u·(t_4 + t_6)

s_1 = m_3 + m_4        s_2 = m_3 - m_4
s_3 = m_6 + m_7        s_4 = m_6 - m_7

X_0 = m_0              X_1 = s_1 + s_3        X_2 = m_2 + m_5
X_3 = s_2 - s_4        X_4 = m_1              X_5 = s_2 + s_4
X_6 = m_2 - m_5        X_7 = s_1 - s_3

5.5.7 9-Point DFT

u = 2π/9, 11 multiplications (1), 44 additions

t_1 = x_1 + x_8        t_2 = x_2 + x_7        t_3 = x_3 + x_6
t_4 = x_4 + x_5        t_5 = t_1 + t_2 + t_4  t_6 = x_1 - x_8
t_7 = x_7 - x_2        t_8 = x_3 - x_6        t_9 = x_4 - x_5
t_10 = t_6 + t_7 + t_9 t_11 = t_1 - t_2       t_12 = t_2 - t_4
t_13 = t_7 - t_6       t_14 = t_7 - t_9

m_0 = 1·(x_0 + t_3 + t_5)
m_1 = (3/2)·t_3
m_2 = -t_5/2

t_15 = -t_12 - t_11

m_3 = [(2 cos u - cos 2u - cos 4u)/3]·t_11
m_4 = [(cos u + cos 2u - 2 cos 4u)/3]·t_12
m_5 = [(cos u - 2 cos 2u + cos 4u)/3]·t_15

Polynomial product
modulo (z^2 + z + 1)

m_6 = -j sin 3u·t_10
m_7 = -j sin 3u·t_8

t_16 = -t_13 + t_14
m_8 = j sin u·t_13
m_9 = j sin 4u·t_14
m_10 = j sin 2u·t_16

Polynomial product
modulo (z^2 - z + 1)

S4 + m2 + m2
= mo
S6 = S4 + m 2 S, = s, - So

Ss = SI + s, S9 = So - SI + s,

Xo =mo XI = s, + SIO X2 = Ss - SII

X3 = S6 + m6 X 4 = S9 + S12 X, = S9 - S12

X6 = S6 - m6 X, = Ss + SII XS = s, - SID

Corresponding algorithm with scaled output:
11 multiplications (3), 45 additions, scaling factor: 2

5.5.8 16-Point DFT

u = 2π/16, 18 multiplications (8), 74 additions


12 = X4 + XI2 13 = X2 + XIO
14 = X2 - XIO t, = X6 + XI4 t6 = X6 - XI4
I, = XI + X9 ts = XI - X9 t9 = X3 + XII

tlO = X3 - XII til = X, + XI3


t l3 = X, + Xu lu = tl + 12
t l6 = t3 + I, tl, = + t l6
lu tl8 = t, + III

t l9 = t, - til t 20 = t9 + tl3 t21 = 19 - t l3

t22 = 118 + 120 123 = Is + tl4 124 = Is - 114


12, = tiD + 112

mO = l.(t" + (22) ml = 1'(11, - (22 )


m2 = 1.(lu - t 16) m3 = 1.(11 - ( 2)
m4 = I.(xo - xs) m, = cos 2U'(119 - ( 21 )

m6 = cos 2U'(14 - ( 6)

m, = cos 3U.(124 + (26)


ms = (cos U + cos 3U)" 24

m9 = (cos 3u - cos U)'126 Polynomial product


S, = ms - m, modulo (Z2 + I)
mlO =j'(120 - (18) mil = j'(I, - ( 3)

ml2 = j,(x12 - x 4 ) ml3 = -j sin 2U'(119 + (21)



mlS = -j sin 3u o (t23 + (25)


ml 6 = j (sin 3u - sin u)ohJ
ml7 = -j (sin u + sin 3u)ot2S
Polynomial product
modulo(z2 + 1)

SI = m3 + ms S2 = m3 - ms S3 = mil + ml3
S4 = m13 - mil Ss = m 4 + m6 S6 = m 4 - m6

S9 = Ss + S7 SIO = Ss - S7 SII = S6 + Ss

Sl2 = S6 - Ss

SI3 = m l2 + ml 4 SI4 = ml2 - ml4 SI7 = S13 + SIS


SIS = SI3 - SIS SI9 = + SI6
SI4 S20 = SI4 - SI6

Xo =mo XI = S9 + SI7 X2 = SI + S3
X3 = Sl2 - S20 X 4 = m2 + mlO X5 = SII + SI9

X6 = S2 + S4 X7 = SIO - SIS Xs =ml

X9 = SIO + SIS XIO = S2 - S4 Xli = SII - SI9

X l2 = m 2 - mlO X I3 = SI2 + S20 X I4 = SI - S3

XIS = S9 - SI7
6. Polynomial Transforms

The main objective of this chapter is to develop fast multidimensional filtering


algorithms. These algorithms are based on the use of polynomial transforms
which can be viewed as discrete Fourier transforms defined in rings of poly-
nomials. Polynomial transforms can be computed without mUltiplications using
ordinary arithmetic, and produce an efficient mapping of multidimensional
convolutions into one-dimensional convolutions and polynomial products.
In this chapter, we first introduce polynomial transforms for the calculation
of simple convolutions of size p X p, withp prime. We then extend the definition
of polynomial transforms to other cases, and establish that these transforms,
which are defined in rings of polynomials, will indeed support convolution. As
a final item, we also discuss the use of polynomial transforms for the convolution
of complex data sequences and multidimensional data structures of dimension d,
with d > 2.

6.1 Introduction to Polynomial Transforms

We consider a two-dimensional circular convolution of size N X N, with


N-I N-I
Yu.l = ~ ~ h n•m Xu-n.l- m U, I = 0, ... , N - 1. (6.1)
m=O n=O

In order to simplify this expression, we resort to a representation in polynomial


algebra. This is realized by noting that (6.1) can be viewed as the one-dimen-
sional polynomial convolution

(6.2)

N-I
Hm(z) = ~ hn•m zn, m = 0, ... , N - I (6.3)
11=0

N-I
Xr(z)=~x•. rz·, r=O, ... ,N-I, (6.4)
s=o

where Yu.l is obtained from the N polynomials

1= 0, ... , N - 1 (6.5)
152 6. Polynomial Transforms

by taking the coefficients of z" in Y/(z). In order to introduce the concept of


polynomial transforms in a simple way, we shall assume in this section that N
is an odd prime, with N = p. In this case, as shown in Sect. 2.2.4, zp - 1 is the
product of two cyclotomic polynomials

zp - 1 = (z - l)P(z) (6.6)

P(z) = Zp-l + zp-z + ... + 1. (6.7)

Since Y/(z) is defined modulo (zP - 1), it can be computed by reducing Hm(z)
and X,(z) modulo (z - 1) and P(z), computing the polynomial convolutions
Y1,/(z) == Y/(z) modulo P(z), and Yz,/ == Y/(z) modulo (z - 1) on the reduced
polynomials and reconstructing Y/(z) by the Chinese remainder theorem (Sect.
2.2.3) with

(6.8)

ISI(Z) == 1
SI(Z) == 0
Sz(z)
Sz(z)
== 0 modulo P(z)
== 1 modulo (z - 1) (6.9)

with

SI(Z) = [p - P(z)]/p (6.10)

and

Sz(z) = P(z)/p. (6.11)

Computing Y/(z) is therefore replaced by the simpler problem of computing


Y1,/(z) and Yz,/. Yz,/ can be obtained very simply because it is defined modulo
(z - 1). Thus, Yz,/ is the convolution product of the scalars Hz,m and X z"
obtained by substituting 1 for z in Hm(z) and X,(z)
p-l
Yz,/ = ~ Hz,m XZ,I-m, i=O, ... ,p-1 (6.12)
m-O

p-l
~ h",m
Hz,m = n=O (6.13)

p-l
X z" = ~x.". (6.14)
.=0

We now turn to the computation of Y1,/(z). In order to simplify this computa-


tion, we introduce a transform Hk(z) which has the same structure as the DFT,
but with the usual complex exponential operator replaced by one defined as an
exponential on the variable z and with all operations defined modulo P(z). This
6.1 Introduction to Polynomial Transforms 153

transform, which we call a polynomial transform [6.1, 2] is defined by the expres-


sion

k=O, ... ,p-l (6.15)

XJ,r(Z) == Xr(z) modulo P(z). (6.16)

We define similarly an inverse transform by

XJ,r(Z) == ..L
p
Pi: Xk(z)z-rk
k-O
modulo P(z),

r=O, ... ,p-l. (6.17)

We shall now establish that the polynomial transforms support circular convolu-
tion, and that (6.17) is the inverse of(6.15). This can be demonstrated by calcu-
lating the transforms Hk(z) and Xk(z) of HJ,m(z) and XJ,r(z) via (6.15), mUltiplying
Hk(z) by Xiz) modulo P(z), and computing the inverse transform QI(Z) of Hk(z)
Xk(z). This can be denoted as

p-I p-I 1 p-I


Q/(z) == 2.: 2.: HJ,m(z)XJ,r(z) - 2.: zQk modulo P(z) (6.18)
m-O r-O P k=O

+r -
p-I
with q = m = 2.: zqk. Since zP == 1, the exponents of z are defined
l. Let S
k-O
modulo p. For q == 0 modulo p, S = p. For q $. 0 modulo p, the set of exponents
qk defined modulo p is a simple permutation of the integers 0, 1, ... , p - 1. Thus,
p-I
S == 2.: Zk == P(z) == 0 modulo P(z). This means that the only nonzero case cor-
k-O
responds to q == 0 or r == I - m modulo p and that Q/(z) reduces to the circular
convolution

(6.19)

The demonstration that the polynomial transform (6.15) and the inverse poly-
nomial transform (6.17) form a transform pair follows immediately by setting
HJ,m(z) = 1 in (6.19).
Using the foregoing method, YJ./(z) is computed with three polynomial
transforms and p polynomial multiplications Hk(z)Xk(z) defined modulo P(z).
In many digital filtering applications, the input sequence h",m is fixed and its
transform Hk(z) can be precomputed. In this case, only two polynomial trans-
forms are required, and the Chinese remainder reconstruction can also be
simplified by noting, with (2.87), that

SJ(z) == (z - 1)/[(z - 1) modulo P(z)] == (z - 1) TJ(z) (6.20)


154 6. Polynomial Transforms

T1(z) = [_Zp-2 - 2zp - 3 ... - (p - 3)Z2 - (p - 2)z + 1- p]/p. (6.21)

Since T1{z) is defined moduloP{z), premultiplication by T1{z) can be accom-


plished prior to Chinese remainder reconstruction and merged with the precom-
putation of the transform of the fixed sequence. Similarly, the multiplication by
l/p required in (6.11) for the part of the Chinese remainder reconstruction re-
lated to Y2•1 can be combined with the computation of the scalar convolution
Y2 • 1 in (6.12) so that the Chinese remainder reconstruction reduces to

Y1{z) == (z - I)Y1.lz) + (Zp-l + Zp-2 + ... + 1)Y2,1 modulo(zp - 1). (6.22)

Under these conditions, a convolution of size p X P is computed as shown in


Fig. 6.1. With this procedure, the reduction modulo (z - 1) and the Chinese

POLYNOMIAL
TRANSFORM ..-
(lip) H:z.m
MODULO pYZ)
SIZE p. ROOT Z

P POLYNOMIAL
MULTlPliCA TlONS
MODULO pYZ)
..
INVERSE
POLYNOMIAL
TRANSFORM
MODULO pYZ)
SIZE P. ROOT Z

Fig. 6.1. Computation of a two-dimensional convolu-


Yu'/ tion of size p x p by polynomial transforms. p prime
6.2 General Definition of Polynomial Transforms 155

reconstruction are calculated, respectively, by (6.14) and (6.22) without the use
of multiplications. The reductions modulo P(z) also require no multiplications
because Zp-I == _Zp-2 - Zp-3 ... - 1 modulo P(z), which gives for X1,r(z)

(6.23)

The polynomial transform Xk(z) defined by (6.15) requires only multiplications


by zrk and additions. Polynomial addition is executed by adding separately the
coefficients corresponding to each power of z. For multiplications by powers of
z, we note that, since P(z) is a factor of zP - 1,

X1,r(z)zrk == [X1,r(z)zrk modulo (zp - 1)] modulo P(z). (6.24)

Moreover, the congruence relation zP == 1 implies that rk is defined modulo p.


Thus, setting q == rk modulo p and computing X1,r(z)zrk by (6.23) and (6.24)
yields

XI,p-l,r = ° (6.25)

p-q-I p-I
X1,r(z)zq modulo (zP - 1) = 1:
3=0
Xl,s,rZs+q + 1:
s=p-q
Xl,s,rZs+ q

(6.26)

p-2
X1,r(z)zq modulo P (z) = 1:
3=0
(XI,<s-q).r - XI,<p_q_I),r)Z', Xl,p-l,r = 0, (6.27)

where the symbols <> define s - q and p - q - 1 modulo p. Thus, the polynomi-
al transforms are evaluated with additions and simple rotations of p-word
polynomials, and the only multiplications required to compute the two-dimen-
sional convolution Yu.l correspond to the calculation of one convolution of
length p and to the evaluation of the p one-dimensional polynomial products
T1(z)ilk(z)Xk(z) defined modulo P(z). This means that, if the polynomial products
and the convolution are evaluated with the minimum multiplication algorithms
defined by theorems 2.21 and 2.22, the convolution of size p x p, with p prime,
is computed with only 2pz - p - 2 multiplications. It can indeed be shown
that this is the theoretical minimum number of multiplications for a convolution
of size p x p, with p prime [6.3].

6.2 General Definition of Polynomial Transforms

In order to motivate the development of polynomial transforms, we have, until


now, restricted our discussion to polynomial transforms of size p x p, with p
156 6. Polynomial Transforms

prime. A much more general definition of polynomial transforms can be ob-


tained by considering a polynomial convolution of length N defined modulo a
polynomial P(z), where P(z) is no longer a factor of zP - 1, with p prime. Then,
it follows from the demonstration of the circular convolution property given in
Sect. 6.1 that a length-N circular convolution can be computed by polynomial
transforms of length N and root G(z) provided the three following conditions
are met:

- GN(Z) == 1 modulo P(z) (6.28)

- Nand G(z) have inverses modulo P(z) (6.29)

N-\ {O for q =t 0 modulo N


-S == L: [G(Z)]qk modulo P(z) ==
k=O N for q == 0 modulo N. (6.30)

With these three conditions, a polynomial convolution Y/{z) is computed by


polynomial transforms with
N-\
Y/(z) = L: Hm(z)X,_m(z) modulo P(z)
m=O
(6.31)

b-\
Hm(z) = L: hn.mzn
n=O
(6.32)

b-\
X,{z) L: x •. ,z·
= $=0 (6.33)

N-\
iik(z) == L: Hm(z)[G(z)]mk modulo P(z),
m=O
(6.34)

where b is the degree of P{z). Y/(z) is obtained by evaluating the inverse trans-
form of iik(z)Xk{z) by

(6.35)

The polynomial transforms have the same structure as DFTs, but with complex
exponential roots of unity replaced by polynomials G(z) and with all operations
defined modulo P(z). Therefore, these transforms have the same general proper-
ties as the DFTs (Sect. 4.1.1), and in particular, they have the linearity property,
with

{Hn{z)} + {Xn(z)} ~ {iik(z)} + {Xiz)} (6.36)

{AHn(z)} ~ {Aiik(z)}. (6.37)


6.2 General Definition of Polynomial Transforms 157

The principal application of polynomial transforms concerns the computation


of two-dimensional circular convolutions. In this case P{z) must be a factor of
ZN - I and it is desirable that G{z) be as simple as possible in order to avoid
multiplications in the calculation of the polynomial transforms. Since the factors
of ZN - 1 are the cyclotomic polynomials (Sect. 2.2.4) when the coefficients are
defined in the field of rationals, the various polynomial transforms suitable for
the computation of two-dimensional convolutions can be derived by exploiting
the properties of cyclotomic polynomials. We shall now show that when a two-
dimensional convolution has common factors in both dimensions, it is always
possible to define polynomial transforms which have very simple roots G{z) and
which can be computed without multiplications.

6.2.1 Polynomial Transforms with Roots in a Field of Polynomials

A circular convolution of size N X N can be represented as a polynomial


convolution of length N where all polynomials are defined modulo (ZN - 1), as
shown in (6.2-4). For coefficients in the field of rationals, ZN - 1 is the product
of d cyclotomic polynomials p.,{z), where d is the number of divisors e, of N,
including 1 and N, with
d
ZN - 1= II p.(z). (6.38)
1=1 '

The degree of each cyclotomic polynomial p.,(z) is ,pee,), where ,p{e,) is Euler's
totient function (Sect. 2.1.3). Since the various polynomials p. (z) are irreducible,
the polynomial convolution defined modulo (ZN - 1) can be computed sepa-
rately modulo each polynomial p.,(z), with reconstruction of the final result by
the Chinese remainder theorem. We show first that there is always a polynomial
transform of dimension N and root z which supports circular convolution when
defined modulo P••(z), the largest cyclotomic polynomial factor of ZN - 1.
This can be seen by noting that, since ZN == 1 modulo (ZN - 1) and P••{z) is a
factor of ZN - 1, ZN == 1 modulo P•• {z). Thus, condition (6.28) is satisfied. Con-
ditions (6.29) are also satisfied because N always has an inverse in ordinary
arithmetic (coefficients in the field of rationals) and because Z-I == ZN-I modulo
P••(z). Consider now the conditions (6.30). Since ZN == 1 moduloP••(z), we have
N-I
S == ~ zqk == N for q == 0 modulo N. For q =t= 0 modulo N, we have
k=O

(zq - l)S == zqN - 1 == 0 modulo P ••(z). (6.39)

The complex roots of zq - 1 are powers of e-j21t,q, while the complex roots of
P ••(z) are powers of e- j21<'N. Thus, for (q, N) = 1 these complex roots are differ-
ent and zq - 1 is relatively prime to P•.(z), which implies by (6.39) that S == O.
For (q, N) =1= 1, q can always, without loss of generality, be considered as a
factor of N. Then, rfi(N) > ,p(q), and the largest polynomial factors of ZN - 1
158 6. Polynomial Transforms

and of zq - 1 are, respectively, the polynomials P.,(z) and Q(z) of degree 1>(N)
and 1>(q). These polynomials are necessarily different because their degrees 1>(N)
and 1>(q) are different. Moreover, Q(z) cannot be a factor of P•.(z) because
P.,(z) is irreducible. Thus, zq - 1 $. 0 modulo P•.(z) and S == 0 modulo p •.(z),
which completes the proof that conditions (6.30) are satisfied.
Consequently, the convolution Yu.l of dimension N X N is computed by
ordering the input sequence into N polynomials of N terms which are reduced
d-I
moduloP•.(z) and modulopl(z) = II P.,(z). The output samples Yu,l are derived
1=1

by Chinese remainder reconstruction of a polynomial convolution YI,lz)


modulo P•.(z) and a polynomial convolution Y 2 ,I(Z) modulo Pl(Z). YI,lz) is com-
puted by polynomial transforms of root z and of dimension N. Y2 ,I(Z) is a
polynomial convolution of length N on polynomials of N - 1>(N) terms defined
modulo P1(z). At this point, two methods may be used to compute Y 2 ,I(Z). One
can either reduce Y 2 ,I(Z) modulo the various factors of PI(z) and define the cor-
responding polynomial transforms, or one can consider Y 2 ,I(Z) as a two-dimen-
sional polynomial product modulop1(z), ZN - 1.
In order to illustrate this last approach, we consider a two-dimensional
convolution of size pC X pC, where p is an odd prime. In this case, N = pC and 1>
(pc) is given by

1>(pC) = pC-I(p - 1). (6.40)

Then,

P.,(z) = zP'-I(P-I) + zp·-I(p-2) + ... + I (6.41 )

and

zP' - I = (Zp'-I - I)P•.(z). (6.42)

Hence YI,I is computed by polynomial transforms of length pC and root z de-


fined modulo P•.(z), while Y 2 ,l is a convolution of size pC X pc-I. This convolu-
tion can be viewed as a polynomial convolution of length pc-Ion polynomials
of pC terms which is in turn evaluated as a polynomial convolution of length
pc-l defined moduloP.,(z) and a convolution of length pc-I X pc-I. The poly-
nomial convolution defined moduloP.,(z) can be calculated by polynomial
transforms of length pc-l with root zP and the convolution of size pc-I X pc-I
can be evaluated by repeating the same process. Thus, the two-dimensional
convolution of size pC X pC can be completely mapped into a set of one-dimen-
sional polynomial products and convolutions by a c-stage algorithm using
polynomial transforms. We illustrate this method in Fig. 6.2 by giving the first
stage corresponding to a convolution of size p2 X p2. In this case, the second
stage represents a convolution of size p X P which is calculated by the method
shown in Fig. 6.1.
6.2 General Definition of Polynomial Transforms 159

REDUCTION MODULO

PfZ) = (ff-l)/(zP-l)

POLYNOMIAL
TRANSFORM
MODULOPjZ)
SIZE r. ROOT Z

r POLYNOMIAL
MULTIPLICATIONS
MODULOPjZ)

POLYNOMIAL
INVERSE POLYNOMIAL TRANSFORM
TRANSFORM MODULO MODULOPjZ)
PjZ) SIZE p • ROOT zP

SIZE r. ROOT Z

P POLYNOMIAL
MULTIPLICA TIONS .....1 - - -
MODULOPjZ)

INVERSE
POLYNOMIAL
TRANSFORM
MODULOPjZ)
SIZE P . ROOT zP

YII,I

Fig. 6.2. First stage of the computation of a convolution of size p2 x p2. p odd prime

As a second example, Fig. 6.3 gives the first stage of the calculation of a
convolution of size 2' x 2' by polynomial transforms. In this case, the compu-
tation is performed in t - 1 stages with polynomial transforms defined modulo
P,+l(Z) = Z2'-' + 1, P,(z) = Z2'-' + I, ... , Plz) = Z2 + 1. These polynomial
160 6. Polynomial Transforms

I I
I
ORDERING OF
POLYNOMIALS

I

X/Z)

+ I
REDUCTION MODULO

PI+I(Z)=Z
2,-1
+1 I REDUCTION MODULO
Z
],-1
-I

X2,1Z)
.. XI,IZ)
2' POLYNOMIALS OF 2,-1 TERMS
POLYNOMIAL
TRANSFORM I
REORDERING I
MODULO P,+lZ)

.
.2,-1 POLYNOMIALS OF 2' TERMS


2' . ROOT Z


SIZE

2' POL YNOMIAL


REDUCTION MODULO
P,+lZ)=Z2'-~ 1
I REDUCTION MODULO
],-1


Z -I

! -----.
MULTIPUCA TlONS
MODULO P,+lZ) TlZ)HI.IIZ)
+ t
+ POLYNOMIAL

.
TRANSFORM CONVOLUTION OF SIZE
INVERSE POLYNOMIAL MODULO P,./Z) 2'-1 xl,-I
TRANSFORM MODULO SIZE 2'-1 . ROOT Z2

.
PI+I(Z)
,I
SIZE 2' . ROOT Z
.....-
2,-1 POLYNOMIAL
MULTIPUCATIONS ..

MODULO PI+I(Z) TiZ)H2.lIZ)

INVERSE POLYNOMIAL
TRANSFORM MODULO
P,+lZ)
SIZE 2,-1 . ROOT Zl

~
.,.

REORDERING AND CHINESE
REMAINDER RECONSTRUCTION
•I

+
Yu.1

Fig. 6.3. First stage of the computation of a convolution of size 2' x 2' by polynomial trans-
forms

transforms are particularly interesting because, due to their power of two sizes,
they can be computded with a reduced number of additions by a radix-2 FFT-
type algorithm.
6.2 General Definition of Polynomial Transforms 161

6.2.2 Polynomial Transforms with Composite Roots

Previously, we have restricted our discussion to polynomial transforms having


roots G(z) defined in a field or in a ring of polynomials. Additional degrees of
freedom are possible by taking advantage of the properties of the field of coef-
ficients. If a length-NI polynomial transform supports circular convolution when
defined modulo P(z) with root G(z), it is possible to use roots of unity of order
N z in the field of coefficients for the definition of transforms oflength NINz which
also support circular convolution.
Assuming for instance that the coefficients are defined in the field of complex
numbers, we can always define DFTs of length N z and roots W = e-jZn/N, that
support circular convolution. In this case, if (NI> N z) = I, the polynomial trans-
form of root WG(z) defined modulo P(z) supports a circular convolution of
length N, for N = NINz•
This can be seen by verifying that the three conditions (6.28-30) are met.
We note first that, since W N, = 1 and G(Z)N, == 1 modulo P(z),

(6.43)

Condition (6.29) is also obviously satisfied, since NI> N z and W, G(z) have in-
verses. In order to meet condition (6.30), we consider S, with

N-I
S == L:
k-O
[WG(Z)]qk modulo P(z). (6.44)

Since (WG(Z))N == I modulo P(z), the exponents qk are defined modulo N. Thus,
S == N for q == 0 modulo N. For q $. 0 modulo N, we can always map S into a
two-dimensional sum, because NI and N z are mutually prime. This can be done
with

kz = 0, ... , N z - 1 (6.45)

N 2 -1 NJ-l
S == L: WqN,k, L: G(Z)qN,k, modulo P(z). (6.46)
k1-O k 1 =O

The existence of the two transforms of lengths N I and N z with roots G(Z) and W
implies that S == 0 for kl $. 0 modulo NI and kz $. 0 modulo N z, and therefore
that S == 0 for k $. 0 modulo N, which verifies that (6.30) is satisfied.
When NI is odd, the condition (NI> N z) = I implies that it is always possible
to increase the length of the polynomial transforms to NINz, with N z = 2t. The
new transforms will usually require some multiplications since the roots WG(z)
are no longer simple. We note, however, that this method is particularly useful
to compute convolutions of sizes 2NI X 2NI and 4NI X 4NI> because in these
162 6. Polynomial Transforms

cases, we have W = -lor W = -j and the polynomial transforms may still


be computed without multiplications.
A modified form of the method can be devised by replacing DFTs with
number theoretic transforms (Chap. 8) which amounts to defining the poly-
nomial coefficients modulo integers. In this case, it is possible to compute large
two-dimensional convolutions with a very small number of mUltiplications
provided that multiplications by powers of two and arithmetic modulo an inte-
ger can be implemented easily.
Table 6.1 summarizes the properties of a number of polynomial transforms
that can be computed without multiplications and which correspond to two-
dimensional convolutions such that both dimensions have a common factor.

Table 6. 1. Polynomial transforms for the compution of two-dimensional convolutions. P, Ph


P2 odd primes

Transform ring Transform Size of No. of additions Polynomial products


P(z) - - - - - convolutions for reductions, and convolutions
Length Root N x N polynomial
transforms, and
Chinese
remainder
reconstruction

(zl' - I)/(z - I) P z pxp 2(p3 + p2_ P products P(z)


5p + 4) 1 convolution P
(zl' - I)/(z - I) 2p -z 2p xP 4(p3 + 2p2- 2p products P(z)
6p + 4) 1 convolution 2p
(zll' - 1)/(Z2 - 1) 2p -zl'+'2p x 2p 8(p3 + 2p2 - 2p products P(z)
6p + 4) 1 convolution 2 x 2p
(Zl'l - I)/(zl' - I) P zl' P X p2 2(p' + 2p 3 - p products P(z)
4p2 - P + 4) p products (zl' - 1)/
(z - I)
1 convolution p
(Zl'l - 1)/(zl' - 1) p2 z p2 X p2 2(2p' + p'- pep + I) products P(z)
5p 3 + p2 + 6) p products (zl' - 1)/
(z - 1)
1 convolution p
(z 2l" - I)/(zll' - 1) 2p2 - Zl"+1 2p2 X 2p2 8(2p' + 2p' - 2p(p+ I) products P(z)
6 p 3 _ p2 + 2p products (zll' - 1)/
5p + 2) (Z2 - 1)
1 convolution 2 x 2p
(Zl'll'l - l)/(zl'. - 1) PI zl'· PI X PIP2 2plPl + p?- PI products P(z)
5p, + 4) 1 convolution PIP2
Z2'-1 +I 2' z 2' x 2' 2 2,-1(3t + 5) 3.2'-1 products P(z)
I convolution 2'-1 X
2'-1
6.3 Computation of Polynomial Transforms and Reductions 163

6.3 Computation of Polynomial Transforms and Reductions

The evaluation of a two-dimensional convolution by polynomial transforms


involves the calculation of reductions, Chinese remainder reconstructions, and
polynomial transforms. We specify here the number of additions required for
these operations in the cases corresponding to Table 6.1.
When N = p, p prime, the input sequences must be reduced modulo (z - I)
and modulo P(z), with P(z) = Zp-I + Zp-2 + ... + 1. Each of these operations is
done withp(p - I) additions, by summing all terms of the input polynomials for
the reduction modulo (z - I), as shown in (6.14) and by subtracting the last
word of each input polynomial to all other words, for the reduction modulo
P(z), as shown in (6.23). When one of the input sequences, hn,m, is fixed, the
Chinese remainder reconstruction is defined by (6.22). Since Y1,/(Z) is a poly-
nomial of p - 1 terms, it can be defined as

(6.47)

Thus, (6.22) becomes

Y1(z) = Y2,I - YI,O,I


p-2
+ .=1
I: (YI,.-I,I - YI,.,I + Y2,/)Z· + (YI,p-2,1 + Y2,/)ZP-1
1= 0, ... ,p - 1, (6.48)

which shows that the Chinese remainder reconstruction requires 2p(p - 1)


additions.
For N = p2, P prime, or for N = 2', the reductions and Chinese remainder
operations are performed similarly and we give in Table 6.2, column 3, the
corresponding numbers of additions.
For N = p, p prime, the polynomial transform Xk(z) defined by (6.15) is
computed by

k = 0, ... ,p - 2 (6.49)

(6.50)

where the p polynomials X1,,(z) have p - 1 terms. The computation of Xk(z)


proceeds by first evaluating Riz), with
p-I
Rk(z) == I: X1,,(Z)Z,k modulo P(z). (6.51)
7=1
164 6. Polynomial Transforms

Table 6.2. Number of additions for the computation of reductions, Chinese remainder opera-
tions, and polynomial transforms. p odd prime

Transform Ring Number of additions Number of additions for


size for reductions and polynomial transforms
Chinese remainder
operations

p (zP - I)f(z - I) 4p(p - I) pl _ p2 _ 3p + 4


2p (zP - I)f(z - I) 8p(p - I) 2(p' - 4p + 4)
2p (zlP - I)f(Z2 - I) 16p(p - I) 4(p' - 4p + 4)
P (Zp 2 - I)f(zP - I) 4p2(p - I) p(P' _ p2 _ 3p + 4)
p2 (Zp 2 - I)f(zP - I) 4p'(p - I) 2p s - 2p' - 5p' + 5p2 + p + 2
2p (Z2p2 - I)f(zlP - I) 16p2(p - 1) 4p(p' - 4p + 4)
2p2 (Z2 p2 - l)f(zlP - I) 16p'(p - I) 4(2p S - p' - 6pl + 5p2 + p + 2)
2' Z2'-1 +1 22'+1 t2 21 - 1

p-1
Since Ro(z) = L: X1,r(z), Ro(z) is computed with (p - 2)(p - 1) additions. For
k '* r=1

0, Riz) is the sum of p - 1 polynomials, each polynomial being multiplied


modulo P(z) by a power of z. Rk(z) is first computed modulo (zP - 1). Since
zP == 1, each polynomial X1,r(z) becomes, after multiplication by zr\ a poly-
nomial of p words where one of the words is zero. Thus, the computation of each
Rk(z) modulo (zP - 1) requires p2 - 3p + 1 additions. Since the reduction of
'* °
'*
Riz) modulo P(z) is performed with p - 1 additions, each Rk(z) for k and
k p - 1 is calculated with p2 - 2p additions. In order to evaluate (6.50), we
note that, for r 0, '*
p-2
L: zrk == _zr(p-l) modulo P(z). (6.52)
k-O

This implies that, for r '* °


p-2 p-I
Rp_l(z) == - L: L: X1,r(z)zrk modulo P(z) (6.53)
k=O r-I

p-2
== - L: Rk(z) modulo P(z), (6.54)
k=O

which shows that Rp_l(z) is calculated with (p - 1)(P - 2) additions. Finally,


Xiz) is obtained with pcp - 1) additions by adding X1,o(z) to the various poly-
nomials Rk(Z). Thus, a total of p3 - p2 - 3p + 4 additions is required to com-
pute a polynomial transform oflengthp, withp prime.
Polynomial transforms oflength N, with N composite, are computed by using
an FFT-type algorithm. For N = 2', the polynomial transform Xk(z) of dimen-
sion N is defined modulo P,+l(z) with
6.4 Two Dimensional Filtering Using Polynomial Transforms 165

N-\
Xk(Z) == I: X1.r(z)zrk modulo P,+I(Z)
r=O

k = 0, ... ,N-l (6.55)

and
P,+I(Z) = ZZ'-I + 1. (6.56)

The first stage ofradix-2, decimation in time FFT-type algorithm is given by

k = 0, ... , NI2 - 1 (6.57)

Thus, the polynomial transform of length 2' defined modulo(zz'-' + 1) is com-


puted in t stages with a total of (N 2 /2) 10glN additions.
We summarize in Table 6.2, column 4, the number of additions for various
polynomial transforms. Table 6.1, column 5, also gives the total number of ad-
ditions for reductions, polynomial transforms, and Chinese remainder recon-
struction corresponding to various two-dimensional convolutions evaluated
by polynomial transforms.

6.4 Two-Dimensional Filtering Using Polynomial Transforms

We have seen in the preceding sections that polynomial transforms map ef-
ficiently two-dimensional circular convolutions into one-dimensional polynomial
products and convolutions. When the polynomial transforms are properly
selected, this mapping is achieved without multiplications and requires only a
limited number of additions. Thus, when a two-dimensional convolution is
evaluated by polynomial transforms, the processing load is strongly dependent
upon the efficiency of the algorithms used for the calculation of polynomial
products and one-dimensional convolutions.
One approach that can be employed for evaluating the one-dimensional
convolutions involves the use of one-dimensional transforms that support circu-
lar convolution, such as DFTs and NTTs. These transforms can also be used
to compute polynomial products modulo cyclotomic polynomials p.,(z) by
noticing that, since p.(z), defined by (6.38), is a factor of ZN - 1, all computa-
tions can be carried o~t modulo (ZN - 1), with a final reduction modulo p.,(z).
With this method, the calculation of a polynomial product modulo p.,(z) is
replaced by that of a polynomial product modulo (ZN - 1), which is a convolu-
tion of length N. Hence, the two-dimensional convolution is completely mapped
166 6. Polynomial Transforms

by the polynomial transform method into a set of one-dimensional convolutions


that can be evaluated by DFTs and NTIs.
This approach is illustrated in Fig. 6.4 for a convolution of size p x p, with p
prime. In this case, the two-dimensional convolution is mapped into p + 1
convolutions of length p instead of one convolution of length p plus p poly-
nomial products modulo(zp - 1)/(z - 1) as with the method in Fig. 6.1.

POLYNOMIAL
TRANSFORM
MODULO (:d'-I)
ROOT Z • SIZE p

....-
I/q H1 •m

INVERSE
POLYNOMIAL
TRANSFORM
MODULO (:d' -I)
ROOT Z . SIZE p

REDUCTION MODULO

(zP-/)/(Z-I)

Fig. 6.4. Computation of a two-dimensional con-


volution of size p x p by polynomial transforms.
p prime. Polynomial products modulo (z" - 1)/
Yu,/ (z - 1) are replaced by convolutions of length p

6.4.1 Two-Dimensional Convolutions Evaluated by Polynomial Transforms and


Polynomial Product Algorithms

When two-dimensional convolutions are mapped into one-dimensional con-


volutions which are evaluated by DFTs and NTIs, the problems associated with
6.4 Two Dimensional Filtering Using Polynomial Transforms 167

the use of these transforms such as roundoff errors for DFTs or modular arith-
metic for NTIs are then limited to only a part of the total computation process.
We have seen, however, in Chap. 3, that the methods based on interpolation and
on the Chinese remainder theorem yield more efficient algorithms than the DFTs
or NTIs for some convolutions and polynomial products. It is therefore often
advantageous to consider the use of such algorithms in combination with poly-
nomial transforms.
With this method, each convolution or polynomial product algorithm used
in a given application must be specifically programmed, and it is desirable to
use only a limited number of different algorithms in order to restrict total
program size. This can be done by computing the one-dimensional convolutions

P REDUCTIONS
MODULO
p(Z)=(zP-1 )/(Z-J)

POLYNOMIAL
TRANSFORM
MODULO!'(Z)
ROOT Z • SIZE p

p POL YNOMIAL
1 POLYNOMIAL
MULTIPLICATIONS MULTIPLICATION 1 MULTIPLICATION
MODULO !'(Z) MODULOP(Z)

INVERSE
POLYNOMIAL
TRANSFORM
MODULO !,(Z)
ROOT Z . SIZE p

Fig. 6.5. Computation of a convolution of


size p x p by polynomial transforms. p
prime. The convolution of length p is re-
placed by one multiplication and one poly-
Yu.l nomial product modulo P(z)
168 6. Polynomial Transforms

as polynomial products by using the Chinese remainder theorem. For a two-


dimensional convolution of dimensionp x p, withp prime, the two-dimensional
convolution is mapped by polynomial transforms into one convolution of
lengthp plus p polynomial products modulo P(z), with P(z) = (zp - l)/(z - 1).
However, the same computation can also be done with only one polynomial
product algorithm modulo P(z) if the circular convolution of length p is calcu-
lated as one multiplication and one polynomial product algorithm moduloP(z)
by using the Chinese remainder theorem. Thus, the convolution of size p x p
can be computed as shown in Fig. 6.5 with p + 1 polynomial products modulo
P(z) and one multiplication instead of p polynomial products modulo P(z) and
one convolution of length p.
Table 6.3 gives the number of arithmetic operations for various two-dimen-
sional convolutions computed by polynomial transforms using the convolution
and polynomial product algorithms of Sect. 3.7 for which the number of oper-
ation is given in Tables 3.1 and 3.2. These data presume that one of the input
sequences hn ... is fixed and the operations on this sequence are done only once.

Table 6.3. Number of operations for two-dimensional convolutions computed by polynomial


transforms and polynomial product algorithms

Convolution size Number of Number of Multiplications Additions per


multiplications additions per point point

3 x 3 (9) 13 70 1.44 7.78


4 x 4 (16) 22 122 1.37 7.62
5 x 5 (25) 55 369 2.20 14.76
6 x 6 (36) 52 424 1.44 11.78
7 x 7 (49) 121 1163 2.47 23.73
8 x 8 (64) 130 750 2.03 11.72
9 x 9 (81) 193 1382 2.38 17.06
10 x 10 (100) 220 1876 2.20 18.76
14 x 14 (196) 484 5436 2.47 27.73
16 x 16 (256) 634 4774 2.48 18.65
18 x 18 (324) 772 6576 2.38 20.30
24 x 24 (576) 1402 12954 2.43 22.49
27 x 27 (729) 2893 21266 3.97 29.17
32 x 32 (1024) 3658 24854 3.57 24.27
64 x 64 (4096) 17770 142902 4.34 34.89
128 x 128 (16384) 78250 720502 4.78 43.98

6.4.2 Example of a Two-Dimensional Convolution Computed by Polynomial


Transforms

In order to illustrate the computation of a two-dimensional convolution by


polynomial transforms, we consider a simple convolution of size 3 X 3. In this
6.4 Two Dimensional Filtering Using Polynomial Transforms 169

case, we have

p=3 P(z) = ZZ +Z+ 1

The input sequences are defined by

hn,o = {2, 3, I} X.,o = {2, 1, 2}


hn,l = {4, 2, O} X.,l = {I, 3, O}
hn,z = {3, I, 4} x"z = {2, I, 5} .
These sequences become, after reduction modulo P(z)

H1,n(z): {I,2} X1,,(z): to, -I}


{4,2} {I, 3}
{-I, -3} {-3, -4}

The polynomial transforms of H1,n(z) and X1,,(z) are given by

iik(z): {4, I} Xk(z): {-2, -2}


{-3,5} {-4,O}
{2,O} {6, -I}

We note that T1(z), given by (6.21) is

T1(z) = -(z + 2)/3.


Thus, the polynomial mUltiplications are defined by

-(z + 2)ii (z)X (z)/9 moduloP(z):


k k { 4, l4} /9
{-44, 8} /9
{-26, -to} /9

Computing the inverse polynomial transform of this result yields

Y1,I(Z): {-66, I2} /9


{66, 42} /9
{12, -I2}/9.

The reductions modulo (z - 1) are given by

HZ,m: {6, 6, 8} X Z,,: {5, 4, 8},

which yields the length-3 convolution YZ,I


170 6. Polynomial Transforms

Y2 ,1/3: {110, 118, 1I2} 13.

The two-dimensional convolution output Y.,I is then generated by the Chinese


remainder reconstruction of Y\,I(Z) and Y 2 ,I using (6.22),

Y.,I: {44, 28, 38, 32,42,44,36,40, 36}.

6.4.3 Nesting Algorithms

A systematic application of the techniques discussed in the preceding sections


allows one to compute large two-dimensional convolutions by use of composite
polynomial transforms. In particular, the use of polynomial transforms oflength
N, with N = 2', is especially attractive because these transforms can be com-
puted with a reduced number of additions by using a radix-2 FFT-type algo-
rithm.
A nesting algorithm may be devised as an alternative to this approach to
construct large two-dimensional convolutions from a limited set of short two-
dimensional convolutions (Sect. 3.3.1) [6.4]. Using this method, a two-dimen-
sional convolution Yu,l of size N\N2 X N\N2' with N\ and N2 mutually prime,
can be converted into a four-dimensional convolution of size (N\ X N\)
X (N2 X N 2) by a simple index mapping. Yu,l is defined as

(6.59)

Since N\ and N2 are mutually prime, the indices I, m, n, and u can be mapped
into two sets of indices II> ffll> nl> u\ and 12, m2, n2, U2 by use of an approach based
on permutations (Sect. 2.1.2) to obtain

1== Nd2 + N2/\ moduloN\N2


m == N\m2 + N 2m\ modulo N\N2
n == N\n2 + N 2n\ modulo N\N2 , II> ml> nh u\ = 0, ... , N\ - 1
u == N\uz + N 2 u\ modulo N\N2 , 12, m 2, n2, U2 = 0, ... , N z - 1 (6.60)

and

XN,(.,-.,)+N,(u,-.,),N,{/,-m,)+N,{/,-m,)· (6.61)

This four-dimensional convolution can be viewed as a two-dimensional convolu-


tion of size N\ X N\ where all the scalars are replaced by the two-dimensional
polynomials of size N2 X N 2, H."m,(Zh Z2), and Xr".,(Zh Z2) with
6.4 Two Dimensional Filtering Using Polynomial Transforms 171

'10 SI = 0, ... , NI - 1 (6.63)

and Yu.1 defined by


N,,-l N 2 -1
Yu,.I,(Zh zz) = ~ ~ YN,u,+N,u,.N,I,+N,I, z~, z~, (6.64)
",,=0 12 =0

N,-I N,-I
Yu,.I,(Zh zz) == ~ ~ Hn,.m,(Zh zZ)Xu,-n,.I,-m,(Zh zz)
ml=O nl=O

modulo (zr' - 1), (z~, - 1). (6.65)

Each polynomial multiplication modulo (zf. - 1), (zf'. - 1) corresponds to a


convolution of size N z X N z which is computed with Mz scalar multiplications
and A z scalar additions. In the convolution of size NJ X Nb all scalars are re-
placed by polynomials of size N z X N z• Thus, if MJ and Al are the numbers of
multiplications and additions for a scalar convolution of size NJ X Nh the num-
ber of multiplications M and additions A required to evaluate the two-dimen-
sional convolution of size NJNz X NINz becomes

(6.66)

(6.67)

This computation process may be extended recursively to more than two factors
provided that all these factors are mutually prime. In practice, the small con-
volutions of size NJ X NJ and N z X N z are computed by polynomial trans-
forms, and large two-dimensional convolutions can be obtained from a small set
of polynomial transform algorithms. A convolution of size 15 X 15 can, for
instance, be computed from convolutions of sizes 3 X 3 and 5 X 5. Since these
convolutions are calculated by polynomial transforms with 13 multiplications,
70 additions and 55 multiplications, 369 additions, respectively (Table 6.3),
nesting the two algorithms yields a total of 715 multiplications and 6547 addi-
tions for the convolution of size 15 X 15.
Table 6.4 itemizes arithmetic operations count for two-dimensional convolu-
tions computed by polynomial transforms and nesting. It can be seen that these
algorithms require fewer mUltiplications and more additions per point than for
the approach using composite polynomial transforms and corresponding to
Table 6.3. The number of additions here can be further reduced by replacing the
172 6. Polynomial Transforms

Table 6.4. Number of operations for two-dimensional convolutions computed by polynomial


transforms and nesting

Convolution size Number of Number of Multiplications Additions per


multiplications additions per point point

12 x 12 (144) 286 2638 1.99 18.32


20 x 20 (400) 946 14030 2.36 35.07
30 x 30 (900) 2236 35404 2.48 39.34
36 x 36 (1296) 4246 40286 3.28 31.08
40 x 40 (1600) 4558 80802 2.85 50.50
60 x 60 (3600) 12298 192490 3.42 53.47
72 x 72 (5184) 20458 232514 3.95 44.85
80 x 80 (6400) 27262 345826 4.26 54.03
120 x 120 (14400) 59254 1046278 4.11 72.66

Table 6.5. Number of operations for two-dimensional convolutions computed by polynomial


transforms and split nesting

Convolution size Number of Number of Multiplications Additions per


multiplications additions per point point

12 x 12 (144) 286 2290 1.99 15.90


20 x 20 (400) 946 10826 2.36 27.06
30 x 30 (900) 2236 28348 2.48 31.50
36 x 36 (1296) 4246 34010 3.28 26.24
40 x 40 (1600) 4558 69534 2.85 43.46
60 x 60 (3600) 12298 129106 3.42 35.86
72 x 72 (5184) 20458 192470 3.95 37.\3
80 x 80 (6400) 26254 308494 4.10 48.20
120 x 120 (14400) 59254 686398 4.11 47.67

conventional nesting by a split nesting technique (Sect. 3.3.2). In this case, the
number of arithmetic operations becomes as shown in Table 6.5. It can be seen
that the number of additions per point in this table is comparable to that
obtained with large composite polynomial transforms.

6.4.4 Comparison with Conventional Convolution Algorithms

Polynomial transforms are particularly suitable for the evaluation of real con-
volutions because they then require only real arithmetic as opposed to complex
arithmetic with a OFT approach. Furthermore, when the polynomial products
are evaluated by polynomial product algorithms, the polynomial transform
approach does not require the use of trigonometric functions. Thus, the compu-
tation of two-dimensional convolutions by polynomial transforms can be com-
pared to the nesting techniques [6.4] described in Sect. 3.3.1 which have similar
6.5 Polynomial Transforms Defined in Modified Rings 173

characteristics. It can be seen, by comparing Table 3.4 with Tables 6.3 and 6.5,
that the polynomial transform method always requires fewer arithmetic opera-
tions than the nesting method used alone, and provides increased efficiency with
increasing convolution size. For large convolutions of sizes greater than 100 X
100, the use of polynomial transforms drastically decreases the number of
arithmetic operations. In the case of a convolution of 120 X 120, for example,
the polynomial transform approach requires about 5 times fewer multiplications
and 2.5 times fewer additions than the simple nesting method.
When a convolution is calculated via FFT methods, the computation requires
the use of trigonometric functions and complex arithmetic. Thus, a comparison
with the polynomial transform method is somewhat difficult, especially when
issues such as roundoff error and the relative cost of ancillary operations are
considered. A simple comparative evaluation can be made between the two
methods by assuming that two real convolutions are evaluated simultaneously
with the Rader-Brenner FFT algorithm (Sect. 4.6) and the row-column method.
In this case, the number of arithmetic operations corresponding to convolutions
with one fixed sequence and precomputed trigonometric functions is listed in
Table 4.7. Under these conditions, which are rather favorable to the FFT ap-
proach, the number of additions is slightly larger than that of the polynomial
transform method while the number of multiplications is about twice that of the
polynomial transform approach. Conventional radix-4 FFT algorithms or the
Winograd Fourier transform method would also require a significantly larger
number of arithmetic operations than the polynomial transform method.

6.5 Polynomial Transforms Defined in Modified Rings

Of all possible polynomial transforms, the most interesting are those defined
modulo (ZN + 1), with N = 2', because these transforms are computed without
multiplications and with a reduced number of additions by using a radix-2 FFT-
type algorithm. We have seen, in Sect. 6.2.1, that large two-dimensional con-
volutions are computed with these transforms by using a succession of stages,
where each stage is implemented with a set of four polynomial transforms. This
approach is very efficient from the standpoint of the number of arithmetic
operations, but has the disadvantage of requiring a number of reductions and
Chinese remainder reconstructions. In the following, we shall present an inter-
esting variation [6.5] in which a simplification of the original structure is obtained
at the expense of increasing the number of operations.
In order to introduce this method, we first establish that a one-dimensional
convolution YI oflength N, with N = 2', can be viewed as a polynomial product
modulo (ZN + 1), provided that the input and output sequences are multiplied
by powers of W, where W is a root of unity of order 2N. We consider the
circular convolution YI defined by
174 6. Polynomial Transforms

1= 0, ... , N - 1, (6.68)

where I - n is defined modulo N. The input sequences are multiplied by Wn and


wm, with W = e-J"IN, and organized as two polynomials
(6.69)

N-i
X(z) = L:
m=O
xmWmz m. (6.70)

We now define a polynomial A(z) oflength N

A(z) == H(z)X(z) modulo (ZN + 1) (6.71)

(6.72)

where each coefficient al of A(z) corresponds to the products hnxm such that
n+ m = I or n + m = I + N. Since ZN == -1, we have

(6.73)

where I - n is not taken modulo N. Hence, with WN = -1,

(6.74)

which shows that Yl is obtained by simple multiplications of A(z) by W-l. Note


that this method is quite general and converts a convolution into a polynomial
product modulo (ZN + 1), sometimes called skew circular convolution, provided
that the input and output sequences are multiplied by powers of any root of
order 2N. Such roots need not be e-J"IN and can, for instance, be roots of unity
defined in rings of numbers modulo an integer.
We now apply this method to the computation of two-dimensional convolu-
tions. We begin again with a two-dimensional circular convolution Yu.l of size
N X N, where N = 2',

N-i N-i
Yu,l L: L:
= m=O n=O
hn.m Xu-n.l-m u, 1= 0, ... , N - 1. (6.75)

In polynomial notation, this convolution becomes


N-i
Al(Z) == L:
m=O
Hm(z)XI_m(z) modulo (ZN + 1) (6.76)
6.5 Polynomial Transforms Defined in Modified Rings 175

N-I
Hm(z} = 1:
,,-0
h",m W"z", W = e- J7C1N (6.77)

N-I
1: X.,r W'z',
Xr(Z} =
.-0 m, r = 0, ... , N- 1 (6.78)

N-I
A,(z} = 1:
,,-0
a""z", 1= 0, ... ,N-l (6.79)

Y",' = a"" W-". (6. 80}

The most important part of the calculations corresponds to the evaluation of the
polynomial convolution A,(z} defined modulo (ZN + I) corresponding to (6.76).
We note, however, that we can always define a polynomial transform oflength

POLYNOMIAL POLYNOMIAL
TRANSFORM TRANSFORM
MODULO (ZN+J) MODULO (ZN+ J)
SIZE N • ROOT ZZ SIZE N • ROOT ZZ

N POLYNOMIAL
'------t~ MULTIPLICATIONS
MODULO (ZN+ J)

INVERSE POLYNOMIAL
TRANSFORM MODULO
(ZN+ I )
SIZE N . ROOT ZZ

-+-- W- u

Fig. 6.6. Computation of a convolution of


size N x N, with N = 2', by polynomial
Yu.l transforms defined in modified rings
176 6. Polynomial Transforms

N modulo (ZN + 1). Hence, the two-dimensional convolution YM,I can be com-
puted with only three polynomial transforms of length N and roots Z2, as shown
in Fig. 6.6. When one of the input sequences, hn,m, is fixed, the corresponding
polynomial transform needs be computed only once and the evaluation of Yu,l

REDUCTION REDUCTION
MODULO (ZNI2+ 1) MODULO (ZNI2_l)

POLYNOMIAL
TRANSFORM
MODULO (ZNl2+l)
SIZE N . ROOT Z
POLYNOMIAL
TRANSFORM
MODULO (ZNI2+l)
N POLYNOMIAL
PRODUCTS SIZE N , ROOT Z

MODULO (ZNI2+1)

N POLYNOMIAL
--+ PRODUCTS
INVERSE MODULO (ZNI2+ 1)
POLYNOMIAL
TRANSFORM
MODULO (ZNl2+ 1) INVERSE
SIZE N • ROOT Z POLYNOMIAL
TRANSFORM
MODULO (ZNl2+1)
SIZE N • ROOT Z

Fig. 6.7. Computation of a convolution of size


N x N, with N = 2', by combining the two poly-
Yu.1 nomial transform methods
6.6 Complex Convolutions 177

reduces to the computation of 2 polynomial transforms oflength N, N polynomi-


al multiplications modulo (ZN + I), and 2N scalar multiplications by W· and
w-u.
Hence the overall structure of the algorithm is very simple, and all reductions
and Chinese remainder operations have been eliminated at the expense of the
multiplications by W· and W-u, by using polynomial multiplications modulo
(ZN + 1) instead of modulo (ZNI2 + I), (ZNI4 + I) .... As with the method cor-
responding to Fig. 6.3, the polynomial transforms are computed here with a
reduced number of additions by using a FFT-type radix-2 algorithm.
The approach based on modified rings of polynomials can be improved by
combining it with the method described in Sect. 6.2.1. This may be done by
computing the convolution Yu,l as a polynomial convolution of N terms defined
modulo (ZNI2 + I) plus a polynomial convolution of N terms defined modulo
(ZNI2 - 1), with the latter polynomial convolution converted into a polynomial
convolution modulo (ZNI2 + 1) by multiplications by W' and W-u, with W =
e- i2niN• In this case, the algorithmic structure, as shown in Fig. 6.7, uses 4 length-
N polynomial transforms defined modulo (ZNI2 + 1) and only N multiplications
by W· and W-U instead of 2N mUltiplications as with the preceding method.
Moreover, this algorithm replaces the computation of N polynomial multipli-
cations modulo (ZN + I) by that of 2N polynomial multiplications modulo
(ZNI2 + 1), which is obviously simpler.
A further reduction in number of operations could also be obtained by using
additional stages with reductions modulo (ZNI4 + +
I), (ZNI8 I), ... and complete
decomposition would yield the original scheme of Fig. 6.3. Thus, the translation
of polynomial products defined modulo (ZN - 1) into polynomial products
defined modulo (ZN + 1) by roots of unity defined in the field of coefficients
provides considerable flexibility in trading structural complexity for computa-
tional complexity.

6.6 Complex Convolutions

For complex convolutions, the polynomial transform approach can be imple-


mented with two real multiplications per complex multiplication by taking
advantage of the fact that j = ,J=T is real in certain fields. In particular, for
rings of polynomials defined modulo (ZN + 1), with N even, ZN == -I and j =
,J=T == ZNI2. Thus, the method described in Sect. 3.3.3 can be used to replace
a complex two-dimensional convolution with two real two-dimensional convolu-
tions. Consequently, a complex two-dimensional convolution can be computed
by polynomial transforms with about twice the computation load of a real
convolution and the relative efficiency of the polynomial transform approach
compared with the FFT method is about the same as for real convolutions.
178 6. Polynomial Transforms

6.7 Multidimensional Polynomial Transforms

Multidimensional polynomial transforms can be defined in a way similar to one-


dimensional polynomial transforms. In order to support, for instance, the
computation of a three-dimensional convolution Yu.l,.l, of size p X P X p, with
p prime, we redefine (6.2-5) as

p-t
Hm,.m,(z) = ~hn.m,.m, zn, mlo m2 = 0, ... ,p - 1 (6.81)
n=O

p-t
X"",(Z) = ~ X"'I'" z',
8=0
'10 '2 = 0, ... , p - 1 (6.82)

110 12 = 0, ... , p - 1 (6.83)


p-t
Y1,.1,(Z) = ~YU.l .. l, ZU (6.84)
u=o

u = 0, ... ,p - 1. (6.85)

The two-dimensional polynomial transform is defined modulo P(Z) , with


P(z)= (zp - I)/(z - I), by the expression

klo k2 = 0, ... , p - I, (6.86)

where

(6.87)

with a similar definition for the inverse transform. The two-dimensional poly-
nomial transform in (6.86) supports circular convolution because z is a root of
order N in the field of polynomials modulo P(z). Hence, it may be used to com-
pute the polynomial convolution Y t •1,• 1, (z) with

Yt.1,.1,(Z) == Y1,.1,(Z) modulo P(z). (6.88)

Y1,.1,(Z) can be obtained from Yt.1,.1,(Z) by a Chinese remainder reconstruction


with

Y2.1, .1, == Y1,.1,(Z) modulo (z - I) (6.89)


6.7 Multidimensional Polynomial Transforms 179

(6.90)

(6.91)

(6.92)

and

Y1,,1,(Z) == Yt ,l\,l,(Z)Sj(Z) + Y2,l\,l,S2(Z) modulo P(z) (6.93)

with

Sj(z) : I, S2(Z) == 0 modulo P(z)


{
Sj(z) = 0, S2(Z) == I modulo (z - I). (6.94)

Thus, the convolution of size p X P X P is mapped by polynomial transforms


into p2 polynomial products modulo P(z) and one convolution of size p X p. A
diagram of the computation process is shown in Fig. 6.8. It should be noted that
the convolution of size p X P can, in turn, be computed by polynomial trans-
forms oflengthp as indicated in Fig. 6.1.
The two-dimensional polynomial transforms can be calculated as 2p poly-
nomial transforms of length p by the row-column method, with 2p(p3 - p2 - 3p
+ 4) additions. Under these conditions, the convolution of size p X P X P is evalu-
ate<.: with 4p4 + 2p3 - 14p2 + 6p + 8 additions, p(p + I) polynomial products
modulo(zP - 1)/(z - 1), and one convolution of length p. Using the same
technique, a convolution of size p X P X P X P would be computed with 6p 5 +
2p4 - 20p3 + lOp2 + 6p + 8 additions, p3 + p2 + P polynomial multiplica-
tions modulo(zp - I)/(z - I), and one convolution of length p. We give in
Table 6.6 the number of operations for some multidimensional convolutions
computed by polynomial transforms, which will be used for the calculation of
DFTs by the method described in Sect. 7.2.

Table 6.6. Number of operations for short multidimensional convolutions computed by


polynomial transforms

Convolution size Total number of Total number of


additions multiplications

3x 3x 3 40 325
3x3x3x3 121 1324
6x 6 x 6 320 3896
6x6x6x6 1936 31552
180 6. Polynomial Transforms

POLYNOMIAL
TRANSFORM
MODULOP(Zj
SIZE pxp . ROOT Z

I
I
I

'+--
I
"z POLYNOMIAL
PRODUCTS
MODULOP(Zj

INVERSE
POLYNOMIAL
TRANSFORM
MODULOP(Z)
SIZE pxp • ROOT Z

Fig. 6.8. First stage of the computation of a con-


volution of size p x p x p by polynomial trans-
forms. p odd prime

A similar approach can be employed to develop multidimensional polynomi-


al transforms from one-dimensional polynomial transforms of length N, with N
composite.
7. Computation of Discrete Fourier Transforms by
Polynomial Transforms

As indicated in the previous chapter, polynomial transforms can be used to


efficiently map multidimensional convolutions into one-dimensional convolu-
tions and polynomial products. In this chapter, we shall see that polynomial
transforms can also be used to map multidimensional DFTs into one-dimen-
sional DFTs. This mapping is very efficient because it is accomplished using
ordinary arithmetic without multiplications, and because it can be implemented
by FFT-type algorithms when the dimensions are composite.
This method, which is significantly simpler than the conventional multi-
dimensional DFT approaches such as the row-column method or the Winograd
algorithm' applies only to DFTs having common factors in several dimensions.
For one-dimensional DFTs or for multidimensional DFTs having no common
factors in several dimensions, we show that polynomial transforms can still be
used to reduce the amount of computation by converting the DFTs into multi-
dimensional correlations and by evaluating these multidimensional correlations
with polynomial transforms.
In practice, both polynomial transform methods significantly decrease the
number of arithmetic operations and can be combined to define procedures
with optimum efficiency for large multidimensional DFTs that have common
factors in several dimensions.

7.1 Computation of Multidimensional DFTs by Polynomial


Transforms

We consider a two-dimensional DFT, Jlk I'


k'
,
of size N X N

kl> k2 = 0, ... , N - 1, j=,J-l. (7.1)

The conventional row-column method (Sect. 4.4) computes Jlkl •k , as N one-


dimensional DFTs along dimension k2 and None-dimensional DFTs along
dimension k!. Hence this method maps Jlkl •k , into 2N DFTs of length N and, if
M! is the number of complex multiplications required to compute the length-N
DFT, the total number M of multiplications corresponding to Jlkl •k , is M =
2NM!. The DFT Jlk, •k , can also be evaluated by use of the Winograd nesting
182 7. Computation of Discrete Fourier Transforms by Polynomial Transforms

algorithm (Chap. 5) as a OFT of size N in which each scalar is replaced by a


vector of N terms and each multiplication is replaced by a OFT of length N. In
this case, M = Mr, and the performance of the nesting method is better than the
row-column approach when Ml < 2N.
We shall now show that the number of multiplications is significantly de-
creased when Xk,.k, is mapped into a set of one-dimensional OFTs by a poly-
nomial transform method.

7.1.1 The Reduced DFT Algorithm

In order to compute Xk"k, by polynomial transforms [7.1, 2], we shall represent


this OFT in polynomial algebra by replacing (7.1) with the following set of three
equations:

(7.2)

(7.3)

kl' k z = 0, ... , N - 1. (7.4)

It can easily be verified that (7.2-4) are equivalent to (7.1) by noting that the
definition of (7.4) modulo (z - Wk,) is equivalent to substituting Wk, for z in
(7.2) and (7.3). It should also be noted that although the definition of Xk/z)
modulo (ZN - 1) is superfluous at this stage, it is valid, since ZN == WNk, = 1.
In order to simplify the presentation, we assume now that N is an odd prime,
with N = p. Thus, zP - I is the product of two cyclotomic polynomials

zP - 1 = (z - I)P(z) (7.5)

P(z) = Zp-l + zp-z + ... + 1. (7.6)

For k z == 0, we have z == I, and Xk"o is a simple OFT of length N obtained by


reducing Xn,(z) modulo (z - I), with

(7.7)

For k z =t=. °
modulo p, Wk, is always a root of P(z), since

(7.8)

and Xk"k, may be obtained by substituting Wk, for z in (7.2). Since z - Wk, is a
factor of P(z) and P(z) is a factor of zP - I, (7.4) becomes
7.1 Computation of Multidimensional DFTs by Polynomial Transforms 183

Xk,.k, == UXk,(z) modulo (zp - I)J modulo P(z)} modulo (z - Wk,). (7.9)

Hence, for k2 $. 0, (7.2-4) reduce to

kl = 0, ... ,p - 1 (7.10)

(7.11)

k2 = 1, ... , p - 1. (7.12)

Since p is an odd prime and k2 =1= 0, the permutation k2kl modulo p maps all
values of kJ, and we obtain, by replacing kl with k2kJ,
_ p-l
Xk,k,(Z) == L: X!,(z) Wk,.,k, modulo P(z) (7.13)
"1=0

Xu k
2 I' 2.
== Xk k (z) modulo (z - Wk,).
2 I
(7.14)

Xk,k,.k, is obtained by replacing z by Wk, in (7.14). Therefore, we can substitute


z for Wk, in (7.13), with

_ p-l
Xk,k,(Z) == L: X!,(Z)Z·,k, modulo P(z), (7.15)
7.1 1 =0

where the right-hand side of the equation is independent of k 2 • Xl,k,(z) is recog-


nized as a polynomial transform of length p which is computed without multi-
plications, with only p3 - p2 - 3p + 4 additions. Hence, the only multiplica-
tions required for evaluating the DFT of size p X P are those corresponding to
(7.14) and to the length-p D FT defined by (7.7).
In order to specify the operations corresponding to (7.14), we note that the
p polynomials Xl,k,(z) are defined modulo P(z) and are therefore of degree
p - 2. Hence these polynomials can be represented as

_ p-2
Xl 1k (z) =
1
L: Yk
1=0 I'
I Zl. (7.16)

Ifwe substitute Xkk


, ,(z) defined by (7.16) into (7.14), we obtain

k2 = I, ... , p - 1, (7.17)

which represents p DFTs of p terms corresponding to the p values of k 1• This


means that the polynomial transform maps without multiplications a DFT of
size p X p, with p prime, into p + 1 DFTs of p terms, instead of 2p DFTs of p
184 7. Computation of Discrete Fourier Transforms by Polynomial Transforms

terms with the row-column method. Thus, the number M of multiplications


becomes

M = (p + I)Mh (7.18)

where Ml is the number of complex multiplications corresponding to a length-p


OFT. This polynomial transform approach is illustrated in Fig. 7.1.

REDUCTION MODULO
P(Z) = (z.P -1 )/(Z-I)

POLYNOMIAL
TRANSFORM
MODULO P(Z)
ROOT Z . LENGTH p

p REDUCED DFTs
(p CORRELATIONS
OF p-l POINTS)

Fig. 7.1. Computation of a OFf of size p x p by polynomial transforms. p prime. Reduced


OFf algorithm

The number of multiplications can be further reduced by noting that, in the
p DFTs defined by (7.17), the last input term is equal to zero and the first output
term is not computed. The simplification of (7.17) is based upon the fact that,
for k_2 ≠ 0 and p prime,

Σ_{l=1}^{p-1} W^{k_2l} = -1.                                      (7.19)

Hence (7.17) can be rewritten as

k_2 = 1, ..., p - 1,                                             (7.20)

where

l = 1, ..., p - 2.                                               (7.21)

In the DFTs defined by (7.20), the first input and output terms are missing. Thus,
these DFTs are usually called reduced DFTs. They can be computed as corre-
lations of length p - 1 by using Rader's algorithm [7.3] (Sect. 5.2). In this case,
if g is a primitive root modulo p, l and k_2 are redefined by

l ≡ g^u modulo p
k_2 ≡ g^v modulo p,     u, v = 0, ..., p - 2.                    (7.22)

Under these conditions, the reduced DFT (7.20) is converted into a correlation

                                                                 (7.23)

The sequence y'_{k_1,l} can be constructed from the sequence y_{k_1,l} without additions
by noting that it is equivalent to a multiplication of X_{k_2k_1}(z) by z^{-1} modulo
(z^p - 1), followed by a reduction modulo P(z) and a multiplication by z. In
practice, the multiplication by z^{-1} may be combined with the ordering of the
input polynomials. The reduction modulo P(z) can, therefore, be executed without
additions as part of the computation of the polynomial transform. We have
seen in Sect. 5.2 that when a length-p DFT, with p prime, is computed by Rader's
algorithm, the calculations reduce to one correlation of p - 1 terms plus one
scalar multiplication. Thus, if the conventional DFT is computed with M_1 com-
plex multiplications, the corresponding reduced DFT defined by (7.23) is calcu-
lated with M_1 - 1 complex multiplications and the number M of complex
multiplications required to evaluate the DFT of size p x p by polynomial
transforms reduces to

M = (p + 1)M_1 - p.                                              (7.24)

This is about half the number of multiplications corresponding to the row-
column method and always less (except for p = M_1) than the number of multi-
plications required by the Winograd algorithm. When the DFTs and reduced
DFTs of size p are evaluated by Rader's algorithm, all complex multiplications
reduce to multiplications by pure real or pure imaginary numbers and can be
implemented with only two real multiplications. In this case, the number of real
multiplications required to evaluate the DFT of size p x p by polynomial
transforms becomes 2(p + 1)M_1 - 2p.

7.1.2 General Definition of the Algorithm

In the foregoing, we have restricted our discussion to DFTs of size p x p, with
p prime. A similar polynomial transform approach can also be applied to DFTs
of size N x N, with N composite, by developing algorithms as in Sect. 6.2.1,
with polynomial transforms defined modulo the various cyclotomic polynomials
P_i(z) which are factors of z^N - 1 [7.1]. Since the most important form of the
general algorithm concerns transforms corresponding to N = 2^t, we shall
restrict detailed discussion to this case, and simply give a summary of results for
other cases of interest.
Assuming N is a power of 2, with N = 2^t, we represent once again the DFT
X_{k_1,k_2} of size N x N by a set of three polynomial equations

                                                                 (7.25)

                                                                 (7.26)

                                                                 (7.27)

Since N is even, z^N - 1 is the product of the two polynomials z^{N/2} - 1 and
z^{N/2} + 1. The complex roots of z^{N/2} + 1 are W^{k_2}, for k_2 odd, and we have

z^{N/2} + 1 = Π_{k_2 odd} (z - W^{k_2}).                          (7.28)

Therefore, for k_2 odd, z - W^{k_2} is a factor of z^{N/2} + 1 which, in turn, is itself a
factor of z^N - 1. Thus, for k_2 odd, (7.25-27) can be reduced modulo (z^{N/2} + 1)
to become

X_{k_1}(z) ≡ Σ_{n_1=0}^{N-1} X^1_{n_1}(z) W^{n_1k_1} modulo (z^{N/2} + 1)        (7.29)

X^1_{n_1}(z) = Σ_{n_2=0}^{N/2-1} (x_{n_1,n_2} - x_{n_1,n_2+N/2}) z^{n_2} ≡ X_{n_1}(z) modulo (z^{N/2} + 1)   (7.30)

                                                                 (7.31)

Since k_2 is odd and N is a power of two, the permutation k_2k_1 modulo N maps
all values of k_1 and we obtain, by replacing k_1 with k_2k_1,

X_{k_2k_1}(z) ≡ Σ_{n_1=0}^{N-1} X^1_{n_1}(z) W^{k_2n_1k_1} modulo (z^{N/2} + 1)   (7.32)

X_{k_2k_1,k_2} ≡ X_{k_2k_1}(z) modulo (z - W^{k_2}).                             (7.33)

Since (7.33) is defined modulo (z - W^{k_2}), we have z ≡ W^{k_2}. Hence, we can
substitute z for W^{k_2} in (7.32). This gives

X_{k_2k_1}(z) ≡ Σ_{n_1=0}^{N-1} X^1_{n_1}(z) z^{n_1k_1} modulo (z^{N/2} + 1),    (7.34)

which indicates that X_{k_2k_1}(z) can be computed as a polynomial transform of
length N, with N = 2^t, and of root z defined modulo (z^{N/2} + 1). We have shown
in Sect. 6.2.1 that such a transform may be calculated using a radix-2 FFT-type
algorithm with only (N^2/2) log_2 N additions and without multiplications. The N
polynomials X_{k_2k_1}(z) are of degree N/2 - 1 because they are defined modulo
(z^{N/2} + 1). Therefore, we can represent these polynomials as

                                                                 (7.35)

Then, substituting (7.35) into (7.33) yields

                                                                 (7.36)

and, since k_2 is odd,

                                                                 (7.37)

with k_2 = 2u + 1. Equation (7.37) represents N DFTs of length N/2 where the
input sequence is multiplied pointwise by 1, W, W^2, .... These DFTs are identical
to the reduced DFT that appears in the first stage of a decimation-in-frequency
radix-2 FFT decomposition and are sometimes called odd DFTs [7.4]. There-
fore, for k_2 odd, the DFT of size N x N is computed by N reductions modulo
(z^{N/2} + 1), one polynomial transform of length N, and N odd DFTs of length
N/2, the only multiplications being those corresponding to the odd DFTs.
For k_2 even, with k_2 = 2u, the DFT X_{k_1,k_2} of size N x N becomes a simple
DFT of size N x (N/2), with

u = 0, ..., N/2 - 1.                                             (7.38)



By reversing the role of k_1 and 2u, this DFT can be represented in polynomial
notation as

X_{2u}(z) ≡ Σ_{n_2=0}^{N/2-1} X_{n_2}(z) W^{2un_2} modulo (z^N - 1)              (7.39)

                                                                 (7.40)

                                                                 (7.41)

We may then use the same polynomial transform method as above to compute
X_{k_1,2u}. The polynomial z^N - 1 factors into the two polynomials z^{N/2} - 1 and
z^{N/2} + 1 and the roots of z^{N/2} - 1 correspond to W^{k_1}, k_1 even. Therefore, for k_1
even, with k_1 = 2v, X_{k_1,2u} reduces to a simple DFT of size (N/2) x (N/2)

v = 0, ..., N/2 - 1.                                             (7.42)

For k_1 odd, the W^{k_1} are the roots of z^{N/2} + 1 and (7.39, 40) can be defined
modulo (z^{N/2} + 1) instead of modulo (z^N - 1). In this case, X_{k_1,2u} can be com-
puted using a polynomial transform of length N/2 in a way similar to that dis-
cussed above for X_{k_1,k_2}, k_2 odd. This is accomplished with

X^1_{uk_1}(z) ≡ Σ_{n_2=0}^{N/2-1} X^1_{n_2}(z) z^{2un_2} modulo (z^{N/2} + 1),   u = 0, ..., N/2 - 1   (7.43)

X^1_{n_2}(z) ≡ X_{n_2}(z) modulo (z^{N/2} + 1)
         = Σ_{n_1=0}^{N/2-1} (x_{n_1,n_2} + x_{n_1,n_2+N/2} - x_{n_1+N/2,n_2} - x_{n_1+N/2,n_2+N/2}) z^{n_1}   (7.44)

k_1 odd.                                                         (7.45)

Following this procedure, the DFT of size N x N is computed as shown in Fig.
7.2 with reductions modulo (z^{N/2} - 1) and (z^{N/2} + 1), two polynomial trans-
forms, 3N/2 reduced DFTs of N/2 terms, and one DFT of size (N/2) x (N/2).
This last DFT can in turn be computed by the same method, and, by repeating
this process, the (N x N)-point DFT is completely evaluated in (log_2 N) - 1
stages by polynomial transforms. With the conventional row-column method,
the first stage of a radix-2 FFT algorithm reduces the DFT of size N x N into
2N odd DFTs of N/2 terms, N DFTs of length N/2, and one DFT of size
N/2 x N/2. This corresponds to about twice as many DFTs and therefore to about
twice as many multiplications as with the polynomial transform method.

Fig. 7.2. Computation of a DFT of size N x N by polynomial transforms, N = 2^t (reduced
DFT algorithm). The diagram shows the reduction of the N input polynomials of N terms
modulo z^{N/2} + 1, a polynomial transform of length N and root z modulo z^{N/2} + 1
(k_2 odd), a polynomial transform of length N/2 and root z^2 modulo z^{N/2} + 1 (k_1 odd,
k_2 even), and a DFT of size (N/2) x (N/2), yielding the outputs X_{k_1,k_2}

When the reduced DFTs are computed by the Rader-Brenner algorithm
[7.5], as discussed in Sect. 4.3, all complex multiplications are implemented with
only two real multiplications. In this case, the number of arithmetic operations
for DFTs of size N x N, with N = 2^t, computed by polynomial transforms, is
as given in Table 7.1. The entries in this table are derived from the number of
operations corresponding to the reduced DFTs given in Table 4.4 and from the
number of operations for reductions and Chinese remainder reconstruction given
in Table 6.2. It can be verified by comparison with Table 4.5 that the poly-
nomial transform method requires only about half as many multiplications as the
conventional row-column method using the same FFT algorithm and is im-
plemented with significantly fewer additions. It should also be noted that the
polynomial transform approach with N = 2^t retains the basic structure of the
FFT algorithm because the polynomial transforms are computed by an FFT-
type partition.

Table 7.1. Number of real operations for complex DFTs of size N x N computed by polyno-
mial transforms with the reduced Rader-Brenner DFT algorithm. N = 2^t. Trivial multiplica-
tions by ±1, ±j are not counted

DFT size        Number of        Number of     Multiplications   Additions
                multiplications  additions     per point         per point

2 x 2 0 16 0 4.00
4 x 4 0 128 0 8.00
8 x 8 48 816 0.75 12.75
16 x 16 432 4528 1.69 17.69
32 x 32 2736 24944 2.67 24.36
64 x 64 15024 125040 3.67 30.53
128 x 128 76464 599152 4.67 36.57
256 x 256 371376 2790512 5.67 42.58
512 x 512 1747632 12735600 6.67 48.58
1024 x 1024 8039088 57234544 7.67 54.58

The same general approach can also be employed to compute DFTs of size
N x N, with N = p^c, p an odd prime. If, for instance, N = p^2, z^{p^2} - 1 factors
into the three cyclotomic polynomials P_1(z) = z - 1, P_2(z) = z^{p-1} + z^{p-2} +
... + 1, and P_3(z) = z^{p(p-1)} + z^{p(p-2)} + ... + 1. In this case, a DFT of size
p^2 x p^2 is computed as shown in Fig. 7.3 with one polynomial transform of p^2
terms modulo P_3(z), one polynomial transform of p terms modulo P_3(z), p^2 + p
reduced DFTs of length p^2, and one DFT of size p x p. This last DFT can in
turn be evaluated by polynomial transforms. In this approach, each of the re-
duced DFTs of size p^2 is such that only the first p(p - 1) input samples are non-
zero and that the output samples with indices multiple of p are not computed.
These reduced DFTs are equivalent to one correlation of p(p - 1) terms plus one
reduced DFT of length p.

Fig. 7.3. Computation of a DFT of size p^2 x p^2 by polynomial transforms, p odd prime.
The diagram shows the reduction modulo P_3(z) = (z^{p^2} - 1)/(z^p - 1), a polynomial
transform of p^2 terms modulo P_3(z) (k_2 ≢ 0 modulo p), a polynomial transform of p terms
modulo P_3(z) (k_1 ≢ 0, k_2 ≡ 0 modulo p), p^2 + p reduced DFTs of length p^2, and a DFT
of size p x p

We summarize in Table 7.2 the main properties of various two-dimensional
DFTs computed by the polynomial transform method. In this table, the oper-
ations count for execution of polynomial transforms and reductions is derived
from Table 6.2. We also list in Table 7.3 the number of arithmetic operations for
multidimensional DFTs computed by polynomial transforms. In this table, we
have presumed the use of the reduced DFT algorithms of Sect. 7.4 (Table 7.8)
for N = 8, 9, 16 and of the short convolution algorithms of size N - 1 given in
Sect. 3.8.1 (Table 3.1) for N = 5. The reduced DFT algorithm of 7 points is

Table 7.2. Main parameters for DFTs of size N x N computed by polynomial transforms

N = p, p prime
  Polynomial transforms: 1 polynomial transform of p terms modulo (z^p - 1)/(z - 1)
  Additions for polynomial transforms and reductions: p^3 + p^2 - 5p + 4
  DFTs and reduced DFTs: 1 DFT of p terms; p reduced DFTs of p terms
  (p correlations of p - 1 terms)

N = p^2, p prime
  Polynomial transforms: 1 polynomial transform of p^2 terms modulo (z^{p^2} - 1)/(z^p - 1);
  1 polynomial transform of p terms modulo (z^{p^2} - 1)/(z^p - 1);
  1 polynomial transform of p terms modulo (z^p - 1)/(z - 1)
  Additions for polynomial transforms and reductions: 2p^5 + p^4 - 5p^3 + 2p^2 + 6
  DFTs and reduced DFTs: p^2 + p reduced DFTs of p^2 terms; p reduced DFTs of p terms
  (p correlations of p - 1 terms); 1 DFT of p terms

N = 2^t
  Polynomial transforms: 1 polynomial transform of 2^t terms modulo (z^{2^{t-1}} + 1);
  1 polynomial transform of 2^{t-1} terms modulo (z^{2^{t-1}} + 1)
  Additions for polynomial transforms and reductions: (3t + 5)2^{2(t-1)}
  DFTs and reduced DFTs: 3·2^{t-1} reduced DFTs of dimension 2^t;
  1 DFT of size 2^{t-1} x 2^{t-1}

N = p_1p_2, p_1, p_2 primes
  Polynomial transforms: 1 polynomial transform of p_1p_2 terms;
  p_2 polynomial transforms of p_1 terms; p_1 polynomial transforms of p_2 terms
  Additions for polynomial transforms and reductions:
  p_1^2p_2^2(p_1 + p_2 + 2) - 5p_1p_2(p_1 + p_2) + 4(p_1^2 + p_2^2)
  DFTs and reduced DFTs: p_1p_2 + p_1 + p_2 reduced DFTs of p_1p_2 terms;
  p_1 reduced DFTs of p_1 terms; p_2 reduced DFTs of p_2 terms; 1 DFT of p_1p_2 terms

Table 7.3. Number of complex operations for simple two-dimensional DFTs evaluated by
polynomial transforms. Trivial multiplications are given between parentheses. Each complex
multiplication is implemented with two real multiplications

DFT size        Number of multiplications     Number of additions
2 x 2           4 (4)          8
3 x 3           9 (1)          36
4 x 4           16 (16)        64
5 x 5           31 (1)         221
7 x 7           65 (1)         635
8 x 8           64 (40)        408
9 x 9           105 (1)        785
16 x 16         304 (88)       2264

obtained as a 6-point convolution computed with 8 complex multiplications and
34 complex additions by nesting convolutions of 2 and 3 terms. In Table 7.3, the
DFTs corresponding to N = 2, 3, 4 are computed by simple nesting and the
DFT corresponding to N = 9 is computed partly by nesting and partly by poly-
nomial transforms.

7.1.3 Multidimensional DFTs

A similar polynomial transform approach can also be developed to compute
DFTs of dimension greater than 2. Consider, for instance, a DFT X_{k_1,k_2,k_3} of
size N x N x N

                                                                 (7.46)

In polynomial notation, this DFT becomes

X_{k_1,k_2}(z) ≡ Σ_{n_1=0}^{N-1} Σ_{n_2=0}^{N-1} X_{n_1,n_2}(z) W^{n_1k_1} W^{n_2k_2} modulo (z^N - 1)   (7.47)

                                                                 (7.48)

X_{k_1,k_2,k_3} ≡ X_{k_1,k_2}(z) modulo (z - W^{k_3}).                           (7.49)

When N is an odd prime, with N = p, this DFT reduces to a DFT of size p x p
for k_3 = 0. For k_3 ≠ 0, X_{k_1,k_2,k_3} can be computed by a two-dimensional poly-
nomial transform with

                                                                 (7.50)

P(z) = (z^p - 1)/(z - 1)                                         (7.51)

X^1_{n_1,n_2}(z) ≡ X_{n_1,n_2}(z) modulo P(z)                            (7.52)

                                                                 (7.53)

Therefore, the DFT of size p x p x p is mapped by a two-dimensional poly-
nomial transform into a DFT of size p x p plus p^2 reduced DFTs of dimension
p. For a DFT of dimension d, the same process is applied recursively and the
DFT is mapped into one DFT of length p plus p + p^2 + ... + p^{d-1} odd DFTs
of length p. Thus, if M_1 is the number of complex multiplications for a DFT of
length p, the number of complex multiplications M corresponding to a DFT of
dimension d with length p in all dimensions becomes

M = 1 + (M_1 - 1)(p^d - 1)/(p - 1).                              (7.54)

The same DFT is computed with dp^{d-1}M_1 complex multiplications by the row-
column method. Therefore, the number of multiplications is approximately
reduced by a factor of d when the row-column method is replaced by the poly-
nomial transform approach. Thus, the efficiency of the polynomial transform
method, relative to the row-column algorithm, is proportional to d. A similar
result is also obtained when the polynomial transform method is compared to a
nesting technique, since the number of multiplications for nesting is M_1^d, with
M_1 > p for p ≠ 3. This point is illustrated more clearly by considering the case
of a DFT of size 7 x 7 x 7 which is computed with 457 complex multiplications
by polynomial transforms, as opposed to 1323 and 729 multiplications when
the calculations are done by the row-column method and the nesting algorithm,
respectively.
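The multiplication counts quoted in this paragraph can be retraced with a few lines of arithmetic; the sketch below assumes M_1 = 9 complex multiplications for a length-7 DFT, the value implied by the row-column and nesting figures given above.

```python
def poly_transform_mults(p, d, M1):
    # (7.54): d-dimensional DFT of length p (p prime) in every dimension
    return 1 + (M1 - 1) * (p**d - 1) // (p - 1)

def row_column_mults(p, d, M1):
    return d * p**(d - 1) * M1

# 7 x 7 x 7 example from the text: 457 (polynomial transforms),
# 1323 (row-column) and 9**3 = 729 (nesting)
assert poly_transform_mults(7, 3, 9) == 457
assert row_column_mults(7, 3, 9) == 1323
```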
A similar polynomial transform approach applies to any d-dimensional DFT
with common factors in several dimensions and we give the number of complex
arithmetic operations in Table 7.4 for some complex three-dimensional DFTs
computed by polynomial transforms.

Table 7.4. Number of complex operations for simple three-dimensional DFTs evaluated by
polynomial transforms. Trivial multiplications are given between parentheses. Each complex
multiplication is implemented with two real multiplications

DFT size        Number of multiplications     Number of additions

2 x 2 x 2       8 (8)            24
3 x 3 x 3       27 (1)           162
4 x 4 x 4       64 (64)          384
5 x 5 x 5       156 (1)          1686
7 x 7 x 7       457 (1)          6767
8 x 8 x 8       512 (288)        4832
9 x 9 x 9       963 (1)          10383
16 x 16 x 16    4992 (1184)      52960

7.1.4 Nesting and Prime Factor Algorithms

We have seen in Chap. 5 that large DFTs of size N x N can be computed by
nesting small DFTs of size N_i x N_i or by using a prime factor algorithm, when
the various N_i are factors of N which are mutually prime [7.6, 7]. These methods
can be used in combination with polynomial transforms as an alternative to
using large polynomial transforms.
If we consider the simple case corresponding to N = N_1N_2, with (N_1, N_2) = 1,
the DFT of size N_1N_2 x N_1N_2 can be transformed into a four-dimensional

DFT of size (N_1 x N_1) x (N_2 x N_2) by using Good's mapping algorithm [7.6].
With this approach, the four-dimensional DFT is, in turn, computed using
Winograd nesting [7.7] by calculating, by polynomial transforms, a DFT of size
N_1 x N_1 in which each scalar is replaced by an array of N_2 x N_2 terms and each
multiplication is replaced by a DFT of size N_2 x N_2 computed by polynomial
transforms. Thus, if M_1, M_2, M and A_1, A_2, A are, respectively, the number of
complex multiplications and additions required to evaluate the DFTs of sizes
N_1 x N_1, N_2 x N_2, and N_1N_2 x N_1N_2, we have

                                                                 (7.55)

                                                                 (7.56)

The four-dimensional DFT of size (N_1 x N_1) x (N_2 x N_2) can also be com-
puted by the row-column method as N_1^2 DFTs of dimension N_2 x N_2 plus N_2^2
DFTs of dimension N_1 x N_1. In this case, we have

                                                                 (7.57)

                                                                 (7.58)

Since M_1 ≥ N_1^2 and M_2 ≥ N_2^2, the nesting method generally requires more ad-
ditions than the row-column method, except when M_1 = N_1^2. However, for

Table 7.5. Number of real operations for complex multidimensional DFTs evaluated by poly-
nomial transforms and nesting. Trivial multiplications by ±1, ±j are not counted

DFT size        Number of        Number of     Multiplications   Additions
                multiplications  additions     per point         per point

24 x 24 1072 11952 1.86 20.75


30 x 30 2224 26712 2.47 29.68
36 x 36 3328 35488 2.57 27.38
40 x 40 3888 48688 2.43 30.43
48 x 48 5296 59184 2.30 25.69
56 x 56 8240 121264 2.63 38.67
63 x 63 13648 204920 3.44 51.63
72 x 72 13360 166576 2.58 32.13
80 x 80 18672 247568 2.92 38.68
112 x 112 39344 607952 3.14 48.47
120 x 120 35632 553392 2.47 38.43
144 x 144 63664 844048 3.07 40.70
240 x 240 169456 2688912 2.94 46.68
504 x 504 873520 16353584 3.44 64.38
1008 x 1008 4149424 80267312 4.08 79.00
120 x 120 x 120 4312512 99966528 2.50 57.85
240 x 240 x 240 42050240 977859648 3.04 70.74

short DFTs, M_1 and M_2 are not much larger than N_1^2 and N_2^2, and the nesting
method requires fewer multiplications than the row-column algorithm, and a
number of additions which is about the same. Thus, the nesting algorithm is
generally better suited for DFTs of moderate sizes whereas the prime factor
technique is best for large DFTs.
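The trade-off can be sketched in a few lines. The count formulas used below are assumptions consistent with the comparison just made (nesting: M = M_1M_2 and A = N_2^2A_1 + M_1A_2; row-column: M = N_1^2M_2 + N_2^2M_1 and A = N_1^2A_2 + N_2^2A_1); they are stated here for illustration rather than copied from (7.55-58).

```python
def nesting_counts(N1, M1, A1, N2, M2, A2):
    # assumed nesting counts: each addition of the N1 x N1 algorithm acts on an
    # N2 x N2 array, and each of its M1 multiplications becomes an N2 x N2 DFT
    return M1 * M2, N2**2 * A1 + M1 * A2

def row_column_counts(N1, M1, A1, N2, M2, A2):
    # assumed row-column counts for the same four-dimensional DFT
    return N1**2 * M2 + N2**2 * M1, N1**2 * A2 + N2**2 * A1
```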
With both methods, large DFTs can be evaluated using a small set of short
length DFTs computed by polynomial transforms. Moreover, additional com-
putational savings can be obtained by splitting the calculations with the tech-
niques discussed in Sects. 5.3.3 and 5.4.3.
Table 7.5 gives the number of real operations for complex multidimensional
DFTs computed by nesting the small multidimensional DFTs evaluated by
polynomial transforms for which data are tabulated in Tables 7.3 and 7.4. It can
be seen by comparison with Table 7.1 that this method requires only about half
the number of multiplications of the large polynomial transform approach with
size N = 2^t, but uses more additions. We shall see, however, that when this
method is combined with split nesting and another polynomial transform
method, significant additional reduction in the number of operations is made
possible.

7.1.5 DFT Computation Using Polynomial Transforms Defined in Modified Rings
of Polynomials

We have seen that multidimensional DFTs with common factors in several
dimensions can be efficiently converted by polynomial transforms into one-
dimensional DFTs and reduced DFTs. This method is particularly worthwhile
for DFTs with dimensions which are powers of two, because the polynomial
transforms and the DFTs can then be calculated with a minimum number of
operations by a radix-2 FFT-type algorithm. The main disadvantage of this
method, however, is that it is implemented with a number of different polynomi-
al transforms.
We shall now show that the implementation can be greatly simplified, at the
expense of a slightly larger number of arithmetic operations, by modifying the
definition of the rings with a premultiplication of the input data sequence by
powers of a root of -1. In order to introduce this method [7.8], we consider a
DFT X_{k_1,k_2} of dimension N x N, with N = 2^t,

W = e^{-jπ/N},   k_1, k_2 = 0, ..., N - 1,                        (7.59)

where the symbol W represents e^{-jπ/N} instead of e^{-j2π/N} for reasons that will be
apparent later.
We first rewrite (7.59) as
X_{k_1,k_2} = Σ_{n_1=0}^{N-1} Σ_{n_2=0}^{N-1} (x_{n_1,n_2} W^{-n_2}) W^{2n_1k_1} W^{(2k_2+1)n_2},   (7.60)

which is equivalent to premultiplying the input data samples by W^{-n_2} and
computing a modified two-dimensional DFT which is a regular DFT along
dimension k_1 and an odd DFT along dimension k_2. In order to simplify the com-
putation of (7.60), we replace (7.60) by the following equivalent polynomial
representation:

                                                                 (7.61)

                                                                 (7.62)

                                                                 (7.63)

It can be verified easily that (7.61-63) are a valid representation of (7.60) by
substituting X_{n_1}(z), defined by (7.61), into (7.62) and by replacing z by W^{2k_2+1}. We
note that the definition of (7.62) modulo (z^N + 1) is not necessary at this stage.
However, this definition is valid because z^N ≡ W^{N(2k_2+1)} = -1 and because all
the roots of z^N + 1 are given by z = W^{2k_2+1} for k_2 = 0, ..., N - 1. Since
2k_2 + 1 is odd and since N = 2^t, the permutation (2k_2 + 1)k_1 modulo N maps
all values of k_1 for k_1 = 0, ..., N - 1. With this permutation, we obtain

X_{(2k_2+1)k_1}(z) ≡ Σ_{n_1=0}^{N-1} X_{n_1}(z) W^{2(2k_2+1)n_1k_1} modulo (z^N + 1)   (7.64)

                                                                 (7.65)

Equation (7.65) amounts to a simple substitution of W^{2k_2+1} for z. Therefore, we
can replace W^{2k_2+1} by z in (7.64). This gives

X_{(2k_2+1)k_1}(z) ≡ Σ_{n_1=0}^{N-1} X_{n_1}(z) z^{2n_1k_1} modulo (z^N + 1),    (7.66)

which is recognized as a polynomial transform of length N defined modulo
(z^N + 1). This transform is computed without multiplications and with
2N^2 log_2 N real additions using a radix-2 FFT-type algorithm.
When employing this method, the only multiplications required for the DFT
of size N x N are the N^2 premultiplications by W^{-n_2} and the multiplications
required for the evaluation of (7.65). The number of operations corresponding to
(7.65) can be quantified by noting that X_{(2k_2+1)k_1}(z) represents N polynomials of
N terms which can be defined by

                                                                 (7.67)

Consequently, (7.65) becomes

X_{(2k_2+1)k_1,k_2} = Σ_{l=0}^{N-1} y_{k_1,l} W^l W^{2lk_2},                    (7.68)

which represents N odd DFTs of length N.


Following this procedure, a DFT of size N X N, with N = 2', is computed
as shown in Fig. 7.4, with NZ premultiplications by w-n" one polynomial trans-
form of length N, one permutation, and N reduced DFTs of N terms. The
reduced DFTs can be calculated using any convenient FFT algorithm. If we as-
sume that these DFTs are evaluated by a simple radix-2 FFT algorithm in which
the trivial multiplications by ± I and ±j are counted as general multiplications,
a length-N DFT is evaluated with 2Nlog zN real mUltiplications and 3Nlog2N
real additions. In this case, the DFT of size N X N is computed with MI real
multiplications and A I real additions, where

Fig. 7.4. Computation of a DFT of size N x N by polynomial transforms defined in modified
rings of polynomials, N = 2^t. The diagram shows the ordering of the polynomials, a
polynomial transform of size N and root z^2 modulo (z^N + 1), and N reduced DFTs of
N terms

M_1 = 2N^2(4 + log_2 N)                                          (7.69)

A_1 = N^2(4 + 5 log_2 N).                                        (7.70)

If the DFT of size N x N is evaluated by the row-column method, the number of
multiplications M_2 and additions A_2 become

                                                                 (7.71)

                                                                 (7.72)

which demonstrates that the polynomial transform approach is better than the
row-column method for N > 16 and reduces the number of multiplications by
half for large transforms. It should also be noted that the polynomial transform
approach reduces the number of additions by about 15% for large transforms.
Therefore, the foregoing polynomial transform method reduces significantly
the number of arithmetic operations while retaining the structural simplicity of
the row-column radix-2 FFT algorithm. In practice, the reduced DFTs will
usually be calculated via the Rader-Brenner algorithm [7.5] because all complex
multiplications are then implemented with only two real multiplications.
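A quick numerical comparison makes the N > 16 crossover visible. The row-column figures below are an assumption used only for illustration, namely 2N log_2 N real multiplications and 3N log_2 N real additions per length-N DFT with trivial multiplications counted, matching the counting convention stated above.

```python
from math import log2

def modified_ring_counts(N):
    # (7.69)-(7.70)
    return 2 * N**2 * (4 + log2(N)), N**2 * (4 + 5 * log2(N))

def row_column_counts(N):
    # assumption: 2N DFTs of length N, each with 2N log2 N real multiplications
    # and 3N log2 N real additions
    return 4 * N**2 * log2(N), 6 * N**2 * log2(N)

for N in (16, 32, 1024):
    (m1, _), (m2, _) = modified_ring_counts(N), row_column_counts(N)
    print(N, m1 <= m2)   # equal at N = 16, smaller beyond
```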
The number of arithmetic operations can be further reduced by modifying the
ring structure for only part of the procedure. This may be realized by computing
one or several stages with the method described in Sect. 7.1.2 and by completing
the calculations with the modified ring technique. In the case of a one-stage
process, the DFT X_{k_1,k_2} of size N x N is redefined by

W = e^{-j2π/N},   k_1, k_2 = 0, ..., N - 1,                       (7.73)

where W takes its usual representation. In polynomial notation, X_{k_1,k_2} becomes

X_{k_1}(z) ≡ Σ_{n_1=0}^{N-1} X_{n_1}(z) W^{n_1k_1} modulo (z^N - 1)              (7.74)

                                                                 (7.75)

                                                                 (7.76)

For k_2 odd, X_{k_1,k_2} is calculated as in Sect. 7.1.2 by a polynomial transform of N
terms defined modulo (z^{N/2} + 1)

X_{k_2k_1}(z) ≡ Σ_{n_1=0}^{N-1} X^1_{n_1}(z) z^{n_1k_1} modulo (z^{N/2} + 1)     (7.77)

X^1_{n_1}(z) ≡ X_{n_1}(z) modulo (z^{N/2} + 1)                            (7.78)

k_2 odd.                                                         (7.79)

For k_2 even, X_{k_1,k_2} reduces to a DFT of size N x (N/2) which is computed using
a ring translation technique

                                                                 (7.80)

X_{(k_2+1)k_1}(z) ≡ Σ_{n_1=0}^{N-1} X^1_{n_1}(z) z^{n_1k_1} modulo (z^{N/2} + 1)   (7.81)

X_{(k_2+1)k_1,k_2} ≡ X_{(k_2+1)k_1}(z) modulo (z - W^{k_2+1}),   k_2 even,        (7.82)

which indicates that the DFT X_{k_1,k_2} of size N x N is computed as shown in Fig.
7.5 with only N^2/2 premultiplications by W^{-n_2}, plus two polynomial transforms

Fig. 7.5. Computation of a DFT of size N x N by polynomial transforms defined modulo
(z^{N/2} + 1), N = 2^t. The diagram shows the two polynomial transforms of size N and
root z modulo (z^{N/2} + 1)

defined modulo (z^{N/2} + 1) and 2N reduced DFTs of size N/2. When the reduced
DFTs are computed by a simple radix-2 FFT algorithm, the number of real
multiplications M_3 and real additions A_3 become

M_3 = 2N^2(2 + log_2 N)                                          (7.83)

A_3 = N^2(2 + 5 log_2 N).                                        (7.84)

With this scheme, an additional reduction in the number of arithmetic operations is
obtained at the expense of using two polynomial transforms instead of one. The
same method can be used recursively by reducing the DFT of size N x (N/2)
into a DFT of size (N/2) x (N/2) plus N/2 reduced DFTs of N terms and, with
additional stages, into DFTs of sizes (N/4) x (N/4), (N/8) x (N/8), ..., each
additional stage reducing the number of arithmetic operations at the expense of
an additional number of different polynomial transforms. When the decom-
position is complete, the method becomes identical to that described in Sect.
7.1.2. Thus, there is considerable flexibility in trading structural complexity for
computational complexity by selecting the number of stages.

7.2 DFTs Evaluated by Multidimensional Correlations and
Polynomial Transforms

We have seen in the preceding sections that multidimensional DFTs can be ef-
ficiently partitioned by polynomial transforms into one-dimensional DFTs and
reduced DFTs. This method is mainly applicable to DFTs having common
factors in two or more dimensions and therefore does not apply readily to one-
dimensional DFTs. In this section, we shall present a second way of computing
DFTs by polynomial transforms [7.1, 2, 9]. This method is based on the decom-
position of a composite DFT into multidimensional correlations via the Wino-
grad [7.7] algorithm and on the computation of these multidimensional cor-
relations by polynomial transforms when they have common factors in several
dimensions. This method is applicable in general to multidimensional DFTs and
also to some one-dimensional DFTs.

7.2.1 Derivation of the Algorithm

We consider a two-dimensional DFT X_{k_1,k_2} of size N_1 x N_2. This DFT may
either be a genuine two-dimensional DFT or a one-dimensional DFT of length
N_1N_2, with N_1 and N_2 mutually prime, which has been mapped into a two-
dimensional DFT structure by using Good's algorithm [7.6] (Sect. 5.3). X_{k_1,k_2} is
defined by

k_1 = 0, ..., N_1 - 1,   k_2 = 0, ..., N_2 - 1.                   (7.85)

In order to simplify the presentation, we shall assume that N_1 and N_2 are prime.
For k_2 = 0, X_{k_1,k_2} becomes a DFT of length N_1

k_1 = 0, ..., N_1 - 1.                                           (7.86)

For k_2 ≠ 0, we consider first the case corresponding to k_1 = 0. Then, X_{k_1,k_2}
becomes

k_2 = 1, ..., N_2 - 1                                            (7.87)

and, since 1 + W_2 + W_2^2 + ... + W_2^{N_2-1} = 0,

                                                                 (7.88)

Since N_2 is a prime, and n_2, k_2 ≠ 0, we can map X_{0,k_2} into a correlation of
length N_2 - 1 by using Rader's algorithm [7.3] with

n_2 ≡ g^{u_2} modulo N_2
k_2 ≡ g^{v_2} modulo N_2,     u_2, v_2 = 0, ..., N_2 - 2          (7.89)

                                                                 (7.90)

where g is a primitive root of order N_2 - 1 modulo N_2. When k_1, k_2 ≠ 0, X_{k_1,k_2}
becomes a two-dimensional correlation of size (N_1 - 1) x (N_2 - 1)

                                                                 (7.91)

Using the Winograd method [7.7], the DFT of size N_1 x N_2 is calculated by
nesting the DFTs of lengths N_1 and N_2, which is equivalent to computing the
two-dimensional correlation (7.91) via the Agarwal-Cooley nesting algorithm
[7.10] (Sect. 3.3.1). We have seen, however, in Chap. 6 that when a two-dimen-
sional convolution or correlation has common factors in several dimensions,
the number of arithmetic operations can be significantly reduced by replacing the
conventional nesting structure with a polynomial transform method. Thus, one
can expect to reduce the computational complexity of a DFT of size N_1 x N_2
if the derived two-dimensional correlation is evaluated by polynomial transforms.

Fig. 7.6. Computation of a DFT of size 7 x 7 by the Winograd algorithm and polynomial
transforms. The diagram shows the 7 input polynomials of 7 terms, the reductions modulo
(z^6 - 1)/(z^2 - 1) and modulo (z^2 - 1), a polynomial transform of 6 terms, 6 polynomial
multiplications modulo (z^6 - 1)/(z^2 - 1), a correlation of 6 x 6 points, and an inverse
polynomial transform of 6 terms

The same technique can also be applied recursively to accommodate the
case of more than two factors or factors that are composite (Sect. 3.3.1).
When N_1 = N_2, all factors in both dimensions are common and a polynomial

transform mapping of the two-dimensional correlation is always realizable. We
illustrate this method in Fig. 7.6 for a DFT of size 7 x 7. Since the DFT of 7
terms is reduced by Rader's algorithm to one multiplication and one correlation
of 6 terms, the DFT of size 7 x 7 can be mapped into one DFT of 7 terms, one
correlation of 6 terms, and one correlation of size 6 x 6. If the correlations of 6
terms are then calculated by an algorithm requiring 8 complex multiplications,
the complete DFT of size 7 x 7 is evaluated with 81 complex multiplications
using the Winograd nesting algorithm. In this approach, the correlation of size
6 x 6 is computed via nesting, with 64 multiplications. However, if the (6 x 6)-
point correlation is evaluated by polynomial transforms, only 52 complex
multiplications are required and the DFT of size 7 x 7 is evaluated with only 69
multiplications instead of 81.
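The multiplication counts quoted for this 7 x 7 example can be retraced in a few lines (complex multiplications, trivial ones included):

```python
M_dft7        = 1 + 8          # Rader: 1 multiplication + one 6-term correlation (8 mults)
M_nesting     = M_dft7 ** 2    # Winograd nesting of two length-7 DFTs
M_corr66_nest = 8 * 8          # 6 x 6 correlation computed by nesting
M_corr66_poly = 52             # same correlation by polynomial transforms (value from the text)
M_poly        = M_nesting - M_corr66_nest + M_corr66_poly
assert (M_nesting, M_poly) == (81, 69)
```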
Thus, the polynomial transform mapping of multidimensional correlations
provides an alternate solution to the polynomial transform DFT mapping
method discussed in Sect. 7.1. It should be noted, however, that the latter method
is always more efficient whenever it is applicable. This result is due to the fact
that the polynomial transform mapping of the correlations is based on smaller
extension fields. This point can be illustrated by noting that a DFT of size 7 x 7
can be computed with only 65 multiplications by the method of Sect. 7.1, as
opposed to 69 multiplications with the method discussed here. Therefore, the
utility of the method based on the polynomial transform mapping of multidi-
mensional correlations is limited to the evaluation of multidimensional DFTs
having no common factors in the different dimensions and for which the method
of Sect. 7.1 is not applicable.
If we consider, for instance, a two-dimensional DFT of size 7 x 9 (or a
one-dimensional DFT of dimension 63), this DFT cannot be computed by the
method of Sect. 7.1 because 7 and 9 have no common factors. Employing the

Table 7.6. Number of real operations for complex DFTs computed by multidimensional
correlations and polynomial transforms. Trivial multiplications by ±1, ±j are not counted

DFT size        Number of        Number of     Multiplications   Additions
                multiplications  additions     per point         per point

63 172 1424 2.73 22.60


80 188 1340 2.35 16.75
504 1380 14668 2.74 29.10
1008 3116 34956 3.09 34.68
5 x 5 64 452 2.56 18.08
7 x 7 136 1300 2.78 26.53
9 x 9 216 1816 2.67 22.42
16 x 16 496 4752 1.98 18.56
63 x 63 11680 208904 2.94 52.63
5 x 5 x 5 346 3490 2.77 27.92
7 x 7 x 7 1000 14048 3.04 40.96

Winograd algorithm, this DFT is evaluated by nesting a DFT of 9 terms with a
DFT of 7 terms. Using Rader's algorithm, the DFT of 7 terms is converted into
a process with one multiplication and one correlation of 6 terms, while the DFT
of 9 terms is reduced to 5 multiplications and one correlation of 6 terms. Thus,
the Winograd algorithm computes the DFT of size 7 x 9 as 5 DFTs of 7 terms,
one correlation of 6 terms, and one correlation of size 6 x 6, with a total of 198
real multiplications. Alternatively, if the correlation of size 6 x 6 is computed
by polynomial transforms, the total number of multiplications is reduced to 174.
In Table 7.6, we tabulate the number of real operations for complex DFTs
computed by the polynomial transform mapping of multidimensional correla-
tions. It can be seen by comparison with Table 5.3 that this method requires
significantly fewer arithmetic operations than the conventional Winograd algo-
rithm. In the case of a DFT of 1008 points, for example, the numbers of opera-
tions are reduced to 3116 real multiplications and 34956 real additions, as op-
posed to 3548 multiplications and 34668 additions for the Winograd algorithm.

7.2.2 Combination of the Two Polynomial Transform Methods

For large multidimensional DFTs, the two polynomial transform methods can
be combined by converting the multidimensional DFT into a set of one-dimen-
sional DFTs by use of a polynomial transform mapping and, then, by computing
these one-dimensional DFTs via a multidimensional correlation polynomial
transform mapping. With this technique, a DFT of size 63 X 63, for instance, is
calculated by nesting DFTs of size 7 X 7 and 9 X 9 evaluated by the first poly-
nomial transform method. Hence, the DFT of size 7 X 7 is partitioned into 1
multiplication plus 8 correlations of 6 terms, and the DFT of size 9 X 9 is
mapped into 33 multiplications plus 12 correlations of 6 terms. Thus, the DFT
of size 63 X 63 is computed with 33 multiplications, 276 correlations of 6 terms,
and 96 correlations of size 6 X 6. When the (6 X 6)-point correlations are
computed by polynomial transforms, the DFT of size 63 X 63 is calculated
with only 11344 real multiplications as opposed to 13648 multiplications when
the first polynomial transform method is used alone and 19600 multiplications
for the conventional Winograd nesting algorithm. It should be noted that com-
bining the two polynomial transform methods also reduces the number of
additions.
Table 7.7 lists the number of real operations for complex DFTs computed
by combining the two polynomial transform methods with the split nesting
technique. It can be seen by comparison with Table 7.1 that the combined poly-
nomial transform method requires about half the number of multiplications of
the first polynomial transform method for large transforms. In practice, the
number of multiplications required by this method is always very small, as
exemplified by a DFT of size 1008 X 1008 which is calculated with only 3.39
real multiplications per point or about one complex multiplication per point.
It should be noted however that this low computation requirement is ob-

Table 7.7. Number of real operations for complex DFTs computed by combining the two
polynomial transform methods. Trivial multiplications by ±1, ±j are not counted

DFT size        Number of        Number of     Multiplications   Additions
                multiplications  additions     per point         per point

80 188 1340 2.35 16.75


240 596 4980 2.48 20.75
504 1380 14668 2.74 29.10
840 2580 24804 3.07 29.53
1008 3116 32244 3.09 31.99
2520 8340 95532 3.31 37.90
5040 17732 208108 3.52 41.29
63 x 63 11344 193480 2.86 48.75
80 x 80 16944 231344 2.65 36.15
120 x 120 35632 553392 2.47 38.43
240 x 240 153904 2542896 2.67 44.15
504 x 504 726064 15621424 2.86 61.50
1008 x 1008 3449024 71455456 3.39 70.33
80 x 80 x 80 1451616 28134656 2.84 54.96
120 x 120 x 120 4312512 103038528 2.50 57.85
240 x 240 x 240 39221088 925433712 2.84 66.94

tained at the expense of a relatively complex structure and, thus, it is expected


that most practical applications will use the first polynomial transform method
with polynomial transforms of sizes which are powers of two. In this case, the
number of operations is still substantially lower than with the conventional
methods, but the structure of the algorithm remains simple and comparable in
complexity to a conventional FFT algorithm implemented with the row-column
method.

7.3 Comparison with the Conventional FFT

The calculation of a DFT by polynomial transforms is based upon the use of
roots of unity in fields of polynomials. Conversely, the polynomial transforms
can also be viewed as DFTs defined in fields of polynomials. If we consider the
simple scheme of Sect. 7.1.5, a DFT of size N x N, with N = 2^t, is evaluated
with 2N^2 multiplications by powers of W, plus one polynomial transform of N
terms defined modulo (z^N + 1) and N DFTs of N terms. This method makes use
of multiplications by W^{-n_2} and W^l to translate rings of polynomials modulo
(z^N - 1) into fields of polynomials modulo (z^N + 1) and is equivalent to a row-
column FFT method in which the N first DFTs of N terms are replaced by a
polynomial transform. Therefore, this approach eliminates the multiplications
in the N first DFTs of the row-column method, thereby saving (N^2/2) log_2 N com-

plex multiplications, or 2N^2 log_2 N real multiplications and N^2 log_2 N real ad-
ditions, while retaining the simple structure of the FFT implementation. The
method given in Sect. 7.1.2 is essentially a generalization of this technique, which
is derived by using a complete decomposition to eliminate the multiplications by
W^{-n_2}.
When a large multidimensional DFT is evaluated by combining polynomial
transforms and nesting, as in Sects. 7.1.4 and 7.2.1, this method can be con-
sidered as a generalization of the Winograd algorithm in which small multidi-
mensional DFTs and correlations having common factors in several dimensions
are systematically partitioned into one-dimensional DFTs and correlations by
polynomial transform mappings.
In practice, significant computational savings are obtained by computing
DFTs by polynomial transforms. This can be seen by comparing the data given
in Tables 7.1 and 7.7 with those in Table 4.5 which corresponds to two-dimen-
sional DFTs calculated by the Rader-Brenner FFT algorithm and the row-
column method. It can be seen that the number of multiplications is reduced by
a factor of about 2 for large DFTs computed by the first polynomial transform
method used alone and by a factor of about 4 when the two polynomial trans-
form methods are combined. In both cases the number of additions is compara-
ble to and sometimes smaller than the number corresponding to the FFT ap-
proach.
A comparison with the Winograd-Fourier transform algorithm also demon-
strates a significant advantage in favor of polynomial transform methods. For
example, a DFT of size 1008 x 1008 is computed by the WFTA algorithm with
6.25 real multiplications and 91.61 additions per point. This contrasts with the
first polynomial transform technique which requires 7.67 multiplications and
54.58 additions per point for a DFT of size 1024 x 1024 and the combination of
the two polynomial transform methods which requires 3.39 multiplications and
70.33 additions per point for a DFT of size 1008 x 1008.

7.4 Odd DFT Algorithms

When a DFT is evaluated by polynomial transforms, it is partitioned into one-


dimensional DFTs, reduced DFTs, correlations, and polynomial products. The
correlations and polynomial products are formed by the application of the
Rader algorithm (Sect. 5.2) and are therefore implemented with two real multi-
plications per complex multiplication. These correlations and polynomial prod-
ucts can then be computed by the algorithms of Sects. 3.7.1 and 3.7.2 by replac-
ing the real data with complex data and by inverting one of the input sequences.
The corresponding number of complex operations for this process are given by
Tables 3.1 and 3.2.
The small DFT algorithms are given in Sect. 5.5 for N = 2, 3, 4, 5, 7, 8, 9,
16. For N = 2^t, with N > 16, the one-dimensional DFTs can be computed by
the Rader-Brenner algorithm (Sect. 4.3) and the corresponding number of
operations is given in Table 4.3.
For the reduced DFT algorithms, we have already seen in Sect. 7.1.1 that,
when N is a prime, the reduced DFTs become correlations of N - 1 terms. Thus,
these reduced DFTs may be computed by the algorithms of Sect. 3.7.1 with the
corresponding number of complex operations given in Table 3.1. Large odd
DFTs corresponding to N = 2^t can be computed by the Rader-Brenner algo-
rithm as shown in Sect. 4.3 with an operation count given in Table 4.4.
We define in Sects. 7.4.1-4 reduced DFT algorithms for N = 4, 8, 9, 16.
These algorithms are derived from the short DFT algorithms of Sect. 5.5 and
compute q^{t-1}(q - 1) output samples of a DFT of length N = q^t. The reduced
DFT is defined by

Table 7.8. Number of real operations for complex DFTs and reduced DFTs. Trivial multipli-
cations by ±1, ±j are given between parentheses

Size N      Number of multiplications      Number of additions

DFTs
2           4 (4)          4
3           6 (2)          12
4           8 (8)          16
5           12 (2)         34
7           18 (2)         72
8           16 (12)        52
9           22 (2)         88
16          36 (16)        148
32          104 (36)       424
64          272 (76)       1104
128         672 (156)      2720
256         1600 (316)     6464
512         3712 (636)     14976
1024        8448 (1276)    34048

Reduced DFTs
3           4 (0)          8
4           4 (4)          4
5           10 (0)         30
7           16 (0)         68
8           8 (4)          20
9           16 (0)         56
16          20 (4)         64
32          68 (20)        212
64          168 (40)       552
128         400 (80)       1360
256         928 (160)      3232
512         2112 (320)     7488
1024        4736 (640)     17024

X_k = Σ_{n=0}^{q^{t-1}(q-1)-1} x_n W^{nk},   1 ≤ k ≤ N - 1,   k ≢ 0 modulo q,
W = e^{-j2π/N},   j = √-1,                                       (7.92)

where the input sequence is labelled x_n, the output sequence is labelled X_k, and
the last q^{t-1} input samples are zero. Input and output additions must be executed
in the specified index numerical order. Table 7.8 summarizes the number of real
operations for various complex DFTs and reduced DFTs used as building blocks
in the polynomial transform algorithms. Trivial multiplications by ±1, ±j are
given in parentheses.

7.4.1 Reduced DFT Algorithm. N = 4

2 complex multiplications (2), 2 complex additions

m_0 = 1·x_0
m_1 = -j·x_1

X_1 = m_0 + m_1
X_3 = m_0 - m_1.
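As a check, the N = 4 algorithm above can be compared against the defining sum (7.92); the small NumPy verification below is an added illustration, not part of the original algorithms.

```python
import numpy as np

def reduced_dft_4(x0, x1):
    # transcription of the algorithm above (inputs x2 = x3 = 0; outputs X1, X3 only)
    m0, m1 = x0, -1j * x1
    return m0 + m1, m0 - m1

x = np.array([1 + 2j, 3 - 1j, 0, 0])
full = np.fft.fft(x)
assert np.allclose(reduced_dft_4(x[0], x[1]), (full[1], full[3]))
```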

7.4.2 Reduced DFT Algorithm. N = 8

4 complex multiplications (2), 10 complex additions. u = π/4

m_0 = 1·x_0                  m_1 = (x_1 - x_3) cos u
m_2 = -j·x_2                 m_3 = -j(x_1 + x_3) sin u
s_1 = m_0 + m_1              s_2 = m_0 - m_1
s_3 = m_2 + m_3              s_4 = m_2 - m_3
X_1 = s_1 + s_3     X_3 = s_2 - s_4     X_5 = s_2 + s_4     X_7 = s_1 - s_3.

7.4.3 Reduced DFT Algorithm. N = 9

8 complex multiplications (0), 28 complex additions. u = 2π/9

t_1 = x_4 + x_5              t_2 = x_4 - x_5
m_0 = (x_0 + x_0 - x_3)/2
m_1 = ((2 cos u - cos 2u - cos 4u)/3)(x_1 - x_2)
m_2 = ((cos u + cos 2u - 2 cos 4u)/3)(x_2 - t_1)
m_3 = ((cos u - 2 cos 2u + cos 4u)/3)(t_1 - x_1)
m_4 = -j x_3 sin 3u
m_5 = -j(x_1 + x_2) sin u
m_6 = -j(x_2 + t_2) sin 4u
m_7 = j(x_1 - t_2) sin 2u

s_2 = m_1 + m_2 + m_0        s_3 = -m_2 + m_3 + m_0
s_4 = -m_1 - m_3 + m_0       s_5 = m_4 + m_5 + m_6
s_6 = -m_6 + m_7 + m_4       s_7 = -m_5 - m_7 + m_4

X_1 = s_2 + s_5     X_2 = s_3 - s_6     X_4 = s_4 + s_7
X_5 = s_4 - s_7     X_7 = s_3 + s_6     X_8 = s_2 - s_5.

7.4.4 Reduced DFT Algorithm. N = 16

10 complex multiplications (2), 32 complex additions. u = π/8

m_0 = 1·x_0         m_1 = (x_2 - x_6) cos 2u       m_2 = (t_2 + t_4) cos 3u
m_3 = (cos u + cos 3u) t_2      m_4 = (cos 3u - cos u) t_4
m_7 = -j(t_1 + t_3) sin 3u
m_8 = j(sin 3u - sin u) t_1     m_9 = -j(sin u + sin 3u) t_3

s_1 = m_0 + m_1     s_2 = m_0 - m_1
s_4 = m_4 - m_2     s_5 = s_1 + s_3     s_6 = s_1 - s_3
s_7 = s_2 + s_4     s_8 = s_2 - s_4     s_9 = m_5 + m_6
s_10 = m_5 - m_6    s_11 = m_7 + m_8    s_12 = m_7 - m_9
s_13 = s_9 + s_11   s_14 = s_9 - s_11   s_15 = s_10 + s_12
s_16 = s_10 - s_12

X_1 = s_5 + s_13    X_3 = s_8 - s_16    X_5 = s_7 + s_15
X_7 = s_6 - s_14    X_9 = s_6 + s_14    X_11 = s_7 - s_15
X_13 = s_8 + s_16   X_15 = s_5 - s_13.


8. Number Theoretic Transforms

Most of the fast convolution techniques discussed so far are essentially algebraic
methods which can be implemented with any type of arithmetic. In this chapter,
we shall show that the computation of convolutions can be greatly simplified
when special arithmetic is used. In this case, it is possible to define number theo-
retic transforms (NTT) which have a structure similar to the DFT, but with
complex exponential roots of unity replaced by integer roots and all operations
defined modulo an integer. These transforms have the circular convolution prop-
erty and can, in some instances, be computed using only additions and multipli-
cations by a power of two. Hence, significant computational savings can be
realized if NTTs are executed in computer structures which efficiently implement
modular arithmetic.
We begin by presenting a general definition of NTTs and by introducing the
two most important NTTs, the Mersenne transform and the Fermat number
transform (FNT). Then, we generalize our definition of the NTT to include
complex transforms and pseudo transforms. Finally, we conclude the chapter by
discussing several implementation issues and establishing a theoretical relation-
ship between NTTs and polynomial transforms.

8.1 Definition of the Number Theoretic Transforms

Let x_m and h_n be two N-point integer sequences. Our objective is to compute the
circular convolution y_l of dimension N

                                                                 (8.1)

In most practical cases, h_n and x_m are not sequences of integers, but it is always
possible, by proper scaling, to reduce these sequences to a set of integers. We
shall first assume that all arithmetic operations are performed modulo a prime
number q, in the field GF(q). If h_n and x_m are so scaled that |y_l| never exceeds
q/2, y_l has the same numerical value modulo q that would be obtained in normal
arithmetic.
Under these conditions, the calculation of y_l can be simplified by introducing
a number theoretic transform [8.1-3] having the same structure as a DFT, but
with the complex exponentials replaced by an integer g and with all operations
performed modulo q. The direct NTT of h_n is, thus,

H_k ≡ Σ_{n=0}^{N-1} h_n g^{nk} modulo q,                                (8.2)

with a similar relation for the NTT X_k of x_m. Since q is a prime, N has an inverse
N^{-1} modulo q, and we define an inverse transform as

a_l ≡ N^{-1} Σ_{k=0}^{N-1} ā_k g^{-lk} modulo q,                         (8.3)

where

N N^{-1} ≡ 1 modulo q.                                           (8.4)

Note that, since q is a prime, g has also an inverse g^{-1} modulo q. Thus, the nota-
tion g^{-lk} is valid.
We would now like to establish the conditions which must be met for the
transform (8.2) to support circular convolution. Computing the NTTs H_k and
X_k of h_n and x_m, multiplying H_k by X_k, and evaluating the inverse transform a_l
of H_kX_k yields

a_l ≡ N^{-1} Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} h_n x_m Σ_{k=0}^{N-1} g^{(n+m-l)k} modulo q.   (8.5)

Let

S ≡ Σ_{k=0}^{N-1} g^{(n+m-l)k} modulo q.                                (8.6)

If the NTTs support convolution, then (8.5) must reduce to (8.1) and we must
have S ≡ N for t = n + m - l ≡ 0 modulo N and S ≡ 0 for t ≢ 0 modulo N.
The first condition means that the exponents of g must be defined modulo N,
and this implies that

g^N ≡ 1 modulo q.                                                (8.7)

For t ≢ 0 modulo N, (8.6) becomes

(g^t - 1)S ≡ g^{Nt} - 1 ≡ 0 modulo q.                            (8.8)

Thus, S ≡ 0 provided g^t - 1 ≢ 0 modulo q for t ≢ 0 modulo N. This implies
that g must be a root of unity of order N modulo q, that is to say, g must be an
integer such that N is the smallest nonzero integer for which g^N ≡ 1 modulo q.
Hence the following existence theorem:
Theorem 8.1: A NTT of length N and root g supports circular convolution
when defined modulo a prime q if and only if g is a root of unity of order N
modulo q.

An immediate consequence of this theorem is that the inverse transform
defined by (8.3) is indeed the inverse transform. This follows from theorem 8.1
by choosing the sequence x_m such that x_0 = 1, x_m = 0 for m ≠ 0.
Theorem 8.1 also allows one to specify the size of NTTs defined modulo a
prime q. We know from Sect. 2.1.3 that there are always primitive roots of order
q - 1 modulo q and that all the roots must be of order N, with N | (q - 1).
Furthermore, the number of roots of order N is given by φ(N), Euler's totient
function. This implies the following theorem.
Theorem 8.2: A NTT of length N and defined modulo a prime q exists if and only
if N | (q - 1). This NTT supports circular convolution.
Thus, for any prime q, we are able to find the sizes N for which there is an NTT.
The NTT is then completely defined, provided that we can find a root of order
N. This is done by using the methods given in Sect. 2.1.3.
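A concrete instance of theorems 8.1 and 8.2, added here as an illustration: with q = 17 and N = 4 dividing q - 1 = 16, g = 4 is a root of order 4 modulo 17, and the transforms below reproduce the circular convolution modulo q. Python's pow(N, -1, q) supplies the inverses N^{-1} and g^{-1}.

```python
def ntt(x, g, q):
    N = len(x)
    return [sum(x[n] * pow(g, n * k, q) for n in range(N)) % q for k in range(N)]

def intt(X, g, q):
    N = len(X)
    Ninv, ginv = pow(N, -1, q), pow(g, -1, q)
    return [Ninv * sum(X[k] * pow(ginv, l * k, q) for k in range(N)) % q
            for l in range(N)]

q, N, g = 17, 4, 4                      # 4**4 = 256 = 15*17 + 1, so g**N = 1 mod q
x, h = [1, 2, 3, 4], [5, 6, 7, 8]
y = intt([a * b % q for a, b in zip(ntt(x, g, q), ntt(h, g, q))], g, q)
direct = [sum(h[n] * x[(l - n) % N] for n in range(N)) % q for l in range(N)]
assert y == direct
```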

8.1.1 General Properties of NTTs

Previously, we have restricted our discussion to NTTs defined modulo a prime
q. In practice, this definition is unnecessarily restrictive, since NTTs can also be
defined in a ring of numbers modulo an integer q, where q is a composite num-
ber. In order to specify the existence conditions for such NTTs, we proceed once
again as above, by defining this new NTT class by (8.2) and the inverse NTT
by (8.3) and (8.4).
Note, however, that in order to define the inverse NTT, we need the inverses
N^{-1} and g^{-1} of N and g. Since q is composite, such inverses exist if and only if
N and g are mutually prime with q. Now if we try to establish the circular con-
volution property by (8.5) and (8.6), we must have, as above,

g^N ≡ 1 modulo q.                                                (8.9)

This condition ensures the existence of an inverse g^{-1} of g, since

g·g^{N-1} ≡ 1 modulo q.                                          (8.10)

The last condition which must be satisfied to support the circular convolution
property corresponds to (8.8) and implies not only that g is a root of order N
modulo q, but also, since q is composite, that [(g^t - 1), q] = 1. Hence, the fol-
lowing existence theorem may be defined.
Theorem 8.3: An NTT of length N and root g, defined modulo a composite in-
teger q, supports circular convolution if and only if the following conditions
are met:

g^N ≡ 1 modulo q
N N^{-1} ≡ 1 modulo q

[(g^t - 1), q] = 1 for t = 1, ..., N - 1.

Note that the condition [(g^t - 1), q] = 1 is stronger than just stating that
g must be a root of order N modulo q. This can be seen, for instance, in the case
corresponding to q = 15. In this case, 2 is a root of order 4 modulo 15, since the
4 powers of two 2^0, 2^1, 2^2, 2^3 are all distinct and 2^4 ≡ 1 modulo 15. However
we have 2^2 - 1 = 3, and 3 is not relatively prime to 15. In practice, the condi-
tion [(g^t - 1), q] = 1 can be replaced by a more restrictive condition by noting
that it corresponds to the need to ensure that

S ≡ Σ_{k=0}^{N-1} g^{tk} ≡ 0 modulo q,   for t = 1, ..., N - 1.        (8.11)

The following theorem, due to Erdelsky [8.4], specifies the conditions required
for the existence of NTTs which support circular convolution.
Theorem 8.4: An NTT of length N and root g, defined modulo a composite
integer q, supports circular convolution if and only if the following conditions
are met:

g^N ≡ 1 modulo q
N N^{-1} ≡ 1 modulo q
[(g^d - 1), q] = 1 for every integer d such that N/d is a prime.
(Or equivalently Σ_{k=0}^{N-1} g^{dk} ≡ 0 modulo q for every d such that N/d is a prime.)

Proof of this theorem can be found in [8.4].


Consider now the simplest composite numbers, which correspond to a power
of a prime

q = q_1^{r_1},   q_1 prime.                                      (8.12)

In this case, the condition N N^{-1} ≡ 1 implies that N must be relatively prime
with q_1. Moreover, g is of necessity relatively prime to q_1, because the condition
g^N ≡ 1 modulo q implies g^N ≡ 1 modulo q_1. Therefore, for each g relatively
prime to q, we have, by Euler's theorem (theorem 2.3),

g^{φ(q)} ≡ 1 modulo q.                                           (8.13)

And, since φ(q) = q_1^{r_1-1}(q_1 - 1),

N | (q_1 - 1).                                                   (8.14)

We now can demonstrate the following theorem which establishes the existence
of an NTT defined modulo q_1^{r_1} and of length q_1 - 1.
Theorem 8.5: Given an NTT which supports circular convolution when defined
Theorem 8.5: Given an NTT which supports circular convolution when defined

modulo q_1, q_1 prime, with the root g_1 and the length q_1 - 1, there is always an
NTT of length q_1 - 1 when defined modulo q_1^{r_1}. This NTT supports circular
convolution and its root is g = g_1^{q_1^{r_1-1}}.
In order to demonstrate this theorem, we note first that the existence of the
NTT defined modulo q_1 implies that (g_1, q_1) = 1. Then, Euler's theorem implies
that g^{q_1-1} = g_1^{q_1^{r_1-1}(q_1-1)} ≡ 1 modulo q_1^{r_1}. Moreover, since q_1 - 1 has no com-
mon factors with q_1, (q_1 - 1) is mutually prime with q_1 and q_1^{r_1} and has therefore
an inverse modulo q_1^{r_1}. We also note that the existence of the NTT defined
modulo q_1 implies that [(g_1^s - 1), q_1] = 1 for s = 1, ..., q_1 - 2. Thus, g_1^s - 1 is
not a multiple of q_1, and since, by Fermat's theorem (theorem 2.4), g_1^{q_1} ≡ g_1
modulo q_1, we have, by systematically replacing g_1 by g_1^{q_1},
g^s - 1 ≡ g_1^{q_1^{r_1-1}s} - 1 ≡ g_1^s - 1 modulo q_1. This means that g^s - 1 has no
common factors with q_1 for s = 1, ..., q_1 - 2. Hence the three conditions of
theorem 8.3 are met and this completes the proof of theorem 8.5.
We can now consider any composite integer q given by its unique prime
power factorization.

q = q_1^{r_1} q_2^{r_2} ... q_i^{r_i} ... q_e^{r_e}.                   (8.15)

By the Chinese remainder theorem (theorem 2.1), the N-length circular convolu-
tion modulo q can be calculated by evaluating separately the N-length convolu-
tions modulo each q_i^{r_i} and performing a Chinese remainder reconstruction to
recover the convolution modulo q from the convolutions modulo q_i^{r_i}. Therefore,
an N-length NTT which supports circular convolution will exist if and only if
N-length NTTs exist modulo each factor q_i^{r_i}. Theorem 8.5 shows that this is the
case if N | (q_i - 1). Thus, N must divide the greatest common divisor (GCD)
of the (q_i - 1) and we have the following existence theorem.
Theorem 8.6: A length-N NTT defined modulo q, with q = q_1^{r_1} ... q_i^{r_i} ... q_e^{r_e},
supports circular convolution if and only if

N | GCD[(q_1 - 1), (q_2 - 1), ..., (q_i - 1), ..., (q_e - 1)].        (8.16)

This theorem immediately gives the maximum transform length, which is
GCD[(q_1 - 1), ..., (q_i - 1), ..., (q_e - 1)]. It should also be noted that theorem
8.6 gives a simple way of computing the root g of order N modulo q. This is done
by first computing the roots g_{1,i} of order N modulo q_i by the methods given in
Sect. 2.1.3. The roots g_{2,i} of order N modulo q_i^{r_i} are then obtained by theorem
8.5, with g_{2,i} = g_{1,i}^{q_i^{r_i-1}}, and g is derived from the g_{2,i} by a Chinese
remainder reconstruction.
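Theorem 8.6 translates directly into a one-line computation of the maximum transform length; the sketch below, added for illustration, takes the list of distinct prime factors of q as input.

```python
from functools import reduce
from math import gcd

def max_ntt_length(prime_factors):
    # theorem 8.6: N must divide GCD[(q_1 - 1), ..., (q_e - 1)]
    return reduce(gcd, [qi - 1 for qi in prime_factors])

assert max_ntt_length([3, 5]) == 2      # q = 15: transform lengths are limited to N | 2
```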
The circular convolution property is by far the most important property of
the NTTs, because it allows one to replace the direct computation of a convolu-
tion by that of three NTTs plus N multiplications in the transform domain. In
the general case, NTTs are computed with multiplications by powers of integers

g and are, therefore, not significantly simpler than DFTs, except that the com-
plex exponentials in the DFTs are replaced by real integers. Thus, real convolu-
tions are computed via NTIs with real arithmetic, instead of complex arithmetic
as with the DFT approach. We shall see, however, in the following sections that
when the modulus q is properly selected, the mUltiplications by powers of g are
replaced by simple shifts, thereby simplifying the NTI computations consider-
ably. Another advantage of computing convolutions by NTIs instead of FFTs is
that the convolutions are computed exactly, without round-off errors. Thus, the
NTT approach is well adapted to high accuracy computations.
Since the NTTs have the same structure as DFTs, they have the same general properties as the DFTs and the reader can refer to Sect. 4.1 for a description of them. We simply note here that the NTT definition implies the linearity property

{h_n} + {x_n} ↔ {H_k} + {X_k} (8.17)

{A h_n} ↔ {A H_k}. (8.18)

8.2 Mersenne Transforms

In order to simplify the computation of NTTs, we would like the modulus q to be as simple as possible. The most obvious choice is q = 2^t. However, in this case, the maximum transform length is 1, which rules out q = 2^t. Similarly, when q is even, one of the factors of q is a power of 2 and, by theorem 8.6, the maximum transform length is also 1. Thus, the only cases of interest correspond to q odd. Then, the simplest choice is q = 2^p - 1, because arithmetic modulo (2^p - 1) is the well-known one's complement arithmetic which can be implemented easily with binary hardware. If p is composite, with p = p_1 p_2 and p_1 a prime, then 2^{p_1} - 1 is a factor of 2^p - 1 and the maximum transform length cannot be larger than possible with 2^{p_1} - 1. Therefore, for q = 2^p - 1, the most interesting cases correspond to p prime. The integers 2^p - 1 with p prime are the Mersenne numbers discussed in Sect. 2.1.5 and the transforms defined modulo Mersenne numbers are called Mersenne transforms [8.5].

8.2.1 Definition of Mersenne Transforms

Theorem 8.6 implies the existence of a length-N NTT which supports circular convolution modulo a Mersenne number q = 2^p - 1, p prime, provided that N divides all the q_i - 1, where the q_i^{r_i} are the factors of q. Some of the Mersenne numbers are primes. For these numbers, the possible transform lengths are given by N | (q - 1), or

N | (2^p - 2). (8.19)


We know, by Fermat's theorem (theorem 2.4), that p divides 2^p - 2. Moreover, 2 is an obvious divisor of 2^p - 2. Hence we can define NTTs of lengths p and 2p modulo prime Mersenne numbers. When q is composite, we know, by theorem 2.14, that every prime factor q_i of a composite Mersenne number is of the form

q_i = 2 c_i p + 1, c_i integer. (8.20)

Thus we have q_i - 1 = 2 c_i p and 2p divides every q_i - 1. This implies, by theorem 8.6, that we can also define NTTs of lengths p and 2p modulo composite Mersenne numbers.
In order to complete the definition of the Mersenne transforms, we must now find the roots g of order p and 2p. For q prime, an obvious root of order p is 2, since the p first powers of 2, corresponding to 1, 2, 2^2, ..., 2^{p-1}, are all distinct and 2^p ≡ 1 modulo q. For q composite, 2 is also a root of order p modulo q, but we must also ensure that the two last conditions of theorem 8.3 are satisfied, that is to say, that p has an inverse p^{-1} and that Σ_{k=0}^{p-1} 2^{tk} ≡ 0 modulo q for t ≢ 0 modulo p. For the inverse p^{-1} of p, we note that, since p | (2^p - 2),

p^{-1} ≡ 2^p - 1 - (2^p - 2)/p modulo (2^p - 1). (8.21)

For the last condition, we note that, since p is a prime, the set of exponents tk modulo p in S = Σ_{k=0}^{p-1} 2^{tk} is a simple permutation of k. Thus, for t ≢ 0 modulo p,

Σ_{k=0}^{p-1} 2^{tk} ≡ Σ_{k=0}^{p-1} 2^k = 2^p - 1 ≡ 0 modulo (2^p - 1). (8.22)

Hence we can define a p-point Mersenne transform having the circular convolution property by

X_k ≡ Σ_{m=0}^{p-1} x_m 2^{mk} modulo (2^p - 1), p prime, k = 0, ..., p - 1 (8.23)

and the corresponding inverse Mersenne transform

x_m ≡ p^{-1} Σ_{k=0}^{p-1} X_k 2^{-mk} modulo (2^p - 1), m = 0, ..., p - 1 (8.24)

with

2^{-mk} ≡ 2^{(p-1)mk} modulo (2^p - 1). (8.25)

Thus, a length-p circular convolution is computed as shown in Fig. 8.1 by three Mersenne transforms plus p multiplications in the transform domain. When one of the input sequences, h_n, is fixed, its transform H_k can be precalculated and combined with the multiplications by p^{-1} corresponding to the inverse transform.

[Fig. 8.1. Computation of a length-p circular convolution modulo (2^p - 1) by Mersenne transforms]

In this case, only two Mersenne transforms need to be evaluated. Moreover, since the roots of Mersenne transforms are powers of two, each Mersenne transform is calculated with only p(p - 1) additions and (p - 1)^2 shifts, and the only general multiplications required to compute the length-p circular convolution are the p multiplications in the transform domain. Hence the use of Mersenne transforms can bring significant computational savings for the evaluation of circular convolutions, even when compared to other efficient methods, such as the FFT approach.
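The transform pair (8.23)-(8.25) and the convolution scheme of Fig. 8.1 can be sketched directly in Python (our illustration; it uses ordinary modular arithmetic rather than the shift-and-add hardware the text has in mind, and the function names are ours):

    def mersenne_transform(x, p):
        # X_k = sum_m x_m * 2^(mk) mod (2^p - 1), k = 0, ..., p-1   (8.23)
        q = (1 << p) - 1
        return [sum(x[m] * pow(2, m * k, q) for m in range(p)) % q for k in range(p)]

    def inverse_mersenne_transform(X, p):
        # x_m = p^{-1} * sum_k X_k * 2^(-mk) mod (2^p - 1), with 2^{-1} ≡ 2^{p-1}  (8.24)-(8.25)
        q = (1 << p) - 1
        p_inv = q - (q - 1) // p           # (8.21): p^{-1} ≡ 2^p - 1 - (2^p - 2)/p
        return [p_inv * sum(X[k] * pow(2, (p - 1) * m * k, q) for k in range(p)) % q
                for m in range(p)]

    def circular_convolution_mersenne(h, x, p):
        # Three transforms plus p pointwise products, as in Fig. 8.1.
        q = (1 << p) - 1
        H = mersenne_transform(h, p)
        X = mersenne_transform(x, p)
        Y = [(H[k] * X[k]) % q for k in range(p)]
        return inverse_mersenne_transform(Y, p)

For p = 5 this reproduces the arithmetic of the worked example in Sect. 8.2.3.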
We have seen above that Mersenne transforms can also be defined with length 2p. It can be seen that -2 is a root of order 2p modulo a Mersenne number q = 2^p - 1, since the 2p first powers of -2, defined by 1, -2, 4, ..., (-2)^{2p-1}, are all distinct and (-2)^{2p} ≡ 1 modulo q. Thus, for q prime, we can define a Mersenne transform of length 2p with root -2. For q composite, since 2p | (2^p - 2), 2p has an inverse (2p)^{-1} defined by

(2p)^{-1} ≡ 2^p - 1 - (2^p - 2)/(2p) modulo (2^p - 1). (8.26)

For q composite, we must also show that S = Σ_{k=0}^{2p-1} (-2)^{tk} ≡ 0 modulo q for t = 1, ..., 2p - 1. We note first that when t is even, S = Σ_{k=0}^{2p-1} (-2)^{tk} = 2 Σ_{k=0}^{p-1} (-2)^{tk} ≡ 0 modulo q for t ≢ 0 modulo p. For t odd, and t ≠ p, tk modulo 2p is a simple permutation and we have

Σ_{k=0}^{2p-1} (-2)^{tk} ≡ Σ_{k=0}^{2p-1} (-2)^k = -(2^{2p} - 1)/3 ≡ 0 modulo q. (8.27)

For t = p, S = 1 - 1 + 1 - 1 + ... = 0.
Hence, for any Mersenne number, we can define a Mersenne transform of length 2p by

X_k ≡ Σ_{m=0}^{2p-1} x_m (-2)^{mk} modulo (2^p - 1) (8.28)

and an inverse transform

x_m ≡ (2p)^{-1} Σ_{k=0}^{2p-1} X_k (-2)^{-mk} modulo (2^p - 1) (8.29)

with

(-2)^{-mk} ≡ (-2)^{(2p-1)mk} modulo (2^p - 1). (8.30)

These double-length Mersenne transforms can also be computed without multiplications. When 2^p - 1 is a prime, it is possible to define Mersenne transforms of dimension larger than 2p, since the maximum transform length is then 2^p - 2. However, the maximum number of distinct powers of -2 is exactly 2p, so that the roots g of these larger transforms can no longer be simple powers of two. Thus, these larger transforms require some general multiplications and are therefore less useful than the transforms of lengths p and 2p. In practice, only the transforms of lengths p and 2p are called Mersenne transforms.

8.2.2 Arithmetic Modulo Mersenne Numbers

Any integer x_m defined modulo a Mersenne number q = 2^p - 1 can be represented as a p-bit word

x_m = Σ_{i=0}^{p-1} x_{m,i} 2^i, x_{m,i} ∈ (0, 1). (8.31)

From the binary representation of x_m given by (8.31), -x_m can be obtained easily by replacing each bit x_{m,i} of x_m with its complement x̄_{m,i}. Since x_{m,i} + x̄_{m,i} = 1, the integer x̄_m obtained by this complementation is such that

x_m + x̄_m = 2^p - 1 ≡ 0 modulo (2^p - 1). (8.32)

Hence x̄_m ≡ -x_m modulo (2^p - 1), and negation reduces to a simple one's complement.

Consider now two integers x_m and h_n, with x_m defined by (8.31) and h_n defined by

h_n = Σ_{i=0}^{p-1} h_{n,i} 2^i, h_{n,i} ∈ (0, 1). (8.33)



If we add the two numbers h_n and x_m, we obtain a (p + 1)-bit integer c_n defined by

c_n = h_n + x_m = Σ_{i=0}^{p} c_{n,i} 2^i, c_{n,i} ∈ (0, 1) (8.34)

and, since 2^p ≡ 1 modulo (2^p - 1),

c_n ≡ Σ_{i=0}^{p-1} c_{n,i} 2^i + c_{n,p} modulo (2^p - 1). (8.35)

Thus, addition modulo a Mersenne number is performed very simply by using a conventional full binary adder of p bits and by folding the most significant carry bit output back into the least significant carry bit input.
The multiplication modulo (2^p - 1) of an integer x_m by a power of two, 2^d, is also done very simply. Assuming that c_n is defined by

c_n ≡ 2^d x_m modulo (2^p - 1), (8.36)

we have

c_n = Σ_{i=0}^{p-1-d} x_{m,i} 2^{i+d} + 2^p Σ_{i=p-d}^{p-1} x_{m,i} 2^{i+d-p} (8.37)

and, since 2^p ≡ 1 modulo (2^p - 1),

c_n ≡ Σ_{i=0}^{p-1} x_{m,⟨i-d⟩} 2^i modulo (2^p - 1), (8.38)

where the index ⟨i - d⟩ is taken modulo p. This shows that a multiplication by 2^d amounts to a simple d-bit rotation of a word of p bits.
General multiplications are implemented easily by combining the additions and shifts discussed above. It should also be noted that the binary representation discussed above assigns a double representation to the integer zero, since x_m ≡ 0 if the bits x_{m,i} are either all zeros or all ones. Thus, when the final result of a computation modulo (2^p - 1) is converted into normal arithmetic, one must detect the condition corresponding to all bits of x_m being equal to one and set the final result to zero when this condition is realized.
Thus, arithmetic modulo a Mersenne number is not significantly more complex than normal arithmetic when implemented with special purpose hardware, and the multiplications by powers of two are considerably simpler than the general multiplications used in other transforms such as the DFT. This provides motivation for replacing FFTs with NTTs in the calculation of convolutions.
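A minimal sketch of this one's complement style arithmetic (ours; the function names are hypothetical) is the following, with addition implemented by end-around carry and multiplication by 2^d as a d-bit rotation:

    def add_mod_mersenne(a, b, p):
        # p-bit add with the carry-out folded back into the carry-in, (8.34)-(8.35)
        q = (1 << p) - 1
        s = a + b
        s = (s & q) + (s >> p)        # end-around carry
        return 0 if s == q else s     # map the all-ones word back to zero

    def mul_pow2_mod_mersenne(x, d, p):
        # multiplication by 2^d is a d-bit left rotation of the p-bit word, (8.38)
        q = (1 << p) - 1
        d %= p
        r = ((x << d) | (x >> (p - d))) & q
        return 0 if r == q else r

    # Example modulo 2^5 - 1 = 31: 2^3 * 20 = 160 ≡ 5, and the rotation gives the same.
    print(mul_pow2_mod_mersenne(20, 3, 5))   # 5
    print((20 * 8) % 31)                     # 5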

8.2.3 Illustrative Example

Given the two length-5 data sequences h_n and x_m, use the 5-point Mersenne transform defined modulo (2^5 - 1) to compute the circular convolution y_l of h_n and x_m. h_n and x_m are defined by

h_0 = 1, h_1 = 3, h_2 = 2, h_3 = 2, h_4 = 0 (8.39)

x_0 = 3, x_1 = 0, x_2 = 2, x_3 = 1, x_4 = 2. (8.40)

The 5-point Mersenne transform H_k of h_n is defined by

H_k ≡ Σ_{m=0}^{4} h_m 2^{mk} modulo 31, k = 0, ..., 4, (8.41)

with a similar relation for the transform X_k of x_m. In matrix notation, (8.41) becomes

    H_0       1   1   1   1   1       1
    H_1       1   2   4   8  16       3
    H_2   ≡   1   4  16   2   8   ·   2    modulo 31.  (8.42)
    H_3       1   8   2  16   4       2
    H_4       1  16   8   4   2       0

Thus, the transform sequences H_k and X_k are given by

H_0 ≡ 8,  H_1 ≡ 0,  H_2 ≡ 18,  H_3 ≡ 30,  H_4 ≡ 11 (8.43)

X_0 ≡ 8,  X_1 ≡ 20,  X_2 ≡ 22,  X_3 ≡ 0,  X_4 ≡ 27. (8.44)

Multiplying H_k by X_k modulo 31 yields

H_0 X_0 ≡ 2,  H_1 X_1 ≡ 0,  H_2 X_2 ≡ 24,  H_3 X_3 ≡ 0,  H_4 X_4 ≡ 18. (8.45)

We must now multiply H_k X_k by 5^{-1} ≡ 25 modulo 31. This gives

25 H_0 X_0 ≡ 19   25 H_1 X_1 ≡ 0   25 H_2 X_2 ≡ 11   25 H_3 X_3 ≡ 0   25 H_4 X_4 ≡ 16. (8.46)

Since 2^{-1} ≡ 2^4 modulo 31, the inverse Mersenne transform is given by

y_l ≡ 25 Σ_{k=0}^{4} H_k X_k 2^{-lk} modulo 31 (8.47)

or, in matrix notation,

    y_0       1   1   1   1   1       19
    y_1       1  16   8   4   2        0
    y_2   ≡   1   8   2  16   4   ·   11    modulo 31.  (8.48)
    y_3       1   4  16   2   8        0
    y_4       1   2   4   8  16       16

This gives the final result

y_0 = 15,  y_1 = 15,  y_2 = 12,  y_3 = 13,  y_4 = 9. (8.49)

The direct computation in ordinary arithmetic would produce the same result. Note, however, that if we had chosen larger input samples, some of the output samples of the convolution would have been greater than 30. In this case, the Mersenne transform approach would produce erroneous samples because of the reduction modulo 31.
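For readers following the example in code, a direct check of (8.49) (our snippet, using plain modular arithmetic rather than the transforms; the sequences are those of the example above):

    h = [1, 3, 2, 2, 0]
    x = [3, 0, 2, 1, 2]
    N, q = 5, 31
    y = [sum(h[n] * x[(l - n) % N] for n in range(N)) % q for l in range(N)]
    print(y)   # [15, 15, 12, 13, 9], matching (8.49)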

8.3 Fermat Number Transforms

Mersenne transforms have a number of desirable attributes for the computation of convolutions. In particular, these transforms utilize an easily implemented arithmetic and can be computed without multiplications. The principal deficiencies of Mersenne transforms, however, relate to the lack of a fast transform algorithm and to the very rigid relationship between word length and transform length. These limitations stem from the fact that Mersenne transforms support convolution with the simple roots 2 or -2 only for transform lengths p and 2p, with p prime. Thus, for a word length of p bits, one can, in practice, employ only the two lengths p and 2p. Since p is a prime, the Mersenne transform of length p cannot be computed with a fast FFT-like computation structure, and only a two-stage fast algorithm can be used for length 2p. Thus, large Mersenne transforms are computed with a much larger number of additions than FFTs of comparable size.
In order to overcome these problems, one is led to choose another modulus q. The simplest choice, after q = 2^t and q = 2^p - 1, is q = 2^v + 1, which yields a still relatively simple arithmetic. If v is odd, 3 is a factor of 2^v + 1 and, by theorem 8.6, the maximum transform length cannot be larger than 3 - 1 = 2. If v is even, with v = s2^t and s odd, 2^{2^t} + 1 is a factor of 2^{s2^t} + 1, and the transform defined modulo (2^{s2^t} + 1) cannot be larger than the transform defined modulo (2^{2^t} + 1). This indicates that the best choice corresponds to a modulus F_t = 2^{2^t} + 1. The numbers F_t are the Fermat numbers (Sect. 2.1.5) and the NTTs defined modulo F_t are called Fermat number transforms (FNT) [8.5-7].

8.3.1 Definition of Fermat Number Transforms

We have seen in Sect. 2.1.5 that the first five Fermat numbers, F_0 to F_4, are prime while all other known Fermat numbers are composite. When F_t is a prime, the maximum transform length is F_t - 1 = 2^{2^t}, and therefore, all possible transform lengths N correspond to N | 2^{2^t}.
When F_t is composite, every prime factor q_i of F_t is of the form

q_i = c_i 2^{t+2} + 1. (8.50)

Hence, theorem 8.6 implies that we can always define an N-length transform modulo a composite Fermat number, provided that

N | 2^{t+2}. (8.51)

We must now find the roots of these transforms. It is obvious that 2 is a root of order 2^{t+1} modulo F_t, since 2^{2^t} ≡ -1 and since 2^i takes the 2^{t+1} distinct values 1, 2, 2^2, ..., 2^{2^t - 1}, -1, -2, ..., -2^{2^t - 1} for i = 0, 1, 2, ..., 2^{t+1} - 1. This means that when F_t is a prime, we can define an FNT of length N = 2^{t+1} with root 2. For F_t composite, 2 is also a root of order 2^{t+1}, but we must also prove that 2^{t+1} has an inverse and that 2^{2^t} - 1 is mutually prime with 2^{2^t} + 1 (theorem 8.4). Since 2^{2^t} ≡ -1, the inverse of 2^{t+1} modulo F_t is obviously -2^{2^t - t - 1}. Moreover, we have 2^{2^t} - 1 = (2^{2^t} + 1) - 2. This means that any divisor of 2^{2^t} - 1 and 2^{2^t} + 1 should also divide 2. Thus, this divisor could only be 2, but this is impossible since 2^{2^t} - 1 and 2^{2^t} + 1 are odd.
Under these conditions, we can define a length-2^{t+1} FNT which supports circular convolution by

X_k ≡ Σ_{m=0}^{2^{t+1}-1} x_m 2^{mk} modulo (2^{2^t} + 1) (8.52)

and the corresponding inverse FNT

x_m ≡ -2^{2^t - t - 1} Σ_{k=0}^{2^{t+1}-1} X_k 2^{-mk} modulo (2^{2^t} + 1) (8.53)

with

2^{-mk} ≡ (-2^{2^t - 1})^{mk} modulo (2^{2^t} + 1). (8.54)

It follows immediately that, since 2 is a root of order 2^{t+1} modulo F_t, 2^{2^s} is a root of order 2^{t+1-s}. We can, therefore, always define FNTs of length 2^{t+1-s} with root 2^{2^s}.
We have shown above that it is possible to define FNTs of length 2^{t+2} when F_t is composite. Since the maximum number of distinct powers of ±2 modulo (2^{2^t} + 1) is equal to 2^{t+1}, the roots of the length-2^{t+2} FNTs can no longer be simple powers of two. We note that 2 is a root of order 2^{t+1} and therefore that √2 is a root of order 2^{t+2}. However, √2 has a very simple expression in a ring of Fermat numbers

√2 ≡ 2^{v/4}(2^{v/2} - 1) modulo (2^v + 1), v = 2^t. (8.55)

Thus, FNTs have lengths which are powers of two, and can be computed using only additions and multiplications by powers of 2 for sizes up to N = 2^{t+2}. Larger FNTs can be defined modulo prime Fermat numbers, but in this case, the roots are no longer simple and the computation of these transforms requires general multiplications. Therefore, most practical applications are restricted to a maximum length equal to 2^{t+2}.
FNTs are superior to Mersenne transforms in several respects. As a first point of difference, it can be noted that FNTs permit much more flexibility in selecting the transform length as a function of the word length than Mersenne transforms. A second advantage of using FNTs relates to the highly composite length of such transforms. This makes it possible to evaluate an FNT with a reduced number of additions by use of a radix-2 FFT-type algorithm. To illustrate this point with a decimation in time algorithm, the length-2^{t+1} FNT defined by (8.52) can be calculated, in the first stage, by

X_k ≡ Σ_{m=0}^{N/2-1} x_{2m} 2^{2mk} + 2^k Σ_{m=0}^{N/2-1} x_{2m+1} 2^{2mk} modulo F_t (8.56)

X_{k+N/2} ≡ Σ_{m=0}^{N/2-1} x_{2m} 2^{2mk} - 2^k Σ_{m=0}^{N/2-1} x_{2m+1} 2^{2mk} modulo F_t. (8.57)

Thus, the FFT-type computation structure of an FNT is similar to that of a conventional FFT, but with multiplications by complex exponentials replaced by simple shifts. This means that an FNT of length N can be calculated with N log_2 N real additions and (N/2) log_2 N shifts.
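The decimation-in-time structure of (8.56)-(8.57) can be sketched recursively as follows (our illustration; the multiplications by 2^k are ordinary modular products here rather than hardware shifts, and the function names are ours):

    def fnt(x, t):
        # Length of x must be N = 2^(t+1); root is 2, modulus F_t = 2^(2^t) + 1.
        q = (1 << (1 << t)) + 1
        return _fnt_rec(x, 2, q)

    def _fnt_rec(x, g, q):
        # Radix-2 decimation in time, equations (8.56)-(8.57): g is a root of
        # order len(x), and g^2 is a root of order len(x)/2 for the half-size FNTs.
        N = len(x)
        if N == 1:
            return [x[0] % q]
        even = _fnt_rec(x[0::2], (g * g) % q, q)
        odd = _fnt_rec(x[1::2], (g * g) % q, q)
        X = [0] * N
        w = 1
        for k in range(N // 2):
            X[k] = (even[k] + w * odd[k]) % q            # (8.56)
            X[k + N // 2] = (even[k] - w * odd[k]) % q   # (8.57)
            w = (w * g) % q
        return X

    # Example: the length-8 FNT modulo F_2 = 17 of the unit impulse is the all-ones sequence.
    print(fnt([1, 0, 0, 0, 0, 0, 0, 0], 2))   # [1, 1, 1, 1, 1, 1, 1, 1]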

8.3.2 Arithmetic Modulo Fermat Numbers

Any integer x_m defined modulo a Fermat number F_t = 2^v + 1, v = 2^t, can be represented as a (v + 1)-bit word

x_m = Σ_{i=0}^{v-1} x_{m,i} 2^i + x_{m,v} 2^v, x_{m,i} ∈ (0, 1). (8.58)

Since x_m ≤ 2^v, x_{m,v} is equal to 1 only if all the x_{m,i} are equal to zero for i < v. Negation can be realized by complementing all bits x_{m,i} of x_m, except x_{m,v}. Thus, if we treat x_{m,v} separately, the integer x̄_m obtained by this complementation satisfies x_m + x̄_m = Σ_{i=0}^{v-1} 2^i = 2^v - 1, and

-x_m ≡ x̄_m + 2 modulo (2^v + 1), x_{m,v} = 0 (8.59)

-x_m ≡ 1 modulo (2^v + 1), x_{m,v} = 1. (8.60)

Hence, negation requires an addition and a complementation together with several auxiliary operations in order to deal with the case x_{m,v} = 1.
If we add two integers x_m and h_n, the result

c_n = x_m + h_n = Σ_{i=0}^{v+1} c_{n,i} 2^i (8.61)

cannot be greater than 2^{v+1} and, since 2^v ≡ -1 and c_{n,v+1} = 1 only for c_{n,i} = 0, i = 0, ..., v,

c_n ≡ Σ_{i=0}^{v-1} c_{n,i} 2^i - c_{n,v}, c_{n,v+1} = 0 (8.62)

c_n ≡ -2 ≡ 2^v - 1 modulo (2^v + 1), c_{n,v+1} = 1. (8.63)
If we multiply an integer x_m by 2^d, we obtain, for x_{m,v} = 0, an integer result

c_n = Σ_{i=0}^{v-1-d} x_{m,i} 2^{i+d} + 2^v Σ_{i=v-d}^{v-1} x_{m,i} 2^{i+d-v} (8.64)

and, since 2^v ≡ -1,

c_n ≡ Σ_{i=0}^{v-1-d} x_{m,i} 2^{i+d} - Σ_{i=v-d}^{v-1} x_{m,i} 2^{i+d-v}, x_{m,v} = 0. (8.65)

For x_{m,v} = 1, c_n reduces to

c_n ≡ -2^d modulo (2^v + 1), x_{m,v} = 1. (8.66)

Therefore, arithmetic modulo a Fermat number is significantly more complex than arithmetic modulo a Mersenne number. However, the practical implementation can be greatly simplified by using various data code translation techniques [8.8-10]. In one of these techniques, the input sequence x_m is first mapped into a new sequence

(8.67)

or, by introducing the integer x̄_m obtained by complementing the v least significant bits x_{m,i} of x_m,

(8.68)

which indicates that the coded samples a_m are obtained by simply complementing the v least significant bits of x_m and adding 1 to x̄_m + x_{m,v} 2^v.
With this technique, the input data stream is encoded only once prior to transform computation and all operations are performed on the coded sequences, with a single decoding operation on the final result, this decoding operation being also defined by (8.68). We now demonstrate that arithmetic modulo Fermat numbers on the coded sequences a_m is much simpler than on the original sequences.
Consider first negation. If a_{m,v} = 1, then x_m = 0 and no modification is required. If a_{m,v} = 0, coding the complement ā_m of a_m yields

(8.69)

or, with (8.67) and -1 ≡ 2^v,

(8.70)

which shows, by comparison with (8.67), that ā_m is the coded representation of -x_m. Thus, negation is performed on the coded samples by a simple complementation, except when a_{m,v} = 1. If a_m and b_m are the coded values of two integers x_m and h_n, the sum of a_m and b_m is given by

e_m = a_m + b_m = Σ_{i=0}^{v} e_{m,i} 2^i (8.71)

with

e_{m,i} ∈ (0, 1). (8.72)

The coded value d_m of x_m + h_n is defined by

(8.73)

Thus,

d_m = e_m - 2^v ≡ Σ_{i=0}^{v-1} e_{m,i} 2^i + 1 - e_{m,v}, (8.74)

which indicates that addition in the transposed system is executed with ordinary adders, but with the high order carry fed back, after complementation, to the least significant carry input of the adder. If one or both bits a_{m,v} and b_{m,v} are nonzero, one or both of the operands x_m and h_n are zero. In this case, the operation must be inhibited.
It can be verified easily, from the rules of addition, that multiplication by 2^d corresponds in the transposed system to a simple d-bit rotation around the v-bit word, with complementation of the overflow bits. When a_{m,v} = 1, we have x_m = 0 and the inversion of the overflow bits is inhibited. The process is illustrated in Fig. 8.2 for a multiplication by 2.

[Fig. 8.2. Multiply-by-two circuit in the transposed Fermat number system]

Therefore, the foregoing code translation technique reduces the arithmetic operations to one's complement arithmetic, with the exception that the overflow bits are complemented before the end-around carry and that some additional hardware must be used for treating separately the cases corresponding to zero-valued input data items. With this approach, arithmetic modulo a Fermat number is only slightly more complex than arithmetic modulo a Mersenne number.
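Returning to the standard (uncoded) representation, the shift-based multiplication of (8.64)-(8.65) can be sketched as follows (our illustration; the function name is hypothetical): the low-order bits are shifted up, while the bits that overflow past position v re-enter negated.

    def mul_pow2_mod_fermat(x, d, v):
        # 2^d * x modulo 2^v + 1 for 0 <= x <= 2^v, via (8.64)-(8.65).
        q = (1 << v) + 1
        d %= 2 * v                       # 2 has order 2v modulo 2^v + 1
        if d >= v:                       # 2^v ≡ -1: absorb a sign change
            return (q - mul_pow2_mod_fermat(x, d - v, v)) % q
        low = x & ((1 << (v - d)) - 1)   # bits 0 .. v-d-1
        high = x >> (v - d)              # overflow bits, fed back with a minus sign
        return ((low << d) - high) % q

    # Example modulo 2^4 + 1 = 17: 2^3 * 5 = 40 ≡ 6 (mod 17)
    print(mul_pow2_mod_fermat(5, 3, 4))   # 6
    print((5 * 8) % 17)                   # 6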

8.3.3 Computation of Complex Convolutions by FNTs

We now consider a complex circular convolution y_l + j ỹ_l defined modulo (2^v + 1),

y_l + j ỹ_l ≡ Σ_{n=0}^{N-1} (h_n + j h̃_n)(x_{l-n} + j x̃_{l-n}) modulo (2^v + 1), j = √-1, l = 0, ..., N - 1, (8.75)

where h_n, x_m, and y_l are the real parts of the input and output sequences and h̃_n, x̃_m, and ỹ_l are the imaginary parts of the input and output sequences. With the conventional approach, this complex convolution is calculated by evaluating four real convolutions:
y_l ≡ Σ_{n=0}^{N-1} h_n x_{l-n} - Σ_{n=0}^{N-1} h̃_n x̃_{l-n} modulo (2^v + 1) (8.76)

ỹ_l ≡ Σ_{n=0}^{N-1} h_n x̃_{l-n} + Σ_{n=0}^{N-1} h̃_n x_{l-n} modulo (2^v + 1). (8.77)

We shall now show that the evaluation of y_l + j ỹ_l can be done with only two real convolutions by taking advantage of the special properties of j = √-1 in certain rings of integers [8.11, 12]. This is done by noting that in the ring of integers modulo (2^v + 1), with v even, we have 2^v ≡ -1, which means that j = √-1 is congruent to 2^{v/2}. y_l + j ỹ_l is evaluated by first computing the two real auxiliary convolutions a_l and b_l defined by

a_l ≡ Σ_{n=0}^{N-1} (h_n + 2^{v/2} h̃_n)(x_{l-n} + 2^{v/2} x̃_{l-n}) modulo (2^v + 1) (8.78)

b_l ≡ Σ_{n=0}^{N-1} (h_n - 2^{v/2} h̃_n)(x_{l-n} - 2^{v/2} x̃_{l-n}) modulo (2^v + 1). (8.79)

Since 2^v ≡ -1, we have

y_l ≡ (a_l + b_l)/2 modulo (2^v + 1) (8.80)

ỹ_l ≡ 2^{-v/2}(a_l - b_l)/2 modulo (2^v + 1). (8.81)

Therefore this method supports the computation of a circular convolution by


FNTs with only two multiplications per complex output sample instead of four
in the conventional case.
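The two-convolution trick of (8.78)-(8.81) can be checked with the following sketch (ours; direct modular convolutions stand in for the FNTs, and all names are hypothetical):

    def real_circ_conv_mod(h, x, q):
        # direct N-point circular convolution modulo q (stand-in for an FNT-based one)
        N = len(h)
        return [sum(h[n] * x[(l - n) % N] for n in range(N)) % q for l in range(N)]

    def complex_circ_conv(h, ht, x, xt, v):
        # (8.78)-(8.81): modulo q = 2^v + 1, v even, j = sqrt(-1) ≡ 2^(v/2),
        # so the complex convolution needs only the two real convolutions a_l, b_l.
        q = (1 << v) + 1
        r = 1 << (v // 2)                  # r^2 = 2^v ≡ -1 (mod q)
        inv2 = (q + 1) // 2                # inverse of 2 modulo the odd q
        inv_r = pow(r, 3, q)               # r^4 ≡ 1, hence r^{-1} ≡ r^3
        N = len(h)
        a = real_circ_conv_mod([(h[n] + r * ht[n]) % q for n in range(N)],
                               [(x[n] + r * xt[n]) % q for n in range(N)], q)
        b = real_circ_conv_mod([(h[n] - r * ht[n]) % q for n in range(N)],
                               [(x[n] - r * xt[n]) % q for n in range(N)], q)
        y  = [(a[l] + b[l]) * inv2 % q for l in range(N)]            # real part (8.80)
        yt = [(a[l] - b[l]) * inv2 * inv_r % q for l in range(N)]    # imaginary part (8.81)
        return y, yt

For data small enough that no wrap-around occurs modulo 2^v + 1, y and yt agree with the real and imaginary parts of the ordinary complex circular convolution.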

8.4 Word Length and Transform Length Limitations

When a convolution is computed via NTTs, all the calculations are executed on integer data sequences and the convolution product is obtained modulo q without roundoff errors. This feature provides a significant advantage over other methods when high accuracy is needed, but can also impose a requirement for relatively long words to ensure that the result remains within the modulo range. In order to analyze these limitations for arithmetic operations modulo q, we assume that the two input sequences x_n and h_n are integer sequences. The output y_n of the ordinary convolution is given by

y_n = Σ_m h_m x_{n-m}. (8.82)

The output ŷ_n of the same convolution computed modulo q will be numerically equal to y_n if

|y_n| < q/2. (8.83)

This condition will be met for a length-N circular convolution if

|h_n|_max |x_n|_max < q/(2N), (8.84)

which means that the word length of the original input sequences must be slightly less than half of the word length corresponding to the modulus q. Tighter bounds on input signal amplitudes can be found using the L_r norms [8.13] defined by

||x||_r = ((1/N) Σ_{n=0}^{N-1} |x_n|^r)^{1/r}, r ≥ 1. (8.85)

y_n is bounded by

|y_n| ≤ N ||x||_r ||h||_s (8.86)

with

1/r + 1/s = 1, r, s ≥ 1. (8.87)

These bounds are better than (8.84), especially when the circular convolutions
are used to compute aperiodic convolutions by the overlap-add method, with
both input sequences padded out with zeros.
Thus, when convolutions are computed by NTTs, the only source of quantization noise is the input quantization noise resulting from the scaling and rounding of the input sequences required to avoid overflow [8.14]. This implies that the output signal-to-noise ratio (SNR) is relatively independent of the convolution length N and increases by 3 dB for each increase of word length by one bit. By contrast, when the convolution is evaluated by FFTs with fixed word length, one must account for the roundoff noise incurred in FFT computations and the SNR increases by about 6 dB for each additional bit of word length and decreases by about 2 dB for every doubling of the convolution length. This shows that for fixed word lengths, the computation by NTTs is less noisy than the computation by FFTs for long convolutions. For words of 12 bits, for instance, NTT filtering gives a better SNR than FFT filtering for N greater than about 32.
This motivates one to use NTTs for computing long convolutions. However, for a given modulus q, the maximum number of distinct powers of ±2 is equal to 2a, with a = ⌈log_2 q⌉, where a is the smallest integer such that a ≥ log_2 q. Thus, for NTTs computed without multiplications, there is a rigid relationship between word length and maximum convolution length, and long convolutions imply long word lengths, even if these long word lengths far exceed the desired accuracy.
One solution to this problem consists of simply computing the convolutions y_{1,n} and y_{2,n} of two consecutive blocks simultaneously. Assuming, for instance, that h_n is a fixed input sequence of positive integers and that x_{1,n} and x_{2,n} are two consecutive positive integer sequences, the two length-N convolutions y_{1,n} and y_{2,n} can be computed in a single step by

x_n = x_{1,n} + 2^e x_{2,n} (8.88)

y_n ≡ h_n * x_n modulo q (8.89)

y_{1,n} ≡ y_n modulo 2^e (8.90)

y_{2,n} = (y_n - y_{1,n})/2^e (8.91)

with

e = ⌈log_2 q⌉/2. (8.92)

With this method, the transform length is doubled for a given accuracy and there is no overflow, provided |y_{1,n}|, |y_{2,n}| < (√q)/2.
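A sketch of this block-packing idea (ours; again a direct convolution modulo q stands in for the NTT, and the names are hypothetical):

    def packed_double_convolution(h, x1, x2, q, e):
        # (8.88)-(8.91): pack two positive-integer blocks into one sequence,
        # convolve once modulo q, then unpack the two results from the bit fields.
        N = len(h)
        x = [(x1[n] + (x2[n] << e)) % q for n in range(N)]
        y = [sum(h[n] * x[(l - n) % N] for n in range(N)) % q for l in range(N)]
        y1 = [yl % (1 << e) for yl in y]                     # (8.90)
        y2 = [(yl - y1[l]) >> e for l, yl in enumerate(y)]   # (8.91)
        return y1, y2

    # Toy check with q = 2^13 - 1 and e = 6 (small positive data, so no overflow):
    h  = [1, 2, 0, 1]
    x1 = [3, 0, 2, 1]
    x2 = [1, 1, 0, 2]
    print(packed_double_convolution(h, x1, x2, (1 << 13) - 1, 6))
    # ([5, 8, 3, 8], [6, 3, 4, 3]) — the two separate circular convolutions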
Another solution to computing long convolutions with NTTs using moderate word lengths is possible by mapping the one-dimensional convolution of length N into multidimensional convolutions using one of the methods discussed in Chap. 3. For instance, if N is the product of d distinct Mersenne numbers N_1, N_2, ..., N_d, with N = N_1 N_2 ... N_d, we can map the length-N convolution into a d-dimensional convolution of size N_1 × N_2 × ... × N_d by using the Agarwal-Cooley algorithm (Sect. 3.3.1). This is always possible because all Mersenne numbers are mutually prime (theorem 2.15). The nested convolutions are then calculated with Mersenne transforms defined modulo N_1, N_2, ..., N_d and the convolution product y_l is obtained without overflow provided that |y_l| < N_1/2, where N_1 is the smallest Mersenne number.

8.5 Pseudo Transforms

We have seen that Mersenne transforms defined modulo (2^p - 1), with p prime, and Fermat number transforms defined modulo (2^{2^t} + 1) can be used to compute circular convolutions. Both transforms are computed without multiplication but suffer serious limitations which relate mainly to the lack of an FFT-type algorithm for Mersenne transforms and to the problems associated with arithmetic modulo (2^{2^t} + 1) for FNTs. It would seem difficult to consider the use of any modulus other than a Mersenne or a Fermat number because of the problems associated with the corresponding arithmetic. In the following, however, we shall show that these difficulties can be circumvented by defining NTTs modulo integers q_2 which are factors of pseudo Mersenne numbers q, with q = 2^p - 1, p composite, or of pseudo Fermat numbers q, with q = 2^v + 1, v ≠ 2^t. In both cases, q is composite and can always be defined as the product of two factors

q = q_1 q_2. (8.93)

Under these conditions, if an N-length NTT which supports circular convolution can be defined modulo q_2, this NTT computes the N-length convolution y_l modulo q_2, with

y_l ≡ Σ_{n=0}^{N-1} h_n x_{l-n} modulo q_2. (8.94)

The difficulty of performing the arithmetic operations modulo q_2 can be circumvented by exploiting the fact that q_2 is a factor of q. Thus y_l can be computed modulo q, with just one final reduction modulo q_2,

y_l ≡ (Σ_{n=0}^{N-1} h_n x_{l-n} modulo q) modulo q_2. (8.95)

With this method [8.15, 16], if q is a pseudo Mersenne number, all operations but the last reduction are done in one's complement arithmetic. The price to be paid for use of this approach is that all operations modulo (2^p - 1) must be executed on word lengths longer than that of the final result. However, the increase in the number of operations is very limited when q_1 is small and the penalty is more than offset by the fact that p need no longer be a prime.
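The principle of (8.94)-(8.95) can be sketched as follows (our illustration; a direct convolution stands in for the transform computation, and the example modulus is taken from Table 8.1 below):

    def convolution_mod_factor(h, x, q, q2):
        # (8.94)-(8.95): compute the convolution modulo q (easy arithmetic),
        # then reduce the result once modulo its factor q2.
        assert q % q2 == 0
        N = len(h)
        y_mod_q = [sum(h[n] * x[(l - n) % N] for n in range(N)) % q for l in range(N)]
        return [yl % q2 for yl in y_mod_q]

    # Example with the pseudo Mersenne number q = 2^15 - 1 and q2 = (2^15 - 1)/7:
    q = (1 << 15) - 1
    print(convolution_mod_factor([1, 2, 3, 0, 1], [2, 0, 1, 1, 0], q, q // 7))
    # [5, 5, 8, 3, 7]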

8.5.1 Pseudo Mersenne Transforms

We shall first consider the use of pseudo Mersenne transforms defined modulo q = 2^p - 1, with p composite. For p even, q = (2^{p/2} - 1)(2^{p/2} + 1), and the transform length cannot be longer than that which is possible for 2^{p/2} - 1. Thus, we need be concerned only with the cases corresponding to p odd. In order to specify the pseudo Mersenne transforms, we shall use the following theorem introduced by Erdelsky [8.4].
Theorem 8.7: Given a prime number p_1 and two integers u and g such that u ≥ 1, |g| ≥ 2, g ≢ 1 modulo p_1, the NTT of length N = p_1^u and of root g supports circular convolution. This NTT is defined modulo q_2 = (g^{p_1^u} - 1)/(g^{p_1^{u-1}} - 1).
In order to demonstrate this theorem, we must establish that the three conditions of theorem 8.4 are satisfied:

g^N ≡ 1 modulo q_2 (8.96)

N N^{-1} ≡ 1 modulo q_2 (8.97)

[(g^{p_1^{u-1}} - 1), q_2] = 1. (8.98)

The condition (8.96) follows immediately from the fact that g^N ≡ 1 modulo (g^{p_1^u} - 1), with q_2 a factor of (g^{p_1^u} - 1). For the condition (8.97), we note that Fermat's theorem implies that

g^{p_1} ≡ g modulo p_1. (8.99)

Hence, by repeated multiplications, g^{p_1^u} = g^N ≡ g modulo p_1 and g^{p_1^{u-1}} ≡ g modulo p_1. Since g ≢ 1 modulo p_1, we have q_2 ≡ 1 modulo p_1. Therefore, (p_1, q_2) = 1 and (N, q_2) = 1, which implies that N has an inverse modulo q_2.
In order to establish condition (8.98), note that

g^{p_1^u} - 1 = (g^{p_1^{u-1}} - 1) q_2 (8.100)

or

q_2 ≡ p_1 modulo (g^{p_1^{u-1}} - 1), (8.101)

which implies that g^{p_1^{u-1}} - 1 is mutually prime with q_2 if [(g^{p_1^{u-1}} - 1), p_1] = 1. This condition is immediately established, because g^{p_1^{u-1}} ≡ g modulo p_1 and g ≢ 1 modulo p_1. Therefore (8.98) is proved and this completes the demonstration of the theorem.
[Fig. 8.3. Computation of a circular convolution modulo (2^{p_1^2} - 1)/(2^{p_1} - 1) by pseudo Mersenne transforms defined modulo (2^{p_1^2} - 1), p_1 prime]

We can now derive two classes of pseudo Mersenne transforms from theorem 8.7. One set of pseudo Mersenne transforms is obtained by setting g = 2 and u = 2. This yields pseudo Mersenne transforms of length N = p_1^2, p_1 prime, and defined modulo q_2, where

q_2 = (2^{p_1^2} - 1)/(2^{p_1} - 1). (8.102)

Using these pseudo transforms, computation is executed modulo q, with q = 2^{p_1^2} - 1, on data words of p_1^2 bits and the final result is obtained by a final reduction modulo q_2, giving words of approximately p_1^2 - p_1 bits. Thus, the length-p_1^2 circular convolution modulo (2^{p_1^2} - 1)/(2^{p_1} - 1) is computed as shown in Fig. 8.3, with the pseudo Mersenne transform

X_k ≡ Σ_{m=0}^{p_1^2 - 1} x_m 2^{mk} modulo (2^{p_1^2} - 1) (8.103)

and with similar definitions for the transform H_k of h_n and for the inverse transform.
Another class of pseudo Mersenne transforms is derived from theorem 8.7 by setting u = 1, g = 2^{p_2}, and p = p_1 p_2. This gives pseudo Mersenne transforms of length N = p_1, p_1 prime, and defined modulo q_2, where

q_2 = (2^{p_1 p_2} - 1)/(2^{p_2} - 1). (8.104)

With this pseudo transform, computation is executed modulo (2^{p_1 p_2} - 1) on data word lengths of p_1 p_2 bits and the final result is obtained modulo (2^{p_1 p_2} - 1)/(2^{p_2} - 1) on words of approximately p_2(p_1 - 1) bits.
Table 8.1 lists the parameters for various pseudo Mersenne transforms defined modulo (2^p - 1), with p odd. The most interesting transforms are those which have a composite number of terms and a useful word length as close as possible to p. In particular, the transform of 49 terms defined modulo (2^49 - 1) seems to be particularly interesting, since it can be computed with a 2-stage FFT-type algorithm and an effective word length which is only about 15% shorter than the computation word length.

Table 8.1. Parameters for various pseudo Mersenne transforms defined modulo (2^p - 1) and convolutions defined modulo q_2, with q_2 a factor of 2^p - 1

 p   Prime factorization of 2^p - 1   Modulus q_2       Transform length N   Root g   Effective word length (bits)
 15  7·31·151                         (2^15 - 1)/7       5                   2^3      12
 21  7^2·127·337                      (2^21 - 1)/49      7                   2^3      15
 25  31·601·1801                      (2^25 - 1)/31     25                   2        20
 27  7·73·262657                      (2^27 - 1)/511    27                   2        18
 35  31·71·127·122921                 (2^35 - 1)/3937   35                   2        23
 35  31·71·127·122921                 (2^35 - 1)/127     5                   2^7      28
 35  31·71·127·122921                 (2^35 - 1)/31      7                   2^5      30
 45  7·31·73·151·631·23311            (2^45 - 1)/511     5                   2^9      36
 49  127·4432676798593                (2^49 - 1)/127     7                   2^7      42
 49  127·4432676798593                (2^49 - 1)/127    49                   2        42

8.5.2 Pseudo Fermat Number Transforms

Pseudo Fermat number transforms are defined modulo q, with q = 2^v + 1, v ≠ 2^t. In order to specify the pseudo Fermat number transforms, we shall use the following theorem introduced by Erdelsky [8.4].
Theorem 8.8: Given an odd prime integer v_1 and two integers u and g such that u ≥ 1, g ≥ 2, g ≢ -1 modulo v_1, the NTT of length N = 2 v_1^u and of root g supports circular convolution. This NTT is defined modulo q_2 = (g^{v_1^u} + 1)/(g^{v_1^{u-1}} + 1).
In order to prove this theorem, we must demonstrate that the three conditions of theorem 8.4 are satisfied:

g^N ≡ 1 modulo q_2 (8.105)

N N^{-1} ≡ 1 modulo q_2 (8.106)

[(g^{v_1^u} - 1), q_2] = 1 and [(g^{2v_1^{u-1}} - 1), q_2] = 1. (8.107)

Since g^{v_1^u} ≡ -1 modulo (g^{v_1^u} + 1), we have g^N ≡ 1 modulo (g^{v_1^u} + 1) and, therefore, g^N ≡ 1 modulo q_2, because q_2 is a factor of g^{v_1^u} + 1. To establish the condition (8.106), we note that Fermat's theorem implies that

g^{v_1} ≡ g modulo v_1. (8.108)

Hence, by repeated multiplications, g^{v_1^{u-1}} ≡ g^{v_1^u} ≡ g modulo v_1. Thus, q_2 ≡ 1 modulo v_1, since g ≢ -1 modulo v_1. Therefore, (v_1, q_2) = 1 and we have (N, q_2) = 1 provided that q_2 is odd. When g is even, q_2 is obviously odd. When g is odd, we have

q_2 = (g^{v_1^{u-1}} + 1)[g^{(v_1-2)v_1^{u-1}} - 2g^{(v_1-3)v_1^{u-1}} + 3g^{(v_1-4)v_1^{u-1}} - ... - (v_1 - 1)] + v_1, (8.109)

which implies that q_2 is odd, since v_1 is odd. Thus we have (N, q_2) = 1 and N has an inverse modulo q_2.
In order to establish that the condition [(g^{v_1^u} - 1), q_2] = 1 corresponding to (8.107) is met, we note that

g^{v_1^u} - 1 ≡ -2 modulo q_2, (8.110)

which implies that [(g^{v_1^u} - 1), q_2] = (q_2, 2) and, since q_2 is odd,

[(g^{v_1^u} - 1), q_2] = 1. (8.111)

In order to establish that the condition [(g^{2v_1^{u-1}} - 1), q_2] = 1 is satisfied, we note that, since g^{2v_1^{u-1}} - 1 = (g^{v_1^{u-1}} + 1)(g^{v_1^{u-1}} - 1), this condition corresponds to [(g^{v_1^{u-1}} + 1), q_2] = 1 and [(g^{v_1^{u-1}} - 1), q_2] = 1. The condition (8.110) implies that [(g^{v_1^{u-1}} - 1), q_2] = 1, since (g^{v_1^{u-1}} - 1) is a factor of (g^{v_1^u} - 1). We note also that (8.109) implies

q_2 ≡ v_1 modulo (g^{v_1^{u-1}} + 1) (8.112)

and, since g ≢ -1 modulo v_1 and g^{v_1^{u-1}} ≡ g modulo v_1, we have [(g^{v_1^{u-1}} + 1), q_2] = 1, which completes the proof of the theorem.
An immediate application of theorem 8.8 is that, if g = 2 and u = 1, we can define for v_1 ≠ 3 an NTT of length N = 2v_1 which supports circular convolution. This NTT is defined modulo q_2, with q_2 = (2^{v_1} + 1)/3.
Similarly, for u = 2 and g = 2, we have an NTT of length N = 2v_1^2. This NTT has the circular convolution property and is defined modulo q_2, with q_2 = (2^{v_1^2} + 1)/(2^{v_1} + 1).
A systematic application of theorems 8.7 and 8.8 yields a large number of pseudo Fermat number transforms. We summarize the main characteristics of some of these transforms, for v even and v odd, respectively, in Tables 8.2 and 8.3. It can be seen that there is a large choice of pseudo Fermat number transforms having a composite number of terms. This allows one to select word lengths that are more closely tailored to meet the needs of particular applications than when word lengths are limited to powers of two, as with FNTs.

Table 8.2. Parameters for various pseudo Fermat number transforms defined modulo (2^v + 1) and convolutions defined modulo q_2, with q_2 a factor of 2^v + 1; v even

 v   Prime factorization of 2^v + 1   Modulus q_2       Transform length N   Root g   Effective word length (bits)
 20  17·61681                         (2^20 + 1)/17     40                   2        16
 22  5·397·2113                       (2^22 + 1)/5      44                   2        19
 24  97·257·673                       (2^24 + 1)/257    48                   2        16
 26  5·53·157·1613                    (2^26 + 1)/5      52                   2        24
 28  17·15790321                      (2^28 + 1)/17     56                   2        24
 34  5·137·953·26317                  (2^34 + 1)/5      68                   2        32
 38  5·229·457·525313                 (2^38 + 1)/5      76                   2        36
 40  257·4278255361                   (2^40 + 1)/257    80                   2        32
 44  17·353·2931542417                (2^44 + 1)/17     88                   2        40
 46  5·277·1013·1657·30269            (2^46 + 1)/5      92                   2        44

Table 8.3. Parameters for various pseudo Fermat number transforms defined modulo (2^v + 1) and convolutions defined modulo q_2, with q_2 a factor of 2^v + 1; v odd

 v   Prime factorization of 2^v + 1   Modulus q_2       Transform length N   Root g   Effective word length (bits)
 15  3^2·11·331                       (2^15 + 1)/9      10                   2^3      12
 21  3^2·43·5419                      (2^21 + 1)/9      14                   2^3      18
 25  3·11·251·4051                    (2^25 + 1)/33     50                   2        20
 27  3^4·19·87211                     (2^27 + 1)/1539   54                   2        16
 29  3·59·3033169                     (2^29 + 1)/3      58                   2        27
 33  3^2·67·683·20857                 (2^33 + 1)/9      22                   2^3      30
 35  3·11·43·281·86171                (2^35 + 1)/33     14                   2^5      30
 41  3·83·8831418697                  (2^41 + 1)/3      82                   2        39
 45  3^3·11·19·331·18837001           (2^45 + 1)/513    10                   2^9      36
 49  3·43·4363953127297               (2^49 + 1)/129    98                   2        41

The same pseudo transform technique can also be applied to moduli other than 2^p - 1 and 2^v + 1, and the cases corresponding to moduli 2^{2p} - 2^p + 1 are discussed in [8.17]. These moduli are factors of 2^{6p} - 1, and NTTs of dimension 6p and root 2 can be defined modulo some factors of 2^{2p} - 2^p + 1.
When a convolution is calculated by pseudo transforms, the computation is performed modulo q and the final result is obtained modulo q_2, with q = q_1 q_2. Since q_2 < q, it is possible to detect overflow conditions by simply comparing the result of the calculations modulo q with the convolution product defined modulo q_2 [8.18].

8.6 Complex NTTs

We consider a complex integer x + j x̃, where x and x̃ are defined in the field GF(q) of the integers defined modulo a prime q. Thus, x and x̃ can take any integer value between 0 and q - 1. We also assume that j = √-1 is not a member of GF(q), which means that -1 is a quadratic nonresidue modulo q. Then, the Gaussian integers x + j x̃ are similar to ordinary complex numbers, with real and imaginary parts treated separately and addition and multiplication defined by

(x_1 + j x̃_1) + (x_2 + j x̃_2) ≡ (x_1 + x_2) + j(x̃_1 + x̃_2) modulo q (8.113)

(x_1 + j x̃_1)(x_2 + j x̃_2) ≡ (x_1 x_2 - x̃_1 x̃_2) + j(x_1 x̃_2 + x̃_1 x_2) modulo q. (8.114)

Since each integer x and x̃ can take only q distinct values, the Gaussian integers x + j x̃ can take only q^2 distinct values and are said to pertain to the extension field GF(q^2). Since x + j x̃ pertain to a finite field, the successive powers of x + j x̃ given by (x + j x̃)^n, n = 0, 1, 2, ... yield a sequence which is reproduced with
a periodicity N, and we can define roots of order N in GF(q^2). In order to specify the permissible orders of the various roots, we note first that

(x + j x̃)^q = Σ_{i=0}^{q} C_i^q x^{q-i} (j x̃)^i, (8.115)

where the C_i^q are the binomial coefficients

C_i^q = q!/(i!(q - i)!). (8.116)

Since these coefficients are integers, i!(q - i)! must divide q!. However, i!(q - i)! cannot divide q because q is a prime. Therefore i!(q - i)! divides (q - 1)! for i ≠ 0, q, so that C_i^q ≡ 0 modulo q for i ≠ 0, q, and (8.115) reduces to

(x + j x̃)^q ≡ x^q + j^q x̃^q modulo q (8.117)

and, thus, via Fermat's theorem,

(x + j x̃)^q ≡ x + j^q x̃ modulo q. (8.118)

If we now raise x + j^q x̃ to the qth power, we obtain

(x + j x̃)^{q^2} ≡ x + j^{q^2} x̃ modulo q. (8.119)

Since j^{q^2} = j for q^2 ≡ 1 modulo 4, we have in this case

(x + j x̃)^{q^2} ≡ x + j x̃ modulo q. (8.120)

This implies that, when q^2 ≡ 1 modulo 4, any root of order N in GF(q^2) must satisfy the condition

N | (q^2 - 1). (8.121)

The congruence q^2 ≡ 1 modulo 4 has only the two solutions q ≡ 1 modulo 4 and q ≡ 3 modulo 4. We know, however, that j cannot be a member of GF(q). This implies that -1 must be a quadratic nonresidue modulo q and, therefore, that (-1/q) = -1, where (-1/q) is the Legendre symbol. A consequence of theorem 2.11 is that

(-1/q) = (-1)^{(q-1)/2}. (8.122)

This imposes the condition

q ≡ 3 modulo 4. (8.123)
This condition is established easily for Mersenne and pseudo Mersenne transforms defined modulo (2^p - 1), because, in this case,

q = 2^p - 1 ≡ -1 ≡ 3 modulo 4. (8.124)

If we now consider a simple Mersenne transform of dimension p with root 2, we can then define complex Mersenne transforms of lengths 4p and 8p by replacing the real root 2 with the complex roots g_1 and g_2

g_1 = 2j (8.125)

g_2 = 2(1 + j). (8.126)

Since p is an odd prime, we have g_1^{4p} ≡ 1 modulo q and g_2^{8p} ≡ 1 modulo q, with g_1^n and g_2^n taking, respectively, 4p and 8p distinct values for n = 0, ..., 4p - 1 and n = 0, ..., 8p - 1. Thus, we can define complex Mersenne transforms [8.12, 15] of lengths 4p and 8p which support the circular convolution by

X_k ≡ Σ_{m=0}^{4p-1} x_m (2j)^{mk} modulo (2^p - 1) (8.127)

X_k ≡ Σ_{m=0}^{8p-1} x_m (1 + j)^{mk} modulo (2^p - 1). (8.128)

The advantage of these complex Mersenne transforms over the corresponding


real transforms is that, for the same word length, the transform length is in-
creased to 4p and 8p instead of 2p and that the transforms can be computed by
a 3-stage FFT-type algorithm, without multiplications.
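The order claims for these roots can be verified with a few lines of Gaussian integer arithmetic modulo q, following the multiplication rule (8.114) (our sketch; the names are ours):

    def gmul(a, b, q):
        # (8.114): product of Gaussian integers a = (x, xt) and b modulo q
        return ((a[0] * b[0] - a[1] * b[1]) % q, (a[0] * b[1] + a[1] * b[0]) % q)

    def gorder(g, q):
        # multiplicative order of a Gaussian integer modulo q (brute force)
        acc, n = g, 1
        while acc != (1, 0):
            acc, n = gmul(acc, g, q), n + 1
        return n

    p = 5
    q = (1 << p) - 1          # Mersenne number 31, and 31 ≡ 3 (mod 4)
    print(gorder((0, 2), q))  # g1 = 2j has order 4p = 20
    print(gorder((2, 2), q))  # g2 = 2(1 + j) has order 8p = 40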
A similar approach is applicable to pseudo Mersenne transforms defined modulo (2^p - 1) with p odd [8.15] and to pseudo Fermat number transforms defined modulo (2^v + 1) with v odd [8.16]. Complex pseudo Mersenne transforms are particularly interesting because they can be implemented in one's complement arithmetic and their maximum length is both large and highly factorizable, leading to an efficient FFT-type implementation. This is exemplified by the transform defined modulo (2^49 - 1) which operates on data words of 49 bits with an effective word length of 42 bits. This transform has a maximum transform length equal to 392 and can be computed by utilizing 3-stage radix-2 and 2-stage radix-7 FFT-type algorithms.
Reed and Truong [8.19, 20] have investigated the general case of complex Mersenne transforms. They have shown that complex transforms which support the circular convolution can be defined modulo q, with q = 2^p - 1 and p prime, for any length N such that N | (q^2 - 1). Thus, we have

q^2 - 1 = 2^{p+1}(2^{p-1} - 1). (8.129)

For transforms of length N, with N = 2^{p+1}, the roots are given by

g = a + jb (8.130)

with

a ≡ ±2^{2^{p-2}} modulo (2^p - 1) (8.131)

b ≡ ±(-3)^{2^{p-2}} modulo (2^p - 1). (8.132)

This specifies relatively large transforms which operate in one's complement arithmetic and which can be computed entirely by a radix-2 FFT-type algorithm. Unfortunately, the roots are not simple and some general multiplications are required in the computation of the transform.

8.7 Comparison with the FFT

There is a large choice of NTT candidates for the computation of convolutions. From the standpoint of arithmetic operations count, the most useful NTTs are those which can be computed without multiplications and which can be calculated with an FFT-type algorithm while allowing the longest possible transform length for a given word length. In practice, this means that the best NTTs are Fermat number transforms and complex pseudo Mersenne transforms [8.21]. The maximum length for multiplication-free transforms is 128 for an FNT defined modulo (2^32 + 1), with data words of 32 bits, and is 392 for a complex pseudo Mersenne transform defined modulo (2^49 - 1), with an effective word length of 42 bits. Thus, NTTs seem to be best suited for computing the convolution of short and medium length sequences in computers where the cost of multiplications is significantly higher than the cost of additions and when a high accuracy is required. A particularly interesting application concerns the evaluation of data sequences such as phase angles, which are by nature defined modulo an integer. In this case, there is no need to scale down the input sequence data to prevent overflow and both the input and output sequences are defined with the same number of bits.
In all other cases, the input sequences must be scaled down in order to avoid overflow and the number of bits corresponding to the input sequences must be less than half the number of bits of the output sequences. Since all calculations must be performed on words of length equal to that of the output sequence, the price to be paid for obtaining an exact answer with NTTs is the use of word lengths which are approximately twice those corresponding to other methods. For real convolutions, however, NTTs use only real arithmetic instead of complex arithmetic for FFTs, so that the hardware requirements are about the same. A practical comparison between FFTs and FNTs has been reported in [8.7]. It has been shown that for convolutions of lengths in the range 32-2048 computed with FNTs on words of 32 bits, the computer execution times (IBM 370/155) were about 2 to 4 times shorter than with an efficient FFT program. In this comparison, the convolutions of lengths above 128 are computed by two-dimensional FNTs.
An interesting aspect of number theoretic transforms is their analogy with discrete Fourier transforms. NTTs are defined with roots of unity g of order N modulo an integer q, while DFTs are defined with complex roots of unity W of order N in the field of complex numbers. Hence NTTs can be viewed as DFTs defined in the ring of numbers modulo q. In fact, NTTs can also be considered as particular cases of polynomial transforms in which the N-bit words are viewed as polynomials. This is particularly apparent for polynomial transforms of length 2^{t+1} defined modulo (z^{2^t} + 1). Such transforms compute a circular convolution of length 2^{t+1} on polynomials of length 2^t. If the 2^{t+1} input polynomials are defined as words of 2^t bits, the polynomial transform reduces to an FNT of length 2^{t+1}, of root 2 and defined modulo (2^{2^t} + 1). Thus, polynomial transforms and NTTs are DFTs defined in finite fields and rings of polynomials or integers. Their main advantage over DFTs is that systematic advantage is taken of the operation in finite fields or rings to define simple roots of unity, which allows one to eliminate the multiplications for transform computation and to replace complex arithmetic by real arithmetic.
References

Chapter 2
2.1 T. Nagell: Introduction to Number Theory, 2nd ed. (Chelsea, New York 1964)
2.2 G. H. Hardy, E. M. Wright: An Introduction to the Theory of Numbers, 4th ed. (Oxford
University Press, Ely House, London 1960)
2.3 N. H. McCoy: The Theory of Numbers (MacMillan, New York 1965)
2.4 J. H. McClellan, C. M. Rader: Number Theory in Digital Signal Processing (Prentice-
Hall, Englewood Cliffs, N. J. 1979)
2.5 M. Abramowitz, I. Stegun: Handbook of Mathematical Functions, 7th ed. (Dover, New
York 1970) pp. 864-869
2.6 W. Sierpinski: Elementary Theory of Numbers (Polska Akademia Nauk Monographie
Matematyczne, Warszawa 1964)
2.7 I. M. Vinogradov: Elements of Number Theory, (Dover, New York 1954)
2.8 D. J. Winter: The Structure of Fields, Graduate Texts in Mathematics, Vol. 16 (Springer,
Berlin, New York, Heidelberg 1974)
2.9 R. C. Agarwal, J. W. Cooley: New algorithms for digital convolution. IEEE Trans.
ASSP-25, 392-410 (1977)
2.10 J. H. Griesmer, R. D. Jenks: "SCRATCHPAD I. An Interactive Facility for Symbolic
Mathematics", in Proc. Second Symposium on Symbolic and Algebraic Manipulation,
ACM, New York, 42-58 (1971)
2.11 S. Winograd: On computing the discrete Fourier transform. Math. Comput. 32, 175-199
(1978)
2.12 S. Winograd: Some bilinear forms whose multiplicative complexity depends on the field
of constants. Math. Syst. Th., 10, 169-180 (1977)

Chapter 3
3.1 T. G. Stockham: "Highspeed Convolution and Correlation", in 1966 Spring Joint Com-
puter Conf., AFIPS Proc. 28, 229-233
3.2 B. Gold, C. M. Rader, A. V. Oppenheim, T. G. Stockham: Digital Processing of Signals,
(McGraw-Hill, New York 1969) pp. 203-213
3.3 R. C. Agarwal, J. W. Cooley: "New Algorithms for Digital Convolution", in 1977 Intern.
Conf., Acoust., Speech, Signal Processing Proc., p. 360
3.4 I. J. Good: The relationship between two fast Fourier transforms. IEEE Trans. C-20, 310-317 (1971)
3.5 R. C. Agarwal, J. W. Cooley: New algorithms for digital convolution. IEEE ASSP-2S,
392-410 (1977)
3.6 H. J. Nussbaumer: "New Algorithms for Convolution and DFT Based on Polynomial
Transforms", in IEEE 1978 Intern. Conf. Acoust., Speech, Signal Processing Proc., pp.
638-641
3.7 H. J. Nussbaumer, P. Quandalle: Computation of convolutions and discrete Fourier
transforms by polynomial transforms. IBM J. Res. Dev., 22, 134-144 (1978)
3.8 R. C. Agarwal, C. S. Burrus: Fast one-dimensional digital convolution by multidimen-
sional techniques. IEEE Trans. ASSP-22, 1-10 (1974)
3.9 H. J. Nussbaumer: Fast polynomial transform algorithms for digital convolution. IEEE
Trans. ASSP-28, 205-215 (1980)

3.10 A. Croisier, D. J. Esteban, M. E. Levilion, V. Riso: Digital Filter for PCM Encoded
Signals, US Patent 3777130, Dec. 4, 1973
3.11 C. S. Burrus: Digital filter structures described by distributed arithmetic. IEEE Trans.
CAS-24, 674-680 (1977)
3.12 D. E. Knuth: The Art of Computer Programming, Vol. 2, Semi-Numerical Algorithms
(Addison-Wesley, New York 1969)

Chapter 4
4.1 B. Gold, C. M. Rader: Digital Processing of Signals (McGraw-Hili, New York 1969)
4.2 E. O. Brigham: The Fast Fourier Transform (Prentice-Hall, Englewood Cliffs, N. J. 1974)
4.3 L. R. Rabiner, B. Gold: Theory and Application of Digital Signal Processing (Prentice-
Hall, Englewood Cliffs, N. J. 1975)
4.4 A. V. Oppenheim, R. W. Schafer: Digital Signal Processing (Prentice-Hall, Englewood
Cliffs, N. J. 1975)
4.5 A. E. Siegman: How to compute two complex even Fourier transforms with one trans-
form step. Proc. IEEE 63, 544 (1975)
4.6 J. W. Cooley, J. W. Tukey: An algorithm for machine computation of complex Fourier
series. Math. Comput. 19,297-301 (1965)
4.7 G. D. Bergland: A fast Fourier transform algorithm using base 8 iterations. Math. Com-
put. 22, 275-279 (1968)
4.8 R. C. Singleton: An algorithm for computing the mixed radix fast Fourier transform.
IEEE Trans. AU-17, 93-103 (1969)
4.9 R. P. Polivka, S. Pakin: APL: the Language and Its Usage (Prentice-Hall, Englewood
Cliffs, N. J. 1975)
4.10 P. D. Welch: A fixed-point fast Fourier transform error analysis. IEEE Trans. AU-I7,
151-157 (1969)
4.11 T. K. Kaneko, B. Liu: Accumulation of round-off errors in fast Fourier transforms. J.
Assoc. Comput. Mach. 17, 637-654 (1970)
4.12 C. J. Weinstein: Roundoff noise in floating point fast Fourier transform computation.
IEEE Trans. AU-I7, 209-215 (1969)
4.13 C. M. Rader, N. M. Brenner: A new principle for fast Fourier transformation. IEEE
Trans. ASSP-24, 264-265 (1976)
4.14 S. Winograd: On computing the discrete Fourier transform. Math. Comput. 32, 175-199
(1978)
4.15 K. M. Cho, G. C. Temes: "Real-factor FFT algorithms", in IEEE 1978 Intern. Conf.
Acoust., Speech, Signal Processing, pp. 634-637
4.16 H. J. Nussbaumer, P. Quandalle: Fast computation of discrete Fourier transforms using
polynomial transforms. IEEE Trans. ASSP-27, 169-181 (1979)
4.17 G. Bonnerot, M. Bellanger: Odd-time odd-frequency discrete Fourier transform for sym-
metric real-valued series. Proc. IEEE 64,392-393 (1976)
4.18 G. Bruun: z-transform DFT filters and FFTs. IEEE Trans. ASSP-26, 56-63 (1978)
4.19 G. K. McAuliffe: "Fourier Digital Filter or Equalizer and Method of Operation There-
fore", US Patent No.3 679 882, July 25, 1972

Chapter 5
5.1 L. I. Bluestein: "A Linear Filtering Approach to the Computation of the Discrete Fourier
Transform", in 1968 Northeast Electronics Research and Engineering Meeting Rec., pp.
218-219
5.2 L. I. Bluestein: A linear filtering approach to the computation of the discrete Fourier
transform. IEEE Trans. AU-IS, 451-455 (1970)
5.3 C. M. Rader: Discrete Fourier transforms when the number of data samples is prime.
Proc. IEEE 56, 1107-1108 (1968)
5.4 S. Winograd: On computing the discrete Fourier transform. Proc. Nat. Acad. Sci. USA
73, 1005-1006 (1976)

5.5 L. R. Rabiner, R. W. Schafer, C. M. Rader: The Chirp z-transform algorithm and its
application. Bell Syst. Tech. J. 48, 1249-1292 (1969)
5.6 G. R. Nudd, O. W. Otto: Real-time Fourier analysis of spread spectrum signals
using surface-wave-implemented chirp-z transformation. IEEE Trans. MTT-24, 54-56
(1975)
5.7 M. J. Narasimha, K. Shenoi, A. M. Peterson: "Quadratic Residues: Application to Chirp
Filters and Discrete Fourier Transforms", in IEEE 1976 Acoust., Speech, Signal Pro-
cessing Proc., pp. 376-378
5.8 M. J. Narasimha: "Techniques in Digital Signal Processing", Tech. Rpt. 3208-3, Stanford
Electronics Laboratory, Stanford University (1975)
5.9 J. H. McClellan, C. M. Rader: Number Theory in Digital Signal Processing (Prentice-Hall,
Englewood Cliffs, N. J. 1979)
5.10 H. J. Nussbaumer, P. Quandalle: Fast computation of discrete Fourier transforms using
polynomial transforms. IEEE Trans. ASSP-27, 169-181 (1979)
5.11 I. J. Good: The interaction algorithm and practical Fourier analysis. J. Roy. Stat. Soc.
B-20, 361-372 (1958); 22,372-375 (1960)
5.12 I. J. Good: The relationship between two fast Fourier transforms. IEEE Trans. C-20,
310-317 (1971)
5.13 D. P. Kolba, T. W. Parks: A prime factor FFT algorithm using high-speed convolution.
IEEE Trans. ASSP-25, 90-103 (1977)
5.14 C. S. Burrus: "Index Mappings for Multidimensional Formulation of the DFT and Con-
volution", in 1977 IEEE Intern. Symp. on Circuits and Systems Proc., pp. 662-664
5.15 S. Winograd: "A New Method for Computing DFT", in 1977 IEEE Intern. Conf.
Acoust., Speech and Signal Processing Proc., pp. 366-368
5.16 S. Winograd: On computing the discrete Fourier transform. Math. Comput. 32, 175-
199 (1978)
5.17 H. F. Silverman: An introduction to programming the Winograd Fourier transform
algorithm (WFTA). IEEE Trans. ASSP-25, 152-165 (1977)
5.18 R. W. Patterson, J. H. McClellan: Fixed-point error analysis of Winograd Fourier trans-
form algorithms. IEEE Trans. ASSP-26, 447-455 (1978)
5.19 L. R. Morris: A comparative study of time efficient FFT and WFTA programs for general
purpose computers. IEEE Trans. ASSP-26, 141-150 (1978)

Chapter 6
6.1 H. J. Nussbaumer: Digital filtering using polynomial transforms. Electron. Lett. 13, 386-
387 (1977)
6.2 H. J. Nussbaumer, P. Quandalle: Computation of convolutions and discrete Fourier
transforms by polynomial transforms. IBM J. Res. Dev. 22, 134-144 (1978)
6.3 P. Quandalle: "Filtrage numerique rapide par transformees de Fourier et transformees
polynomiales - Etude de l'implantation sur microprocesseurs", These de Doctorat de Specialite, University of Nice, France (18 May 1979)
6.4 R. C. Agarwal, J. W. Cooley: New algorithms for digital convolution. IEEE Trans. ASSP-
25,392--410 (1977)
6.5 B. Arambepola, P. J. W. Rayner: Efficient transforms for multidimensional convolutions.
Electron. Lett. 15, 189-190 (1979)

Chapter 7
7.1 H. J. Nussbaumer, P. Quandalle: Fast computation of discrete Fourier transforms using
polynomial transforms. IEEE Trans. ASSP-27, 169-181 (1979)
7.2 H. J. Nussbaumer, P. Quandalle: "New Polynomial Transform Algorithms for Fast DFT
Computation", in IEEE 1979 Intern. Acoustics, Speech and Signal Processing Conf. Proc.,
pp.510-513
7.3 C. M. Rader: Discrete Fourier transforms when the number of data samples is prime.
Proc. IEEE 56, 1107-1108 (1968)

7.4 G. Bonnerot, M. Bellanger: Odd-time odd-frequency discrete Fourier transform for sym-
metric real-valued series. Proc. IEEE 64, 392-393 (1976)
7.5 C. M. Rader, N. M. Brenner: A new principle for fast Fourier transformation. IEEE
Trans. ASSP-24, 264-266 (1976)
7.6 I. J. Good: The relationship between two fast Fourier transforms. IEEE Trans. C-20,
310-317 (1971)
7.7 S. Winograd: On computing the discrete Fourier transform. Math. Comput. 32, 175-199
(1978)
7.8 H. J. Nussbaumer: DFT computation by fast polynomial transform algorithms. Electron.
Lett. 15,701-702 (1979)
7.9 H. J. Nussbaumer, P. Quandalle: Computation of convolutions and discrete Fourier
transforms by polynomial transforms. IBM J. Res. Dev. 22, 134-144 (1978)
7.10 R. C. Agarwal, J. W. Cooley: New algorithms for digital convolution. IEEE Trans. ASSP-
25, 392--410 (1977)

Chapter 8
8.1 I. J. Good: The relationship between two fast Fourier transforms. IEEE Trans. C-20,
310-317 (1971)
8.2 J. M. Pollard: The fast Fourier transform in a finite field. Math. Comput. 25, 365-374
(1971)
8.3 P. J. Nicholson: Algebraic theory of finite Fourier transforms. J. Comput. Syst. Sci. 5,
524-547 (1971)
8.4 P. J. Erdelsky: "Exact convolutions by number-theoretic transforms"; Rept. No. AD-A013 395, San Diego, Calif., Naval Undersea Center (1975)
8.5 C. M. Rader: Discrete convolutions via Mersenne transforms. IEEE Trans. C-21, 1269-
1273 (1972)
8.6 R. C. Agarwal, C. S. Burrus: Fast convolution using Fermat number transforms with
applications to digital filtering. IEEE Trans. ASSP-22, 87-97 (1974)
8.7 R. C. Agarwal, C. S. Burrus: Number theoretic transforms to implement fast digital con-
volution. Proc. IEEE 63, 550-560 (1975)
8.8 L. M. Leibowitz: A simplified binary arithmetic for the Fermat number transform. IEEE
Trans. ASSP-24, 356-359 (1976)
8.9 J. H. McClellan: Hardware realization of a Fermat number transform. IEEE Trans.
ASSP-24, 216-225 (1976)
8.10 H. J. Nussbaumer: Linear filtering technique for computing Mersenne and Fermat num-
ber transforms. IBM J. Res. Dev. 21, 334-339 (1977)
8.11 H. J. Nussbaumer: Complex convolutions via Fermat number transforms. IBM J. Res.
Dev. 20, 282-284 (1976)
8.12 E. Vegh, L. M. Leibowitz: Fast complex convolutions in finite rings. IEEE Trans. ASSP-
24, 343-344 (1976)
8.13 L. B. Jackson: On the interaction of round-off noise and dynamic range in digital filters.
Bell Syst. Tech. J. 49, 159-184 (1970)
8.14 P. R. Chevillat, F. H. Closs: "Signal processing with number theoretic transforms and
limited word lengths", in IEEE 1978 Intern. Acoustics, Speech and Signal Processing
Conf. Proc., pp. 619-623
8.15 H. J. Nussbaumer: Digital filtering using complex Mersenne transforms. IBM J. Res.
Dev. 20, 498-504 (1976)
8.16 H. J. Nussbaumer: Digital filtering using pseudo Fermat number transforms. IEEE
Trans. ASSP-26, 79-83 (1977)
8.17 E. Dubois, A. N. Venetsanopoulos: "Number theoretic transforms with modulus 2^20 -
2^10 + 1", in IEEE 1978 Intern. Acoustics, Speech and Signal Processing Conf. Proc., pp.
624-627

8.18 H. J. Nussbaumer: Overflow detection in the computation of convolutions by some num-
ber theoretic transforms. IEEE Trans. ASSP-26, 108-109 (1978)
8.19 I. S. Reed, T. K. Truong: The use of finite fields to compute convolutions. IEEE Trans.
IT-21, 208-213 (1975)
8.20 I. S. Reed, T. K. Truong: Complex integer convolutions over a direct sum of Galois fields.
IEEE Trans. IT-21, 657-661 (1975)
8.21 H. J. Nussbaumer: Relative evaluation of various number theoretic transforms for digital
filtering applications. IEEE Trans. ASSP-26, 88-93 (1978)
Subject Index

Agarwal-Cooley algorithm 43, 202, 230
Agarwal-Burrus algorithm 56
Algorithms
   convolution 66, 78
   DFT 123, 144
   polynomial product 73
   reduced DFT 207
APL FFT program 95

Bezout's relation 6
Bit reversal 94
Butterfly 94
Bruun algorithm 104

Chinese remainder theorem
   integers 9, 33, 35, 43, 116, 125, 215
   polynomials 26, 48, 152, 157, 163, 178
Chirp Z-transform 112
Congruence 7, 14, 26, 56
Convolution 22, 27, 34
   circular 23, 30, 32, 43, 81, 107, 117, 151, 212
   skew-circular 174
   complex 52, 177, 227
Cook-Toom algorithm 27
Correlation 82, 117, 185
Cyclic convolution see circular convolution
Cyclotomic polynomials 30, 36, 37, 152, 157, 182

Decimation
   in frequency 89, 187
   in time 87
Diophantine equations 6, 8, 237
Discrete Fourier transform (DFT) 80, 112, 181
Distributed arithmetic 64
Division
   integers 4
   polynomials 25

Equivalence class 7
Euclid's algorithm 5, 8, 9, 27
Euler
   theorem 9, 13, 214, 215
   totient function 11, 30, 157, 213

Fast Fourier transform (FFT) 85
   computer program 95
Fermat
   number 19, 222, 223
   number transform (FNT) 222
   prime 21, 223
   theorem 13, 215, 217, 232, 234, 237
Field 25, 157
Finite impulse response filter (FIR) 33, 55, 113
Fourier transform see discrete Fourier transform

Galois field (GF) 25, 236
Gauss 15, 18, 236
Greatest common divisor (GCD) 5
Group 23

Identity 24
"In place" computation 94, 143
Interpolation see Lagrange interpolation
Irreducible polynomial 25, 157
Isomorphism 24

Lagrange interpolation 28, 37
Legendre symbol 18, 22, 237

Mersenne
   number 19, 216, 230
   prime 20
   transform 216
Modulus 7
Multidimensional
   convolution 45, 108, 151
   DFT 102, 141, 193
   polynomial transform 178
Mutually prime 5, 20, 21, 26, 43, 170, 201

Nesting 44, 57, 60, 134, 170, 194
Number theoretic transform (NTT) 211

One's complement 20, 227
Odd DFT 102, 187, 207
Order 14, 24
Overlap-add 29, 33, 55
Overlap-save 34

Parseval's theorem 83
Permutation 10, 43, 82, 125, 170, 183
Polynomial 22
   product 30, 34, 47
   transform 151
Prime 5
Prime factor FFT 125, 194
Primitive root 11, 21, 117, 202
Pseudo Fermat transform 234
Pseudo Mersenne transform 231

Quadratic
   non-residue 17, 22, 236
   residue 17, 115
Quantization error (FFT) 96, 142
Quotient 5

Rader algorithm 116, 129, 134, 202, 204, 205, 207
Rader-Brenner FFT 99, 190, 207, 208
Recursive 60, 114
Reduced DFT 102, 121, 182, 207
Relatively prime see mutually prime
Remainder 5
Residue 7
   class 7
   reduction 120, 152
   polynomial 25, 152, 163
Ring 24, 173, 196
Roundoff 96, 228
Roots 11, 14, 117, 156, 157, 161, 212, 237
Row-column method 102

Skew-circular convolution 174
Split nesting 47, 139
Split prime factor algorithm 129

Totient function see Euler totient function
Twiddle factors 86

Winograd 29, 280
   Fourier transform algorithm (WFTA) 133, 201, 202, 204, 205, 207