A Simple Fixed-Point Error Bound For The Fast Fourier Transform

IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-27, NO. 6, DECEMBER 1979 615

A Simple Fixed-Point Error Bound for the Fast Fourier Transform

WILLIAM R. KNIGHT AND R. KAISER

Abstract-Error bounds for the computation of the fast Fourier transform in fixed-point arithmetic are derived for any arithmetic number base and for any prime factorization of the data array length. The intended application is for signal processing with minicomputers. Errors arising from inaccurate sine coefficients and from limited arithmetic precision are considered. The arithmetic error depends essentially on shifts of the data array that may be required to avoid overflow of the computer word. Our closest bound requires knowledge of where shifts occur and is best computed in parallel with the Fourier transform. For the case that such program modification is not feasible, we derive an error bound for a posteriori calculation and an a priori error estimate. Our bounds are for the maximum error because little is gained at the expense of considerably greater complexity for probabilistic error bounds.

Manuscript received August 23, 1978; revised July 1, 1979.
W. R. Knight is with the Departments of Computer Science and Mathematics, University of New Brunswick, Fredericton, N.B., Canada.
R. Kaiser is with the Department of Physics, University of New Brunswick, Fredericton, N.B., Canada.

I. INTRODUCTION

AN increased awareness of the limitations of fixed-point arithmetic has come with the growing use of micro- and minicomputers for data acquisition and processing in physical and chemical instrumentation. The work reported here arose from an analysis of observations [1] in Fourier transform nuclear magnetic resonance spectroscopy where the fast Fourier transform algorithm is commonly implemented with fixed-point arithmetic. The limitation imposed by the finite computer word length becomes evident in the form of a restricted "dynamic range" and as computational "noise." These effects are caused by the need to truncate the product or to shift (divide by 2) the sum of two fixed-point numbers in order to fit the result into the computer word.

The error so arising from computing the fast Fourier transform (FFT) in fixed-point arithmetic has been analyzed previously. Welch [2] has analyzed the case characterized by rounded sign-magnitude binary arithmetic using a floating-block decimation-in-time radix-2 FFT algorithm and assuming data such that successive stages of computation are statistically independent. Oppenheim and Weinstein [3] analyze the case characterized by rounded binary arithmetic (sign-magnitude format seems to be assumed in some places) using a decimation-in-time radix-2 FFT algorithm with white noise as data causing either no shift or a shift at each stage. Tran-Thong and Liu [4] have treated all cases generated by rounded or truncated 2's-complement binary arithmetic using a floating-block decimation-in-time, or frequency, radix-2 FFT algorithm with data making successive stages statistically independent and requiring either no shift in any stage or one shift in each stage.

Our analysis follows Welch [2] in spirit but covers truncated, or rounded, complement or sign-magnitude arithmetic to any number base using floating-block decimation-in-time, or frequency, mixed-radix FFT algorithms with data requiring any possible number of scaling shifts. Following the tradition of Wilkinson [5], our bounds are worst case rather than probabilistic, thus avoiding the assumption that errors are independent or else the complications and uncertainties arising from correlation of error and signal. Surprisingly, the worst case differs from the probable error only by a multiplicative constant rather than by a factor proportional to the square root

0096-3518/79/1200-0615$00.75 © 1979 IEEE

Authorized licensed use limited to: Tsinghua University. Downloaded on March 30,2023 at 09:04:36 UTC from IEEE Xplore. Restrictions apply.

of the number of stages in the computation. On the other hand, statistical assumptions can assure us that the error is distributed uniformly over the elements of the output data vector while our worst case analysis does not prevent it from being concentrated in a single element.

In addition to the arithmetic error, we consider a "trigonometric error" that arises from lack of precision in the sine and cosine coefficients needed for the Fourier transform. These values can either be calculated from a series approximation to the sine function [6], or they can be stored in a table. In the first case, the error arises from both the computer arithmetic and from truncation of the series. In the second case, interpolation between tabulated values becomes necessary for sizeable data arrays occurring in practice. Interpolation can be linear [7], in which case most of the trigonometric error arises from the chord approximation to the sine curve, or by means of the nonlinear $\sin(\alpha + \beta)$ addition theorem [8], in which case it arises from the interpolation arithmetic.

For binary arithmetic and for a complex data vector of length $N = 2^n$, our results can be summarized as follows. We let $x_0$ be the vector carrying the input data whose Fourier transform is to be computed. The fast Fourier transform algorithm proceeds through $n$ stages, successively replacing the input vector $x_0$ by $x_1, x_2, \ldots, x_n$, the last vector being the desired transform

$$x_n(f) = \sum_{t=0}^{N-1} x_0(t) \exp(-2\pi i f t / N),$$

except for scaling as needed to avoid overflow of the computer word and except for computation noise. The total noise is bounded by the sum of the trigonometric noise plus the arithmetic noise.

The trigonometric noise/signal ratio is bounded by $2\sqrt{2}\, n \epsilon$, where $\epsilon$ bounds the absolute error in the sine and cosine values.

The arithmetic noise/signal ratio is bounded by $r_n u / \|x_0\|$, where $u$ is the value of the least positive representable number in units of the input signal $x_0$, and the double bars represent the norm which we take as the rms value $\|x\| = \left(\sum_{i=0}^{N-1} |x(i)|^2 / N\right)^{1/2}$. The normalized arithmetic error $r_n$ is bounded by the sum $r_n \le 3.81 \sum_{k=1}^{n} 2^{s_k - (k/2)}$. The constant 3.81 is derived for truncating arithmetic; a smaller value holds for rounding arithmetic. The quantity $2^{s_k}$ is the scale factor at the end of the $k$th stage, i.e., $s_k$ is the number of scaling right shifts accumulated to the end of the $k$th stage as required to avoid overflow of the computer word.

The computation of $r_n$ can easily be included in the FFT program, accumulating the $k$th summand at the $k$th stage of computation. However, failing this, we derive the upper bound $r_n < 26.4 \times 2^{(s_n+1)/2}$.

Our bound on the trigonometric noise/signal ratio is a priori; it can be evaluated before computation starts. However, our bound on the arithmetic noise/signal ratio is a posteriori because its evaluation requires knowledge of the scaling shifts, which knowledge becomes available only as computation proceeds. Since an a priori bound for the noise/signal ratio is highly desirable,¹ we can combine the last expression above with an estimate of the total number of required shifts $s_n = (n/2) + \log_2(\rho_n/\rho_0) < n + 1$. In this expression, $\rho_0$ and $\rho_n$ are the peak/rms ratios of the input and output signals $x_0$ and $x_n$, respectively. Strictly, the spectrum ratio is known a posteriori only, but sufficient information about the signal spectrum will often be available to permit a reasonable a priori estimate.

¹We gratefully acknowledge the comments of a referee who impressed upon us the desirability of an a priori error estimate.

The following Sections II and III show the derivation of the bounds on the arithmetic noise/signal ratio for any base $b$ of computer arithmetic and for algorithms of mixed radix. Derivation of the error bound for individual arithmetic operations is delegated to Appendix I. The trigonometric error is dealt with in Section IV, and Section V gives a brief outline of difficulties with statistical error analysis.

When the input data are real rather than complex, the FFT algorithm may be adapted to take advantage of the redundancy that arises from the vanishing imaginary part of the input vector. Appendix II shows that our error analysis adapts easily to this case.

II. ARITHMETIC ERROR AND GENERAL CONSIDERATIONS

The recursion leading from one stage to the next in the FFT algorithm is

$$x_k = F_k x_{k-1} \tag{1}$$

where $F_k$ is an $N \times N$ matrix of simple structure [9], [10]. Although the details differ for different versions of the algorithm, the $F_k$ are unitary matrices for theoretical discussions. For practical computations, complex numbers are represented by real and imaginary parts. Complex vectors of length $N$ then become real vectors of length $2N$, and the $F_k$ become $2N \times 2N$ real orthogonal matrices. For fixed-point computations, considerations of scaling and the desire for simple multipliers intervene, and each $F_k$ is some multiple of an orthogonal $2N \times 2N$ matrix. Instead of Parseval's theorem, we then have, at any stage $k$,

$$\|x_k\| = \|F_k\| \, \|F_{k-1}\| \cdots \|F_1\| \cdot \|x_0\|, \tag{2}$$

where $\|F\|$ indicates the spectral norm [11] of the matrix $F$, $\|F\| = \max_{\|x\|=1} \|Fx\|$.

In finite precision arithmetic, the matrix × vector product will generate an error vector, say $d_k$, and the computed recursion really is

$$\hat{x}_k = F_k \hat{x}_{k-1} + d_k. \tag{3}$$

Beginning with $\hat{x}_1$, the computed data vectors $\hat{x}_k$ will differ from (1) by an error vector $e_k$ which builds up according to the recursion

$$e_k = F_k e_{k-1} + d_k; \quad e_0 = 0. \tag{4}$$

By taking norms on both sides, we get by the triangle inequality

$$\|e_k\| \le \|F_k\| \cdot \|e_{k-1}\| + \|d_k\|; \quad \|e_0\| = 0. \tag{5}$$
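The error recursion (3)-(5) can be illustrated with a small numerical experiment. The following is only a sketch under stated assumptions, not the FFT itself: each stage matrix is taken to be $\sqrt{2}$ times a random orthogonal matrix (rotations of randomly permuted pairs), and the stage error $d_k$ is a random vector with elements bounded by $c_k u$. The names `rms`, `apply_stage`, and `simulate`, the word-length unit `u = 2**-15`, and the seed are all hypothetical choices made for the illustration.

```python
import math
import random

def rms(v):
    """rms norm ||v|| = sqrt(sum |v_i|^2 / len(v)), as used in the paper."""
    return math.sqrt(sum(t * t for t in v) / len(v))

def apply_stage(v, perm, angles, scale):
    """Apply scale * (rotations of permuted pairs): a multiple of an orthogonal matrix."""
    w = [v[j] for j in perm]
    out = []
    for i, theta in enumerate(angles):
        c, s = math.cos(theta), math.sin(theta)
        a, b = w[2 * i], w[2 * i + 1]
        out.append(scale * (c * a - s * b))
        out.append(scale * (s * a + c * b))
    return out

def simulate(stages=6, length=16, u=2.0 ** -15, c_k=3.81, seed=1):
    """Return (k, ||e_k||, right side of (5), noise/signal ratio) per stage."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(length)]  # exact x_0
    x_hat = list(x)                                      # computed vector, e_0 = 0
    scale = math.sqrt(2.0)                               # assumed ||F_k||, no scaling shift
    e_prev, history = 0.0, []
    for k in range(1, stages + 1):
        perm = list(range(length))
        rng.shuffle(perm)
        angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(length // 2)]
        # Random stand-in for the arithmetic error vector d_k (assumption).
        d = [rng.uniform(-c_k * u, c_k * u) for _ in range(length)]
        x = apply_stage(x, perm, angles, scale)                      # recursion (1)
        x_hat = [p + q for p, q in
                 zip(apply_stage(x_hat, perm, angles, scale), d)]    # recursion (3)
        e_norm = rms([p - q for p, q in zip(x_hat, x)])              # ||e_k||
        bound = scale * e_prev + rms(d)                              # right side of (5)
        history.append((k, e_norm, bound, e_norm / rms(x)))
        e_prev = e_norm
    return history

for k, e_norm, bound, ratio in simulate():
    # A small relative tolerance absorbs floating-point roundoff in the check.
    print(k, e_norm <= bound * (1 + 1e-9), ratio)
```

Because the stand-in errors are random rather than worst case, the observed $\|e_k\|$ will usually sit well below the right-hand side of (5); the experiment only shows that the inequality is respected stage by stage.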

Dividing (5) by the norm of (1), $\|x_k\| = \|F_k\| \, \|x_{k-1}\|$, gives a bound for the propagation of the noise/signal ratio $R_k = \|e_k\| / \|x_k\|$:

$$R_k \le R_{k-1} + \|d_k\| / \|x_k\|; \quad R_0 = 0. \tag{6}$$

The recursion (6) establishes our basic bound on the arithmetic noise/signal ratio. Evaluation of this recursion depends on particular features of the computation and is the subject of the next section.

III. ARITHMETIC ERROR AND PARTICULAR CONSIDERATIONS

The $k$th stage deals with the prime factor $p_k$ in the factorization

$$N = p_1 p_2 \cdots p_n. \tag{7}$$

In the complex domain, each element of the new vector $x_k$ is the weighted sum of $p_k$ elements of the old vector $x_{k-1}$, the weights being the "twiddle factors" $\exp(i\theta)$. The algorithm represents complex numbers by their real and imaginary parts, and the $2N \times 2N$ matrix $F_k$ has, therefore, $p_k$ sines and $p_k$ cosines of $p_k$ angles $\theta$ in each row, all other elements being zero. The spectral norm is $\|F_k\| = \sqrt{p_k}$, i.e., the rms value of the data vector $\|x_k\|$ increases by a factor $\sqrt{p_k}$ in the $k$th stage.

This increase may make it necessary to shift the data array one (or more) computer digit, i.e., to divide by the number base $b$ of the computer arithmetic in order to avoid overflow of the computer word. We let $s_k$ be the number of shifts accumulated up to the end of the $k$th stage. Incorporation of the shift into the matrix $F_k$ gives then $\|F_k\| = b^{-(s_k - s_{k-1})} \sqrt{p_k}$. Accumulation of these scale factors in (2) gives us, for the rms value of the data array at the end of the $k$th stage,

$$\|x_k\| = b^{-s_k} \sqrt{N_k} \, \|x_0\| \tag{8}$$

with the abbreviation $N_k = p_1 p_2 \cdots p_k$.

The elements of the error vector $d_k$ will be small multiples of the least positive number representable in the computer word format. We derive in Appendix I an upper bound $c_k$ for this multiple at stage $k$, and we let $u$ be the value of the least positive computer number in units of the input data $x_0$, so that $\|d_k\| \le c_k u$. The recursion (6) thereby becomes

$$R_k \le R_{k-1} + c_k b^{s_k} u / (\|x_0\| \sqrt{N_k}); \quad R_0 = 0. \tag{9}$$

In terms of the normalized noise/signal ratio $r_k = R_k \|x_0\| / u$, it evaluates to

$$r_k \le r_{k-1} + c_k b^{s_k} / \sqrt{N_k}; \quad r_0 = 0. \tag{10}$$

This is our basic bound on the arithmetic noise/signal ratio which was summarized in Section I for binary arithmetic and $N = 2^n$.

In order to evaluate (10) other than in parallel with the computation of the Fourier transform, we need to bound $b^{s_k} / \sqrt{N_k}$. There are, in fact, two bounds available. The first follows simply from the fact that $s_k$, the number of shifts at the end of the $k$th stage, cannot exceed $s_n$, the total number of shifts at the end of the computation, thus

$$b^{s_k} / \sqrt{N_k} \le b^{s_n} / \sqrt{N_k}. \tag{11}$$

To obtain the second bound, we separate the real and imaginary parts of the vectors $x_k$ and treat each $x_k$ as a real $2N$ vector. Each element of $x_k$ is a weighted and scaled sum of $2N_k$ elements of $x_0$, and the sum of the absolute values of the weights is, at most, $N_k \sqrt{2}$. We consider $\xi_k$, the largest absolute value of any element of $x_k$, and obtain

$$\xi_k \le \xi_0 N_k b^{-s_k} \sqrt{2}. \tag{12}$$

Moreover, to make full use of the computer word, $\xi_k / \xi_0$ must be held within the limits

$$b^{-1} < \xi_k / \xi_0 < b. \tag{13}$$

Substitution from (12) gives the second bound

$$b^{s_k} / \sqrt{N_k} < b \sqrt{2 N_k}. \tag{14}$$

Bound (14) increases with increasing $N_k$, while bound (11) decreases. The two bounds are equal for

$$N_* = b^{s_n - 1} / \sqrt{2}, \tag{15}$$

and we let $m$ be the least positive integer for which $N_m > N_*$. For $k < m$, we use bound (14) and get

$$b^{s_k} / \sqrt{N_k} < b \sqrt{2 N_k} = b^{(s_n+1)/2} \, 2^{1/4} \sqrt{N_k / N_*}; \quad k < m. \tag{16a}$$

For $k \ge m$, we use bound (11) and get

$$b^{s_k} / \sqrt{N_k} = b^{s_k - (s_n - 1)/2} \, 2^{1/4} \sqrt{N_* / N_k} \le b^{(s_n+1)/2} \, 2^{1/4} \sqrt{N_* / N_k}; \quad k \ge m. \tag{16b}$$

We are now in a position to bound the sum (10). With $c$ being the largest of the $c_k$, application of (16a) and (16b) gives

$$r_n \le c \, b^{(s_n+1)/2} \, 2^{1/4} \left( \sum_{k=1}^{m-1} \sqrt{N_k / N_*} + \sum_{k=m}^{n} \sqrt{N_* / N_k} \right). \tag{17}$$

Formula (17) is convex in $\sqrt{N_*}$ and takes its maximum for either $N_* = N_{m-1}$ or for $N_* = N_m$ at the extremes of the range of $N_*$. In the first case, the parentheses in (17) has the maximum value

$$\left( \cdots + \frac{1}{\sqrt{p_{m-2}\, p_{m-1}}} + \frac{1}{\sqrt{p_{m-1}}} + 1 + \frac{1}{\sqrt{p_m}} + \frac{1}{\sqrt{p_m\, p_{m+1}}} + \cdots \right) \tag{18a}$$

and in the second case its maximum is

$$\left( \cdots + \frac{1}{\sqrt{p_{m-1}\, p_m}} + \frac{1}{\sqrt{p_m}} + 1 + \frac{1}{\sqrt{p_{m+1}}} + \frac{1}{\sqrt{p_{m+1}\, p_{m+2}}} + \cdots \right). \tag{18b}$$
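As a rough numerical illustration of this section, the sketch below accumulates the running bound (10) from a shift history and compares it with the sum obtained by replacing each summand with the smaller of bounds (11) and (14), and with the binary a posteriori figure $26.4 \times 2^{(s_n+1)/2}$ quoted in Section I. The function names and the example shift history (one shift every second stage for $N = 2^{10}$) are assumptions chosen for illustration; the constant $c = 3.81$ is the value quoted in Section I for truncating binary arithmetic.

```python
import math

def running_bound(shifts, primes, c, b=2):
    """Accumulate (10): r_k = r_{k-1} + c * b**s_k / sqrt(N_k), with r_0 = 0."""
    r, n_k = 0.0, 1
    for s_k, p_k in zip(shifts, primes):
        n_k *= p_k                      # N_k = p_1 * p_2 * ... * p_k
        r += c * b ** s_k / math.sqrt(n_k)
    return r

def envelope(s_n, primes, c, b=2):
    """Sum over stages of the smaller of bound (11) and bound (14), times c."""
    total, n_k = 0.0, 1
    for p_k in primes:
        n_k *= p_k
        bound_11 = b ** s_n / math.sqrt(n_k)   # uses only the final shift count
        bound_14 = b * math.sqrt(2 * n_k)      # from full use of the computer word
        total += min(bound_11, bound_14)
    return c * total

def a_posteriori_binary(s_n):
    """Binary-arithmetic bound quoted in Section I: 26.4 * 2**((s_n + 1) / 2)."""
    return 26.4 * 2 ** ((s_n + 1) / 2)

# Hypothetical example: N = 2**10, one scaling shift every second stage.
primes = [2] * 10
shifts = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
r_run = running_bound(shifts, primes, c=3.81)
r_env = envelope(shifts[-1], primes, c=3.81)
r_post = a_posteriori_binary(shifts[-1])
print(r_run, r_env, r_post)
```

For this shift history the three figures are ordered as expected, with the running bound the tightest. This reflects the remark in Section I that the closest bound is the one accumulated in parallel with the transform.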
Since 2 is the smallest prime, the reciprocal square roots in (18a) and (18b) are largest when every $p_k = 2$, so that both (18a) and (18b) are bounded by

$$\left( \cdots + 2^{-3/2} + 2^{-2/2} + 2^{-1/2} + 1 + 2^{-1/2} + 2^{-2/2} + 2^{-3/2} + \cdots \right). \tag{18c}$$

The series with quotients $2^{-1/2}$ can be summed to infinity in both directions, and the result is

$$r_n \le c \, b^{(s_n+1)/2} \, 2^{1/4} \, (3 + 2\sqrt{2}). \tag{19}$$

This is the a posteriori bound which was summarized in Section I for binary arithmetic; with $b = 2$ and $c = 3.81$ the constant $c \, 2^{1/4} (3 + 2\sqrt{2})$ evaluates to 26.4.

It remains to obtain an a priori estimate of the total number of scaling shifts $s_n$. If, for comparing peak/rms ratios $\rho$ in the input and output data vectors, we divide (13) by (8) with $k = n$, the result is

$$b^{-1} < (\rho_n / \rho_0) \, b^{-s_n} \sqrt{N} < b \tag{20}$$

and hence

$$\log_b(\sqrt{N} \rho_n / \rho_0) - 1 < s_n < \log_b(\sqrt{N} \rho_n / \rho_0) + 1. \tag{21}$$

In Section I we have specialized the mean value between these two bounds for $b = 2$ and $N = 2^n$ as an a priori estimate for the number of shifts.

IV. TRIGONOMETRIC ERROR

We have explained in Section I why the trigonometric elements of the matrices $F_k$ will not always be exact. Instead of the correct matrices $F_k$, slightly different matrices $F_k + \Delta F_k$ will be used, and instead of the desired transform $x_n = F_n F_{n-1} \cdots F_1 x_0$ there results

$$x_n + \Delta x_n = (F_n + \Delta F_n)(F_{n-1} + \Delta F_{n-1}) \cdots (F_1 + \Delta F_1) x_0. \tag{22}$$

To first order, the trigonometric error vector is

$$\Delta x_n = \sum_{k=1}^{n} F_n \cdots F_{k+1} \, \Delta F_k \, F_{k-1} \cdots F_1 x_0. \tag{23}$$

Taking norms and dividing by (2) with $k = n$ bounds the trigonometric noise/signal ratio by

$$\|\Delta x_n\| / \|x_n\| \le \sum_{k=1}^{n} \|\Delta F_k\| / \|F_k\|. \tag{24}$$

To evaluate $\|\Delta F_k\|$, we note that each row of $F_k$ holds $p_k$ sine-cosine pairs, so that each row of $\Delta F_k$ holds, at most, $2p_k$ elements different from zero and, if there is no scaling shift, each of these is bounded by $\epsilon$, the maximum absolute error in the sine and cosine values. By permutation of rows and columns, $\Delta F_k$ can be turned into block diagonal form holding $N/p_k$ blocks of dimension $2p_k \times 2p_k$ each. It is easy to see [5, ch. III] that for such a matrix of blocks filled with $\epsilon$'s, the spectral norm is $\|\Delta F_k\| \le 2 p_k \epsilon$. We had seen earlier that, in the absence of scaling shifts, $\|F_k\| = \sqrt{p_k}$, so that (24) evaluates to

$$\|\Delta x_n\| / \|x_n\| \le 2 \epsilon \sum_{k=1}^{n} \sqrt{p_k}. \tag{25}$$

Scaling shifts multiply both $\|\Delta F_k\|$ and $\|F_k\|$ by the same factor which cancels in the ratio (24) so that (25) becomes our final bound on the trigonometric noise/signal ratio as summarized in Section I for $N = 2^n$.

V. STOCHASTIC ANALYSIS

We begin by observing that the errors arising from inexact trigonometric function values do not allow stochastic analysis. The same inexact values are used repeatedly in the computation; the errors are not statistically independent and a stochastic analysis is not appropriate.

For the arithmetic errors, the assumption of statistical independence seems more reasonable, and both the mean error and the variance are needed to arrive at the rms error. One would hope that the mean error, i.e., the bias in the Fourier transformed data, is zero so that only the variance will be needed, but this is not the case. The next two paragraphs explain our arguments.

The vector of mean errors $E(e_k)$ is obtained by taking statistical expectation values of the elements of the error vector $e_k$ in (4). For its norm, (5) translates into

$$\|E(e_k)\| \le \|F_k\| \cdot \|E(e_{k-1})\| + \|E(d_k)\|. \tag{26}$$

The mean error propagates through the computation in the same way as the maximum error, and our results remain valid for the mean error after replacement of the arithmetic error vector $d_k$ by its expectation $E(d_k)$. At the end of the computation, bound (26) can vanish only if $E(d_k)$ is a null vector for all $k$.

The errors generated in multiplication and shift operations depend on whether truncation or rounding is used, and on whether negative numbers are represented in complement or sign-magnitude format, but in no case can the mean error be relied upon to vanish. Consider first the case of truncation. In complement format, truncation reduces the value of positive as well as negative numbers; the error distribution is uniform from $-1$ to $0$ with expectation $-\frac{1}{2}$. In sign-magnitude format, truncation reduces the absolute value of a number; the error distribution is uniform from $-1$ to $0$ for positive numbers, and from $0$ to $+1$ for negative numbers. The expected error is $\pm\frac{1}{2}$ with the sign of the error correlated with the sign of the data. It does not appear safe to neglect this correlation in order to get zero mean error. For rounding, the expected error is generally smaller than for truncation, yet still does not necessarily vanish. For illustration, consider the error arising from rounding after a one-bit right shift in binary arithmetic. Its distribution concentrates at three values: if the bit shifted out was a zero, there is no error. If it was a 1, the error is $\pm\frac{1}{2}$ with the sign depending on whether the number is rounded up or down. The usual rules to decide between rounding up or down are based on the sign of the number or on whether it is even or odd, and thus introduce again a correlation between errors and data. This problem arises for any even-$b$ arithmetic.

For any arithmetic we have considered, the mean error does not exceed half the maximum error. In fact, bounding the mean error by half the maximum error is not at all pessimistic for truncating two's-complement arithmetic, which appears to be most common at present for fixed-point FFT computations. Since, generally, the mean error can amount to a sizeable fraction of the maximum error, there is little merit in calculating the variance, although our analysis method can easily be applied to such calculation. Quite generally, we feel that there is little gained at the expense of greater complexity and uncertainty in calculating probabilistic error bounds for the fixed-point fast Fourier transform.

APPENDIX I
MACHINE ERROR

We derive a bound on the elements of the error vector $d_k$ that arises in the computations (3) of the $k$th stage. The units of $d_k$ are those of the least significant computer digit in the elements of $x_k$. There are some differences between algorithms, and also for the first or last stage in a given algorithm, but we will neglect these special cases. The error in these cases is smaller than the worst case bound derived in the following but, except for very short data arrays, the overall reduction amounts to only a small factor. Also, since truncation is more simply programmed and produces greater error than rounding, truncation will be assumed even though it generates a bias in complement arithmetic.

In a typical stage, there are at most $2p_k$ multiplications to compute one element of the new vector $x_k$ from $2p_k$ elements of the old $x_{k-1}$. Each multiplication introduces a truncation error of, at most, one unit. If no shifting takes place, the error in the new element is bounded by $2p_k$ units.

If shifting is required, a natural program arrangement is this: upon detection of an overflow, the current "butterfly" is abandoned and the entire data array is shifted one digit, corresponding to a truncated division by $b$. The effect is that those elements of the array that had already been computed before the overflow was detected (and belong to $x_k$) are shifted after multiplication, while those still to be operated upon (and belonging to $x_{k-1}$), including those of the current butterfly, are shifted before multiplication. The effect of the shift is different for these two sets of elements and, since there may be several shifts per stage, we look at the general case of an element which is shifted, possibly several times, before and after entering the butterfly.

The shifts before butterfly computation introduce an error bounded by one unit into each element remaining in $x_{k-1}$. The weighted sum of $p_k$ pairs of these gives a new element for $x_k$. The weighting is by a sine and cosine multiplier for each pair, resulting in an error bounded by $|\sin\theta| + |\cos\theta| \le \sqrt{2}$ from each pair. The $p_k$ pairs thus introduce an error of, at most, $p_k \sqrt{2}$ into each new element of $x_k$. In addition, there is the multiplication error bounded by $2p_k$ units, giving a total error bounded by $(2 + \sqrt{2}) p_k$ units.

Any number of shifts after computation introduce an error of, at most, one unit, and each such shift divides the previous error by $b$. For $a$ shifts after the butterfly computation, the error is thus bounded by $(2 + \sqrt{2}) p_k b^{-a} + 1$. This is dominated by the case of no shifts after butterfly computation, and we take $c_k = (2 + \sqrt{2}) p_k$.

For the case $b = 2$ and $N = 2^n$, the bound can be sharpened a bit by attention to detail. At most two shifts can occur in any one stage in this case. At worst, both shifts occur before the butterfly computation and generate an error bounded by $\frac{3}{4}$ in the shifted elements of $x_{k-1}$. In the butterfly, a new element of $x_k$ is computed as the sum of a sine-cosine-weighted pair of elements of $x_{k-1}$, plus one unmodified element of $x_{k-1}$. (This applies directly in some algorithms. In others, it applies on average in the sense that half the elements of $x_k$ arise each from two weighted pairs and the remaining half arise each from two unmodified elements of $x_{k-1}$.) The shift error is thus magnified to $\frac{3}{4}(1 + |\sin\theta| + |\cos\theta|) \le \frac{3}{4}(1 + \sqrt{2})$. In addition, there is the error from truncating the two products, bounded by 1 for each product. The resulting total is $c_k = \frac{3}{4}(1 + \sqrt{2}) + 2 = 3.81$. This value has been used in the summary of our results in Section I.

APPENDIX II
REAL INPUT DATA

Of course, it is straightforward to treat real input data by simply setting the imaginary part of $x_0$ to zero and applying the FFT algorithm unchanged. Our error analysis does not require modification for this case. Although this method is straightforward, it suffers from redundancy. In the input array, the imaginary part is filled with zeros, and in the output array every element is duplicated in conjugate complex form. Two modifications of the FFT algorithm have been developed for real input data to permit the length $N$ of all complex vectors in the transform to be halved when $N$ is even. We do not consider the complications that arise with these methods when $N$ is odd.

In the first of these modifications [12], a complex input vector $x_0$ of length $N/2$ is formed by placing the $N$ real input elements alternately into real and imaginary parts, so that the real part of $x_0$ holds the even subscripted input samples and the imaginary part holds the odd subscripted ones. The complex FFT algorithm is then applied to $x_0$ and produces its transform in $n - 1$ stages corresponding to the prime factorization of $N/2$. One additional stage is then required to allow for the shifting of the odd subscripted input data. This last stage consists of a sum and difference butterfly followed by a regular radix-2 butterfly. Our error analysis is unchanged for the $n - 1$ stages. For the last stage, we obtain $\|F_n\| = 2$ and $c_n \le 5.41$ (for binary arithmetic $c_n \le 4.56$). Equation (6) shows that the contribution of the last stage to the arithmetic error is determined by $c_n / \|F_n\|$, and this ratio is less than that for a regular $p = 2$ stage. It is thus conservative to consider this last stage as a regular $p = 2$ stage so that our error bound with $N$ designating the length of the real input vector applies unchanged to the modified algorithm. Our trigonometric error bound is also unchanged.

The second modification [13] involves an adaptation of the algorithm such that the transform is obtained in $n - 1$

stages corresponding to the prime factorization of $N/2$. Our error analysis applies unchanged to these stages. Strictly, there is a remnant of the $n$th stage in the form of a single sum and difference butterfly to compute the two real spectral points at zero and at maximum frequency. If the error generated in this extra butterfly is neglected, our results apply with $N$ being only half the length of the real input data array.

REFERENCES

[1] J. W. Cooper, J. Magnetic Resonance, vol. 22, p. 345, 1976.
[2] P. D. Welch, IEEE Trans. Audio Electroacoust., vol. AU-17, p. 151, 1969; also in [9].
[3] A. V. Oppenheim and C. J. Weinstein, Proc. IEEE, vol. 60, p. 957, 1972.
[4] Tran-Thong and B. Liu, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, p. 563, 1976.
[5] J. H. Wilkinson, Rounding Errors in Algebraic Processes. Englewood Cliffs, NJ: Prentice-Hall, 1963.
[6] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, U.S. Government Rep. NBS-AMS55, 1964.
[7] J. W. Cooper, I. S. Mackay, and G. B. Powle, J. Magnetic Resonance, vol. 28, p. 405, 1977.
[8] J. A. den Hollander, Rijksuniversiteit Leiden, The Netherlands, personal communication.
[9] B. Liu, Ed., "Digital filters and the fast Fourier transform," Benchmark Papers in Electrical Engineering and Computer Science, vol. 12. Stroudsburg, PA: Dowden, Hutchinson & Ross, 1975.
[10] F. Theilheimer, IEEE Trans. Audio Electroacoust., vol. AU-17, p. 158, 1969; E. O. Brigham, The Fast Fourier Transform. Englewood Cliffs, NJ: Prentice-Hall, 1974.
[11] B. Noble, Applied Linear Algebra. Englewood Cliffs, NJ: Prentice-Hall, 1969.
[12] C. Bingham, M. D. Godfrey, and J. W. Tukey, IEEE Trans. Audio Electroacoust., vol. AU-15, p. 56, 1967; J. W. Cooley, P. A. W. Lewis, and P. D. Welch, IBM Res. Paper RC-1743, 1967; J. Sound Vib., vol. 12, p. 315, 1970.
[13] G. D. Bergland, Commun. Ass. Comput. Mach., vol. 11, p. 703, 1968; IEEE Trans. Audio Electroacoust., vol. AU-17, p. 138, 1969.
