D. J. Bernstein
Fast arithmetic
djbfft
Frequently asked questions
Can I use djbfft in my own code?
Yes.
Please tell me what programs you're using it in so that I can let NSF know.
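Here is a minimal usage sketch.
It is not copied from the djbfft documentation:
it assumes the fftc8.h header, a complex8 type with double-precision re and im fields,
and the in-place routines fftc8_256 (forward) and fftc8_un256 (unscaled inverse);
check the headers and man pages in your copy of djbfft for the exact declarations.

    /* Hypothetical sketch of a size-256 complex transform and its inverse.
       Assumes fftc8.h declares complex8 (two doubles, re and im),
       fftc8_256 (in-place forward transform, output in djbfft's own order),
       and fftc8_un256 (in-place inverse, scaled by 256). */
    #include <fftc8.h>

    complex8 a[256];

    void demo(void)
    {
      int i;
      for (i = 0; i < 256; ++i) { a[i].re = i; a[i].im = 0; }
      fftc8_256(a);      /* forward transform */
      fftc8_un256(a);    /* inverse transform, times 256 */
      for (i = 0; i < 256; ++i) { a[i].re /= 256; a[i].im /= 256; }
    }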
Is djbfft faster than FFTW?
UltraSPARC-I, UltraSPARC-II:
djbfft 0.76 has been tuned for these chips.
It is substantially faster than FFTW 2.
For example,
djbfft computes a size-256 complex FFT in about 6300 cycles;
FFTW needs more than 10000 cycles,
according to the UltraSPARC-I/167 results
reported by the FFTW authors on the FFTW web pages.
(Web pages checked in September 1999.)
Pentium, Pentium MMX:
djbfft 0.76 has been carefully tuned for these chips.
It is more than twice as fast as FFTW 2.
Pentium Pro, Pentium II, Pentium III:
djbfft 0.76 has been partially tuned for these chips.
It is noticeably faster than FFTW 2.
PowerPC 603, PowerPC 604:
djbfft 0.76 has not been tuned for these chips.
Nevertheless it is as fast as FFTW 2.
Alpha 21264:
djbfft 0.76 has not been tuned for this chip.
Nevertheless it appears to be faster than FFTW 2.
For example,
on a 21264-500,
djbfft 0.76 computes a size-256 complex FFT in under 6400 cycles,
while FFTW 2 needs more than 7300 cycles.
PA-8200, PA-8500:
djbfft 0.76 has not been tuned for these chips.
Nevertheless it appears to be faster than FFTW 2.
For example,
on an 8200-240 under HP-UX B.11.0 with cc +O3 +Oall,
djbfft 0.76 computes a size-256 complex FFT in about 4800 cycles,
while FFTW 2 needs more than 5400 cycles.
Why is djbfft so fast?
It's easier to answer the opposite question:
why are other FFT libraries so slow?
There are several common problems:
- Arithmetic operations.
Many libraries use a radix-2 FFT,
performing about (5k-10)2^k floating-point operations
for a size-2^k transform.
djbfft uses a split-radix-2/4 FFT,
performing about (4k-6)2^k floating-point operations.
(You can think of split-radix 2/4
as the limit of radix 2, radix 4, radix 8, etc.)
A worked comparison of these counts appears after this list.
- Root computation.
Some libraries compute roots of 1 on the fly.
djbfft uses a precomputed root table.
- Root loads.
Each step of a standard split-radix FFT
involves a root of 1 and the cube of that root.
djbfft uses the inverse instead of the cube,
saving half the loads.
(I spotted this improvement in September 1997;
I call it "exponent -1 split-radix."
It changes the output order,
but that doesn't matter for convolution.)
A sketch of this trick appears after this list.
- Memory use.
Many libraries use one or more temporary arrays as large as the input array,
missing the cache unnecessarily.
djbfft carries out all computations in place.
- Cache thrashing.
Most libraries use an iterative FFT,
passing repeatedly over the entire input array,
missing the cache at every step for large transforms.
djbfft uses a recursive decimation-in-frequency FFT;
a sketch of this recursive structure appears after this list.
(The cache-friendliness of this type of FFT
was pointed out by Singleton in 1967.)
- Cache associativity.
The UltraSPARC, for example,
can hold 512 32-byte lines of memory in its level-1 data cache,
but it can't hold two lines whose addresses are congruent modulo 16384.
This produces seemingly random slowdowns in many programs.
djbfft stores its root tables consecutively in memory
so that users can easily move their arrays to safe portions of the cache.
- Unaligned data.
The Pentium, Pentium II, etc. need extra time to access
8-byte floating-point variables that are not aligned to 8-byte boundaries.
This produces seemingly random slowdowns in many floating-point programs
(including FFTW
before I explained this problem to the FFTW authors in December 1997).
djbfft automatically uses
gcc's -malign-double option
to align 8-byte floating-point variables in the data segment,
and it generally avoids storing floating-point variables on the stack.
- Floating-point latency.
On an UltraSPARC, for example,
a floating-point addition takes just one cycle,
but the result cannot be used during the next two cycles.
Most FFT libraries try to use the result immediately,
and end up pausing for two cycles.
The instructions in djbfft are scheduled
for both the Pentium and the UltraSPARC
to avoid most of these delays without any register spills.
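For concreteness, here is a small C program (not part of djbfft)
that evaluates the two approximate operation counts quoted above.

    /* Compare the approximate real-operation counts quoted above for a
       size-2^k complex transform:
       radix-2:          about (5k-10)*2^k operations;
       split-radix-2/4:  about (4k-6)*2^k operations.
       These formulas are estimates, not exact counts. */
    #include <stdio.h>

    int main(void)
    {
      int k;
      for (k = 4; k <= 12; ++k) {
        long n = 1L << k;
        printf("size %5ld: radix-2 about %6ld ops, split-radix about %6ld ops\n",
               n, (5L * k - 10) * n, (4L * k - 6) * n);
      }
      return 0;
    }

For size 256 (k = 8), the estimates are about 7680 operations versus about 6656.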
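Here is a sketch of the root-load saving, showing only the twiddle multiplications,
not djbfft's actual butterfly code:
since every root of 1 has magnitude 1, its inverse is its complex conjugate,
so one table entry serves where a standard split-radix step would load two.

    /* Illustration of the "exponent -1" idea, not djbfft code.
       A standard split-radix step multiplies one value by w^j and another
       by w^(3j), loading two table entries; multiplying by w^(-j) instead
       of w^(3j) reuses the entry for w^j, because the inverse of a root
       of 1 is its conjugate.  The output order changes, which is harmless
       for convolution. */
    typedef struct { double re; double im; } cplx;

    static cplx cmul(cplx a, cplx b)
    {
      cplx r;
      r.re = a.re * b.re - a.im * b.im;
      r.im = a.re * b.im + a.im * b.re;
      return r;
    }

    static void twiddle_pair(cplx *x, cplx *y, cplx w)
    {
      cplx winv;
      winv.re = w.re;       /* w^(-1) = conjugate of w: no second table load */
      winv.im = -w.im;
      *x = cmul(*x, w);     /* plays the role of the multiplication by w^j */
      *y = cmul(*y, winv);  /* replaces the usual multiplication by w^(3j) */
    }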
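And here is a sketch of a recursive decimation-in-frequency FFT.
It is plain radix 2, with roots computed on the fly, so it is much simpler than djbfft's code;
the point is the recursive structure:
each call works on a contiguous half of the array,
so once a subproblem fits in cache it stays there.

    /* Recursive radix-2 decimation-in-frequency FFT (illustration only).
       Each level combines a[j] with a[j+n/2], then recurses on the two
       contiguous halves.  Output ends up in bit-reversed order, which is
       harmless for convolution.  Unlike djbfft, this computes roots on
       the fly and is not split-radix. */
    #include <complex.h>

    static void dif_fft(double complex *a, unsigned long n)
    {
      const double pi = 3.14159265358979323846;
      unsigned long j;
      if (n < 2) return;
      for (j = 0; j < n / 2; ++j) {
        double complex w = cexp(-2 * pi * I * (double) j / (double) n);
        double complex t = a[j] - a[j + n / 2];
        a[j] += a[j + n / 2];
        a[j + n / 2] = t * w;       /* twiddle goes to the odd-frequency half */
      }
      dif_fft(a, n / 2);            /* even-indexed outputs (bit-reversed) */
      dif_fft(a + n / 2, n / 2);    /* odd-indexed outputs (bit-reversed) */
    }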
Don't modern compilers take care of instruction scheduling?
Sure, if you don't care about speed.
Is there any room for further speed improvements?
UltraSPARC-I, UltraSPARC-II:
My size-256 FFT uses about 6300 cycles.
The most obvious bottleneck is the floating-point adder,
which needs 5008 cycles to carry out 5008 floating-point additions.
The integer units are mostly idle;
some floating-point operations could be simulated by integer operations,
or an independent integer computation could be carried out in the meantime.
The overhead can be reduced;
my current scheduling has some obvious gaps,
especially in the real transforms.
The code can be scheduled to work directly from the L2 cache,
speeding up large transforms.
Pentium, Pentium MMX:
My size-256 FFT uses 17196 cycles.
The most obvious bottleneck is the floating-point unit,
which accounts for 16112 cycles:
- 1696 cycles for multiplications,
- 5008 cycles for additions,
- 1560 cycles for register copies,
- 2672 cycles for loads, and
- 5176 cycles for 2588 stores.
Some of the loads and stores can be eliminated.
The function-call overhead can be reduced in several ways
with an asm implementation.
The number of cache misses can be reduced further for large transforms.
Pentium Pro, Pentium II, Pentium III:
My size-256 FFT uses under 12000 cycles.
I'm still trying to figure out the bottlenecks here;
I don't have an accurate Pentium Pro simulator.
I wouldn't be surprised if big speedups are still possible.
Will MMX, SSE, 3DNow, VIS, AltiVec, et al. lead to faster FFTs?
Typical circuitry to compute a double-precision floating-point product
can be adapted to compute several low-precision products simultaneously.
So it's reasonable to expect low-precision FFTs
to run more quickly than double-precision FFTs.
Vector hardware is often accompanied by faster memory access,
so it's possible that double-precision FFTs can be sped up too.