Algorithm For Scalable Fourier Transforms
Abstract—The Highly Efficient Fast Fourier Transform for Exascale (heFFTe) numerical library is a C++ implementation of distributed multidimensional FFTs targeting heterogeneous and scalable systems. To date, the library has relied on users to provide at least one installation from a selection of well-known libraries for the single node/MPI-rank one-dimensional FFT calculations that heFFTe is built on. In this paper, we describe the development of a CPU-based backend to heFFTe as a reference, or "stock", implementation. This allows the user to install and run heFFTe without any external dependencies that may include restrictive licensing or mandate specific hardware. Furthermore, this stock backend was implemented to take advantage of SIMD capabilities on the modern CPU, and includes both a custom vectorized complex data-type and a run-time generated call-graph for selecting which specific FFT algorithm to call. The performance of this backend greatly increases when vectorized instructions are available and, when vectorized, it provides reasonable scalability in both performance and accuracy compared to an alternative CPU-based FFT backend. In particular, we illustrate a highly-performant O(N log N) code that is about 10× faster than non-vectorized code for the complex arithmetic, and a scalability that matches heFFTe's scalability when used with vendor or other highly-optimized 1D FFT backends. The same technology can be used to derive other Fourier-related transformations that may not even be available in vendor libraries, e.g., the discrete sine (DST) or cosine (DCT) transforms, as well as their extension to multiple dimensions and O(N log N) timing.
I. INTRODUCTION
The Fourier transform is renowned for its utility in innumerable problems in physics, partial differential equations, signal processing, systems modeling, and artificial intelligence, among many other fields [1], [2]. The transform can be represented as an infinite-dimensional linear operator on the Hilbert space of "sufficiently smooth" functions, but becomes a finite-dimensional linear operator when applied on the space of functions with compact support in the frequency domain [2]. Since any finite-dimensional linear operator can be represented as a matrix, this transformation is equivalent to a "discrete" Fourier transform (DFT) requiring O(N²) operations on a signal with N samples, bounding the performance of a DFT from above. However, it is commonly taught that the transform can be accelerated and computed in O(N log N) operations using the "Fast Fourier Transform" (FFT), a class of algorithms pioneered in the late 20th century [3]–[5].

Currently, the landscape for computing the one-dimensional FFT of a signal on one node includes many respectable implementations, including those of Intel's oneMKL initiative, NVIDIA's cuFFT, AMD's rocFFT, and FFTW [6]–[9]. This list includes implementations for both CPU and GPU devices, largely giving flexibility to a user needing to compute the FFT of a few small signals on a local machine, a few intermediate-sized signals on a robust compute device, or perhaps many independent small- and intermediate-sized signals on a larger, heterogeneous machine. However, these libraries are seldom designed for the problem of scale: as scientists desire the frequency representation of increasingly large multidimensional signals, they will at some point need to shift towards using distributed and heterogeneous machines. Creating scalable FFTs for large peta- or exascale distributed machines is an open problem, and the heFFTe [10] library has the ambition to be the most performant on this frontier.

Up to this point, the heFFTe [11] library has been fully dependent on the aforementioned one-dimensional FFT packages, requiring the user to install and link to external dependencies for both testing and production runs. Some of these libraries require abiding by non-permissive licensing agreements (e.g., FFTW) or proprietary restrictions (e.g., MKL), limiting the use of heFFTe in more sensitive or proprietary domains. Other packages require specialized hardware, e.g., a specific brand's GPU device, and even if such hardware is available on many production machines, it is seldom available in testing environments. These were prime motivations for having a fallback or reference implementation self-contained in heFFTe and under the full jurisdiction of the maintainers. Due to the distributed nature of the library, the speed of the algorithm is less critical than in traditional one-dimensional FFT implementations, as the algorithm is communication and not computation bound. Therefore, the reference backend of the library stresses accuracy first, with a secondary focus on speed.

This reference implementation, or "stock FFT", is not just a naïve implementation of the DFT. The fast O(N log N) algorithms are employed, and the CPU Single-Instruction Multiple-Data (SIMD) paradigm is used for complex arithmetic. The "stock FFT" implementation also works on batches of data, transforming multiple identically-sized signals at the same time, which is the primary use case within the heFFTe framework.
II. VECTORIZATION OF COMPLEX NUMBERS
Many default packages providing complex multiplication, like std::complex from the C++ standard library or complex from Python, are developed for consistency and compatibility and, thus, implement complex multiplication as the textbook definition. Given a, b, c, d ∈ ℝ, the simplest way of performing complex multiplication is via the direct evaluation of (a + bi)(c + di) = (ac − bd) + (ad + bc)i. This is generally optimal in terms of floating point operations (flops), where one complex multiplication is four floating point multiplications and two floating point additions, or six flops. However, one must note that a computer performs instructions, not flops.

Vectorization has been supported to some degree within high-performing CPUs since the 1970s, and the more modern SSE and AVX instruction sets [12], [13] have exponentially increased the possibilities for accelerating code via extended registers [14]. In most scenarios, vectorization is implemented at the assembly instruction level, and a programmer can interface with the assembly using intrinsics or wrappers in a low-level language (e.g., C, C++, FORTRAN); higher-level interfaces also exist, and many scientific computing packages use vectorization internally. Examples of vectorized instructions in AVX include basic arithmetic operations, such as element-wise addition, subtraction, multiplication, division, and fused multiply-add. Non-arithmetic instructions range from simple operations, such as permuting the order of items in a vector, to complicated ideas, such as performing one step of AES encryption [13]. Many software libraries take advantage of vectorization as well as other SIMD capabilities of computers for numerical computation, and even FFT calculation [15], [16].

The CPU executes code in terms of instructions, thus it is more natural to represent an algorithm as a set of vector operations as opposed to working with individual numbers.
Let x = a + bi and y = c + di and consider the product of the two complex numbers:

$$
\begin{aligned}
x \times y = \begin{pmatrix} a \\ b \end{pmatrix} \times \begin{pmatrix} c \\ d \end{pmatrix}
&= \begin{pmatrix} ac - bd \\ ad + bc \end{pmatrix}
 = \begin{pmatrix} ac \\ bc \end{pmatrix}
 + \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} bd \\ ad \end{pmatrix} \\
&= \begin{pmatrix} a \\ b \end{pmatrix} \circ \begin{pmatrix} c \\ c \end{pmatrix}
 + \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}
   \left( \begin{pmatrix} b \\ a \end{pmatrix} \circ \begin{pmatrix} d \\ d \end{pmatrix} \right) \\
&= x \circ \left( \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix} y \right)
 + \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}
   \left( \left( \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} x \right) \circ
          \left( \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix} y \right) \right)
\end{aligned}
$$

where ∘ represents the Hadamard product (i.e., elementwise multiplication). Each operation on individual vectors can be done in one vectorized instruction and, accounting for the capabilities of fused multiply-add, complex multiplication can then be done in five vector instructions, with three of those being shuffle operations that are much cheaper than flops [13].
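To make the instruction count concrete, the following is a minimal sketch of this five-instruction pattern using AVX and FMA intrinsics on four pairs of single-precision complex numbers stored interleaved as (real, imaginary); the helper name complex_mul4 is ours, and the stock backend's actual heffte::stock::Complex type differs in detail.

    #include <immintrin.h>

    // Multiply four pairs of single-precision complex numbers at once.
    // x and y hold (a0, b0, a1, b1, ...) with a = real, b = imaginary.
    // Five instructions total: three shuffles, one multiply, one FMA.
    static inline __m256 complex_mul4(__m256 x, __m256 y) {
        __m256 cc = _mm256_moveldup_ps(y);      // (c, c, ...): duplicate real parts of y
        __m256 dd = _mm256_movehdup_ps(y);      // (d, d, ...): duplicate imaginary parts of y
        __m256 ba = _mm256_permute_ps(x, 0xB1); // (b, a, ...): swap real/imaginary within each pair of x
        __m256 bd = _mm256_mul_ps(ba, dd);      // (bd, ad, ...)
        // even lanes: a*c - b*d, odd lanes: b*c + a*d
        return _mm256_fmaddsub_ps(x, cc, bd);
    }

Built with, e.g., -mavx -mfma, this should compile down to exactly the three shuffles, one multiply, and one fused multiply-add counted above.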
The advantage of the vectorization is further magnified when multiplying many complex numbers. For example, if

$$
x = \begin{pmatrix} a_1 & a_2 \\ b_1 & b_2 \end{pmatrix}, \qquad
y = \begin{pmatrix} c_1 & c_2 \\ d_1 & d_2 \end{pmatrix},
$$

and we want to do the column-wise multiplication of x and y (i.e., find (a1, b1) × (c1, d1) and (a2, b2) × (c2, d2)), then we can use the same set of five operations but with wider registers, e.g., 256-bit AVX as opposed to 128-bit SSE. Using AVX registers and single precision, we can multiply four pairs of complex numbers in five instructions instead of performing 24 individual flops. Further, CPUs equipped with AVX-512 instructions can execute this complex multiplication on eight pairs of single-precision complex numbers while still using five instructions.
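Widening the earlier sketch to AVX-512 changes only the register type; the pattern and the instruction count are unchanged (again an illustrative helper, not heFFTe's internal code).

    #include <immintrin.h>

    // Same five-instruction pattern, now on eight pairs of
    // single-precision complex numbers per call (requires AVX-512F).
    static inline __m512 complex_mul8(__m512 x, __m512 y) {
        __m512 cc = _mm512_moveldup_ps(y);
        __m512 dd = _mm512_movehdup_ps(y);
        __m512 ba = _mm512_permute_ps(x, 0xB1);
        return _mm512_fmaddsub_ps(x, cc, _mm512_mul_ps(ba, dd));
    }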
High-level programming languages, such as C and C++, rely on the compiler to convert simple floating point operations into vector instructions, which works well in the simpler instances. However, the shuffle operations used in complex arithmetic present too much of a challenge for the commonly used compilers, e.g., see Figure 4. This is despite nearly every general-purpose CPU since 2010 supporting some degree of vector instructions, and nearly all compute clusters (high-performance or otherwise) supporting these instructions extensively.

The heFFTe library currently allows the user to enable AVX capabilities at compile time and employs them in its stock backend to do all complex arithmetic. The user can also enable AVX-512-based complex arithmetic to further increase the library's abilities. These options tremendously increase arithmetic throughput in practice, as seen in Figure 1.

Figure 1 shows that performing arithmetic operations in batches can accelerate a complex algorithm by a significant margin. Of course, this necessitates an algorithm that can take advantage of SIMD, where the instructions are independent of the data.

Fig. 1. Creating two sets of eight length-N complex vectors and timing the elementwise multiplication between the sets while scaling N, to compare std::complex and heffte::stock::Complex; gcc-7.3.0 with optimization flag O3, single precision.

III. FAST FOURIER TRANSFORMS

It is worth remarking that, once a matrix is known, all operations of a matrix-vector multiplication are known. The process of evaluating a linear operator is described independently of the data used as an input. Similarly, since a DFT is a finite-dimensional linear operator, all of the arithmetic operations are fully determined independent of the input content. As such, we can use the idea of vectorized complex numbers to perform one-dimensional FFTs in batches. Since an FFT is fully determined by the size N, two vectors of identical size will have the same sequence of operations regardless of the data they contain. As such, if we want the DFT of one single-precision signal, we can get the DFT of up to three more signals in the same number of instructions and similar time when using AVX instructions. The heFFTe library's stock backend enables and encourages this style of batching.
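To illustrate why the operation schedule is data independent, consider a textbook radix-2 Cooley-Tukey FFT templated on the element type. This sketch is ours, not the stock backend's code, and it assumes the element type supports multiplication by a scalar twiddle factor, the role heffte::stock::Complex plays in heFFTe.

    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Minimal radix-2 Cooley-Tukey FFT. The control flow, twiddle
    // factors, and operation order depend only on the length n, never
    // on the stored values, so instantiating ElemT with a SIMD pack of
    // four complex numbers transforms four signals with the exact same
    // instruction sequence.
    template<typename ElemT>
    void fft_pow2(std::vector<ElemT>& a) {
        const std::size_t n = a.size(); // assumed to be a power of two
        const double pi = std::acos(-1.0);
        // bit-reversal permutation
        for (std::size_t i = 1, j = 0; i < n; ++i) {
            std::size_t bit = n >> 1;
            for (; j & bit; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) std::swap(a[i], a[j]);
        }
        // butterfly passes: the schedule is a pure function of n
        for (std::size_t len = 2; len <= n; len <<= 1) {
            for (std::size_t i = 0; i < n; i += len) {
                for (std::size_t k = 0; k < len / 2; ++k) {
                    const double ang = -2.0 * pi * double(k) / double(len);
                    // twiddle depends only on (len, k), never on the signal
                    const std::complex<double> w(std::cos(ang), std::sin(ang));
                    const ElemT u = a[i + k];
                    const ElemT v = a[i + k + len / 2] * w;
                    a[i + k] = u + v;
                    a[i + k + len / 2] = u - v;
                }
            }
        }
    }

With ElemT = std::complex<double> this transforms one signal; with a pack-of-four complex type it transforms four at once, which is exactly the batching described above.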
The Cooley-Tukey algorithm [3] forms the foundation for computing FFTs of generic composite-length signals, batched and packed for generic vectorized computation of the FFT of many signals, as visualized in Figure 2. Assuming that the user needs to compute P FFTs of length M = mR, heFFTe splits this up into batches of size B (depending on the vectorization supported by the machine), then calls the FFTs as illustrated on each batch until all P signals have been transformed.

Fig. 2. Example of Cooley-Tukey in heFFTe: 0. pack each row into vectorized types; 1. transpose; 2. compute batched strided FFTs; 3. scale by twiddle factors; 4. combine output.

However, the backend also includes specialized FFTs implemented to calculate signals of length M = p^ℓ where p is 2 or 3, as well as an implementation of Rader's algorithm [4] to calculate the FFT for prime-length signals. Further, the dimensions of X^T in step 1 of Figure 2 affect the speed of execution. To attempt the fastest FFT, the backend establishes a priori a call-graph of which class of FFT to call recursively and what factors to use. The fact that these call-graphs are created ahead of time allows the backend to cache factorization results and other information that might be costly to calculate several times over, thus alleviating some of the computational burden. Additionally, there are optimized FFT implementations for when N = p^ℓ for p = 2, 3 and when N is prime [3], [4]. An example call-graph is illustrated in Figure 3.

Fig. 3. An example call-graph: N = 3672 (composite) splits into N = 8 (2^3 FFT) and N = 459 (composite), which in turn splits into N = 27 (3^3 FFT) and N = 17 (DFT).
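As an illustration, such an ahead-of-time plan can be represented as a small tree whose nodes record the kernel kind and the cached factorization; the types below are a hypothetical sketch, not heFFTe's actual internal representation.

    #include <cstddef>
    #include <memory>
    #include <vector>

    // Hypothetical call-graph node: each node records which FFT kernel
    // to dispatch and the factorization computed ahead of time, so
    // repeated transforms of the same length reuse the cached plan.
    enum class fft_kind { pow2, pow3, rader_prime, composite };

    struct plan_node {
        fft_kind kind;      // which specialized kernel to call
        std::size_t length; // signal length handled by this node
        std::vector<std::unique_ptr<plan_node>> children; // sub-transforms of composite lengths
    };

    // e.g., the call-graph of Figure 3: a composite root of length 3672
    // with children {pow2: 8} and {composite: 459 -> {pow3: 27}, {rader_prime: 17}}.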
A. heFFTe Integration
The heFFTe library takes as input a distributed signal spread across multiple computer nodes, then uses a series of reshape operations (implemented using MPI) to convert the distributed problem into a series of batched 1D FFT transforms. The user then selects a backend library from a collection to handle the 1D transforms, and the native stock option is part of that collection. However, unlike any of the other libraries, it comes prepackaged with heFFTe, so the library is usable without external dependencies. The stock backend is implemented in C++11, and the use of AVX vectorization is optionally enabled at compile time, since not all devices support the extended registers. If AVX is not enabled, the C++ standard std::complex implementation will be used. Additionally, an option is provided so the user can force-enable vectorization, e.g., when cross-compiling on a machine without AVX.
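As a usage sketch of this integration (based on heFFTe's documented fft3d interface; the boxes here are illustrative and would normally differ per rank), the stock backend is selected with a template tag and requires no external FFT library:

    #include <heffte.h>
    #include <mpi.h>
    #include <complex>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        // Each rank owns a brick of the global 3D grid; identical
        // boxes are used here purely for brevity.
        heffte::box3d<> inbox({0, 0, 0}, {31, 31, 31});
        heffte::box3d<> outbox({0, 0, 0}, {31, 31, 31});

        // The stock backend: no external 1D FFT dependency required.
        heffte::fft3d<heffte::backend::stock> fft(inbox, outbox, MPI_COMM_WORLD);

        std::vector<std::complex<float>> input(fft.size_inbox());
        std::vector<std::complex<float>> output(fft.size_outbox());
        fft.forward(input.data(), output.data()); // distributed forward transform
        MPI_Finalize();
        return 0;
    }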
B. Implementation and Performance

The heFFTe library distributes the work associated with the FFT via the MPI standard, similar to prior work on distributed and heterogeneous FFT libraries [17]–[20]. Each MPI rank of heFFTe is tasked with performing a set of one-dimensional Fourier transforms. The new integration is built to take a set of one-dimensional signals, package them in the vectorized complex type, perform an FFT (in batches), then unload the vectorized outputs into std::complex for communication across the ranks. The backend additionally uses the precision
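The package/unload steps described above amount to interleaving several signals into a pack-of-four layout and back; the following scalar sketch uses hypothetical names (pack4, pack_four) to stand in for the backend's conversion into its vectorized complex type.

    #include <array>
    #include <complex>
    #include <cstddef>
    #include <vector>

    // Stand-in for the vectorized complex type: the k-th sample of four
    // different signals, which an AVX build would keep in one register.
    struct pack4 {
        std::array<std::complex<float>, 4> v;
    };

    // The "package" step: interleave four equal-length signals so each
    // pack4 holds one sample from every signal; the inverse loop is the
    // "unload" step performed before MPI communication.
    std::vector<pack4> pack_four(const std::complex<float>* s0,
                                 const std::complex<float>* s1,
                                 const std::complex<float>* s2,
                                 const std::complex<float>* s3,
                                 std::size_t n) {
        std::vector<pack4> out(n);
        for (std::size_t k = 0; k < n; ++k)
            out[k].v = {s0[k], s1[k], s2[k], s3[k]};
        return out;
    }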
Fig. 4. Performance of the heFFTe library using the stock backend with std::complex and heffte::stock::Complex numbers, single-precision.

Fig. 5. Benchmarking the stock backend versus FFTW for complex-to-complex transforms on single- and double-precision signals.