
A More Portable HeFFTe: Implementing a Fallback Algorithm for Scalable Fourier Transforms


Daniel Sharp
Center for Computational Science & Engineering
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139
[email protected]

Miroslav Stoyanov
Multiscale Methods Group
Oak Ridge National Laboratory
Oak Ridge, TN 37831
[email protected]

Stanimire Tomov
Innovative Computing Laboratory
University of Tennessee
Knoxville, TN 37996
[email protected]

Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
Knoxville, TN 37996
[email protected]

Abstract—The Highly Efficient Fast Fourier Transform for Exascale (heFFTe) numerical library is a C++ implementation of distributed multidimensional FFTs targeting heterogeneous and scalable systems. To date, the library has relied on users to provide at least one installation from a selection of well-known libraries for the single node/MPI-rank one-dimensional FFT calculations that heFFTe is built on. In this paper, we describe the development of a CPU-based backend to heFFTe as a reference, or "stock", implementation. This allows the user to install and run heFFTe without any external dependencies that may include restrictive licensing or mandate specific hardware. Furthermore, this stock backend was implemented to take advantage of SIMD capabilities on the modern CPU, and includes both a custom vectorized complex data-type and a run-time generated call-graph for selecting which specific FFT algorithm to call. The performance of this backend greatly increases when vectorized instructions are available and, when vectorized, it provides reasonable scalability in both performance and accuracy compared to an alternative CPU-based FFT backend. In particular, we illustrate a highly-performant O(N log N) code that is about 10× faster compared to non-vectorized code for the complex arithmetic, and a scalability that matches heFFTe's scalability when used with vendor or other highly-optimized 1D FFT backends. The same technology can be used to derive other Fourier-related transformations that may not even be available in vendor libraries, e.g., the discrete sine (DST) or cosine (DCT) transforms, as well as their extension to multiple dimensions and O(N log N) timing.

I. INTRODUCTION

The Fourier transform is renowned for its utility in innumerable problems in physics, partial differential equations, signal processing, systems modeling, and artificial intelligence, among many other fields [1], [2]. The transform can be represented as an infinite-dimensional linear operator on the Hilbert space of "sufficiently smooth" functions, but becomes a finite-dimensional linear operator when applied on the space of functions with compact support in the frequency domain [2]. Since any finite-dimensional linear operator can be represented as a matrix, this transformation is equivalent to a "discrete" Fourier transform (DFT) requiring O(N²) operations on a signal with N samples, bounding the performance of a DFT from above. However, it is commonly taught that the transform can be accelerated and computed in O(N log N) operations using the "Fast Fourier Transform" (FFT), a class of algorithms pioneered in the late 20th century [3]–[5].

Currently, the landscape for computing the one-dimensional FFT of a signal on one node includes many respectable implementations, including those of Intel's OneMKL initiative, NVIDIA's cuFFT, AMD's rocFFT, and FFTW [6]–[9]. This list includes implementations for both CPU and GPU devices, largely giving flexibility to a user needing to compute the FFT of a few small signals on a local machine, a few intermediate-sized signals on a robust compute device, or perhaps many independent small- and intermediate-sized signals on a larger, heterogeneous machine. However, these libraries are seldom designed for the problem of scale: as scientists desire the frequency representation of increasingly large multidimensional signals, they will at some point need to shift towards using distributed and heterogeneous machines. Creating scalable FFTs for large peta- or exascale distributed machines is an open problem, and the heFFTe [10] library has the ambition to be the most performant on this frontier.

Up to this point, the heFFTe [11] library has been fully dependent on the aforementioned one-dimensional FFT packages, requiring the user to install and link to external dependencies for both testing and production runs. Some of these libraries require abiding by non-permissive licensing agreements (e.g., FFTW) or proprietary restrictions (e.g., MKL), limiting the use of heFFTe in more sensitive or proprietary domains. Other packages require specialized hardware, e.g., a specific brand's GPU device, and even if such hardware is available on many production machines it is seldom available on the testing environments. These were prime motivations for having some fallback or reference implementation self-contained in heFFTe that was under the full jurisdiction of the
maintainers. Due to the distributed nature of the library, the speed of the algorithm is less critical compared to traditional one-dimensional FFT implementations, as the algorithm is communication and not computation bound. Therefore, the reference backend of the library stresses accuracy first with a secondary focus on speed.

This reference implementation, or "stock FFT", is not just a naïve implementation of the DFT. The fast O(N log N) algorithms are employed, and the CPU Single-Instruction Multiple-Data (SIMD) paradigm is used for complex arithmetic. The "stock FFT" implementation also works on batches of data, transforming multiple identically-sized signals at the same time, which is the primary use case within the heFFTe framework.

II. VECTORIZATION OF COMPLEX NUMBERS

Many default packages providing complex multiplication, like std::complex from the C++ standard library or complex from Python, are developed for consistency and compatibility and, thus, implement complex multiplication as the textbook definition. Given a, b, c, d ∈ ℝ, the simplest way of performing complex multiplication is via the direct evaluation of (a + bi)(c + di) = (ac − bd) + (ad + bc)i. This is generally optimal in terms of floating point operations (flops), where one complex multiplication is four floating point multiplications and two floating point additions, or six flops. However, one must note that a computer performs instructions, not flops.

Vectorization has been supported to some degree within high-performing CPUs since the 1970s, and the more modern SSE and AVX instruction sets [12], [13] have exponentially increased the possibilities for accelerating code via extended registers [14]. In most scenarios, vectorization is implemented at the assembly instruction level, and a programmer can interface with the assembly using intrinsics or wrappers in a low-level language (e.g., C, C++, FORTRAN); higher-level interfaces also exist, and many scientific computing packages use vectorization internally. Examples of vectorized instructions in AVX include basic arithmetic operations, such as element-wise adding, subtracting, multiplying, dividing, and fused multiply-add. Non-arithmetic instructions can range from simple operations, such as permuting the order of items in a vector, to complicated ideas, such as performing one step of AES encryption [13]. Many software libraries take advantage of vectorization as well as other SIMD capabilities of computers for numerical computation, and even FFT calculation [15], [16].

The CPU executes code in terms of instructions, thus it is more natural to represent an algorithm as a set of vector operations as opposed to working with individual numbers. Let x = a + bi and y = c + di and consider the product of the two complex numbers:

\begin{aligned}
x \times y &= \begin{bmatrix} a \\ b \end{bmatrix} \times \begin{bmatrix} c \\ d \end{bmatrix} = \begin{bmatrix} ac - bd \\ ad + bc \end{bmatrix} \\
&= \begin{bmatrix} ac \\ bc \end{bmatrix} + \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} bd \\ ad \end{bmatrix} \\
&= \begin{bmatrix} a \\ b \end{bmatrix} \circ \begin{bmatrix} c \\ c \end{bmatrix} + \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix} \left( \begin{bmatrix} b \\ a \end{bmatrix} \circ \begin{bmatrix} d \\ d \end{bmatrix} \right) \\
&= x \circ \left( \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix} y \right) + \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix} \left( \left( \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} x \right) \circ \left( \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix} y \right) \right)
\end{aligned}

where ∘ represents the Hadamard product (i.e., elementwise multiplication). Each operation on individual vectors can be done in one vectorized instruction and, accounting for the capabilities of fused multiply-add, complex multiplication can then be done in five vector instructions, with three of those being shuffle operations that are much cheaper than flops [13].

The advantage of the vectorization is further magnified when multiplying many complex numbers. For example, if

x = \begin{bmatrix} a_1 & a_2 \\ b_1 & b_2 \end{bmatrix}, \qquad y = \begin{bmatrix} c_1 & c_2 \\ d_1 & d_2 \end{bmatrix},

and we want to do the column-wise multiplication of x and y (i.e., find (a1, b1) × (c1, d1) and (a2, b2) × (c2, d2)), then we can use the same set of five operations but with wider registers, e.g., 256-bit AVX as opposed to 128-bit SSE. Using AVX registers and single precision, we can multiply four pairs of complex numbers in five instructions instead of doing 24 individual flops. Further, CPUs equipped with AVX-512 instructions can execute this complex multiplication on eight pairs of single-precision complex numbers and maintain five instructions.

High-level programming languages, such as C and C++, rely on the compiler to convert simple floating point operations into vector instructions, which works well in the simpler instances. However, the shuffle operations used in complex arithmetic present too much of a challenge for the commonly used compilers, e.g., see Figure 4. This is despite nearly every general-purpose CPU since 2010 supporting some degree of vector instructions and nearly all compute clusters (high-performance or otherwise) supporting these instructions extensively.

The heFFTe library currently allows the user to enable AVX abilities at compile time and employs them in its stock backend to do all complex arithmetic. The user can also enable AVX512-based complex arithmetic to further increase the library's abilities. These options tremendously increase arithmetic throughput in practice, as seen in Figure 1.

Figure 1 shows that performing arithmetic operations in batches can accelerate a complex algorithm by a significant margin. Of course, this necessitates an algorithm that can take advantage of SIMD, where the instructions are independent of the data.

III. FAST FOURIER TRANSFORMS

It is worth remarking that, once a matrix is known, all operations of a matrix-vector multiplication are known. The process of evaluating a linear operator is described independent of the data used as an input.
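This data independence is precisely what the five-instruction complex product of Section II exploits. The sequence can be sketched in portable C++; the helper names below are hypothetical stand-ins for single vector instructions, not heFFTe's actual API (the stock backend uses real AVX intrinsics and its own vectorized complex type):

```cpp
#include <array>
#include <cassert>

// One "register" holding two single-precision complex numbers in
// interleaved form {re0, im0, re1, im1}, as in a 128-bit SSE register.
using vec4 = std::array<float, 4>;

// Each helper stands in for one vector instruction.
inline vec4 mul(vec4 u, vec4 v) {  // elementwise multiply
    return {u[0] * v[0], u[1] * v[1], u[2] * v[2], u[3] * v[3]};
}
inline vec4 dup_even(vec4 v) { return {v[0], v[0], v[2], v[2]}; }   // shuffle: broadcast real parts
inline vec4 dup_odd(vec4 v) { return {v[1], v[1], v[3], v[3]}; }    // shuffle: broadcast imaginary parts
inline vec4 swap_pairs(vec4 v) { return {v[1], v[0], v[3], v[2]}; } // shuffle: swap re/im in each pair
inline vec4 addsub(vec4 u, vec4 v) {  // subtract in even lanes, add in odd lanes
    return {u[0] - v[0], u[1] + v[1], u[2] - v[2], u[3] + v[3]};
}

// (a+bi)(c+di) on two pairs at once: three shuffles plus a multiply and a
// multiply-addsub, mirroring the Hadamard-product identity for the complex product.
inline vec4 complex_mul(vec4 x, vec4 y) {
    vec4 yr = dup_even(y);             // {c, c, ...}
    vec4 yi = dup_odd(y);              // {d, d, ...}
    vec4 t = mul(swap_pairs(x), yi);   // {bd, ad, ...}
    return addsub(mul(x, yr), t);      // {ac - bd, bc + ad, ...}
}
```

On hardware with fused multiply-add, the final multiply and addsub fuse into a single fmaddsub instruction, giving the five instructions (three of them shuffles) counted above; widening the lane count gives the AVX and AVX-512 variants without changing the instruction sequence.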
Similarly, since a DFT is a finite-dimensional linear operator, all the arithmetic operations are fully determined independent of input content. As such, we can use the idea of vectorized complex numbers to perform one-dimensional FFTs in batches. Since an FFT is fully determined by the size N, two vectors of identical size will have the same sequence of operations regardless of the data they contain. As such, if we want the DFT of one single-precision signal, we can get the DFT of up to three more signals in the same number of instructions and similar time when using AVX instructions. The heFFTe library's stock backend enables and encourages this style of batching.

Fig. 1. Creating two sets of eight length-N complex vectors and timing the elementwise multiplication between the sets while scaling N to compare std::complex and heFFTe::stock::Complex, using gcc-7.3.0 with optimization flag O3 in single-precision.

The Cooley-Tukey algorithm [3] forms the foundation for computing FFTs of generic composite-length signals, batched and packed for generic vectorized computing of the FFT of many signals, as visualized in Figure 2. Assuming that the user needs to compute P FFTs of length M = mR, heFFTe splits this up into batches of size B (depending on the vectorization supported by the machine), then calls the FFTs as illustrated on each batch until all P signals have been transformed.

[Figure 2 diagram: 0. pack each row into vectorized types; 1. transpose; 2. compute batched strided FFTs; 3. scale by twiddle factors; 4. combine output.]
Fig. 2. Example of Cooley-Tukey in heFFTe.

However, the backend also includes specialized FFTs implemented to calculate signals of length M = p^ℓ where p is 2 or 3, as well as an implementation of Rader's algorithm [4] to calculate the FFT for prime-length signals. Further, the dimensions of X^T in step 1 of Figure 2 affect the speed of execution. To attempt the fastest FFT, the backend establishes a priori a call-graph of which class of FFT to call recursively and what factors to use. The fact that these call-graphs are created ahead-of-time allows the backend to cache factorization results and other information that might be costly to calculate several times over, thus alleviating some of the computational burden. Additionally, there are optimized FFT implementations for when N = p^ℓ for p = 2, 3 and when N is prime [3], [4]. An example call-graph is illustrated in Figure 3.

[Figure 3 diagram: a call-graph decomposing N = 3672 (composite) into N = 8 (2^3 FFT) and N = 459 (composite), with N = 459 further decomposed into N = 27 (3^3 FFT) and N = 17 (DFT).]
Fig. 3. An Example Call-Graph in the heFFTe Stock Backend.

A. heFFTe Integration

The heFFTe library takes as input a distributed signal spread across multiple computer nodes, then uses a series of reshape operations (implemented using MPI) to convert the distributed problem into a series of batched 1D FFT transforms. The user then selects a backend library from a collection to handle the 1D transforms, and the native stock option is part of that collection. However, unlike any of the other libraries, this one comes prepackaged with heFFTe, so the library will be usable without external dependencies. The stock backend is implemented in C++11, and the use of AVX vectorization is optionally enabled at compile time, since not all devices support the extended registers. If AVX is not enabled, the C++ standard std::complex implementation will be used. Additionally, an option is provided so the user can force-enable vectorization, e.g., when cross-compiling on a machine without AVX.

B. Implementation and Performance

The heFFTe library distributes the work associated with the FFT via the MPI standard, similar to prior work on distributed and heterogeneous FFT libraries [17]–[20]. Each MPI rank of heFFTe is tasked with performing a set of one-dimensional Fourier transforms. The new integration is built to take a set of one-dimensional signals, package them in the vectorized complex type, perform an FFT (in batches), then unload the vectorized outputs into std::complex for communication across the ranks.
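The recursive splitting that such a call-graph encodes can be illustrated with a minimal radix-2 Cooley-Tukey FFT. This is a simplified sketch, not heFFTe's actual code: the stock backend also covers radix-3, general composite, and prime lengths (via Rader's algorithm), and operates on batched, vectorized data rather than a single std::vector.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Minimal recursive radix-2 Cooley-Tukey FFT; requires x.size() to be a power of two.
std::vector<std::complex<double>> fft(const std::vector<std::complex<double>>& x) {
    const std::size_t n = x.size();
    if (n == 1) return x;  // a length-1 DFT is the identity
    // Split into even- and odd-indexed subsignals.
    std::vector<std::complex<double>> even(n / 2), odd(n / 2);
    for (std::size_t k = 0; k < n / 2; ++k) {
        even[k] = x[2 * k];
        odd[k] = x[2 * k + 1];
    }
    // Recurse on each factor; the "call-graph" of this toy version is a binary tree.
    const auto E = fft(even);
    const auto O = fft(odd);
    // Combine the halves, scaling by the twiddle factors w_n^k = exp(-2*pi*i*k/n).
    std::vector<std::complex<double>> out(n);
    const double pi = 3.14159265358979323846;
    for (std::size_t k = 0; k < n / 2; ++k) {
        const std::complex<double> w = std::polar(1.0, -2.0 * pi * double(k) / double(n));
        out[k] = E[k] + w * O[k];
        out[k + n / 2] = E[k] - w * O[k];
    }
    return out;
}
```

Creating the recursion structure and twiddle factors ahead of time, as the stock backend does with its cached call-graph, avoids recomputing this bookkeeping for every signal in a batch.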
The backend additionally uses the precision and architecture as information to batch in the largest implemented size the CPU can handle (e.g., batches of two for double precision, and four for single precision using 256-bit AVX).

Fig. 4. Performance of the heFFTe library using the stock backend with std::complex and heffte::stock::Complex numbers, single-precision.

Figure 4 shows a near tenfold increase in performance in heFFTe when using the vectorized complex numbers and a realistic benchmark¹. This shows that, all else being equal, the vectorized arithmetic's acceleration propagates through an entire call stack instead of being exclusive to some pathological benchmark.

Fig. 5. Benchmarking the stock backend versus FFTW for complex-to-complex transforms on single- and double-precision signals.

We compare performance results against FFTW [6], which is the most comparable to our implementation. Both FFTW and the stock backend allow the user to employ AVX512 for performing the FFTs with SIMD. Figure 5 shows that both the stock and FFTW backends for heFFTe are competitive in many cases, especially regarding single precision. The FFTW library is mature and extensively optimized, with better support for the given CPU's architecture. Additionally, the stock backend scales at the same rate as FFTW, so any future optimizations will most likely be minimizing overhead of the current library, as opposed to making substantial changes to the structure of the backend.

Fig. 6. Error of the complex-to-complex transform using the stock and FFTW backends on single- and double-precision signals.

The error of this fallback implementation is shown in Figure 6, which demonstrates that the error is as dependent on the problem size as the performance. The single-precision transform generally sees between one and three orders of magnitude of error, while the double-precision transform generally sees around one to two orders of magnitude of error. This error is likely attributable to a reasonable amount of floating-point rounding error accumulated while calculating twiddle factors in the transform.

When examining the behavior while strong scaling on a box with a power-of-four axis size in Figure 7, the stock backend shows a consistent match, if not improvement, in performance compared to FFTW. On lower rank counts, the single-precision implementation consistently seems to outperform FFTW. As one would expect, the two backends seem to converge to the same elapsed time as the ranks increase and the communication overhead becomes larger than the time to perform each transform.

¹All weak scaling performance was examined on cubes with side lengths of 128, 159, 198, 246, 306, and 381 on an Intel(R) Xeon(R) Gold 6140 CPU equipped with AVX-512.

IV. CONCLUSIONS AND FUTURE WORK

Creating a fallback set of FFT implementations has shown reasonable performance within heFFTe, and incorporating vectorized types accelerates the arithmetic and implementations immensely. Adding a native backend to the heFFTe software package with sufficient performance for most problems allows users flexibility, e.g., for testing, continuous integration, and even small-scale production runs. Further, the unrestrictive licensing that heFFTe provides makes it viable to incorporate the library with the stock backend into most projects, regardless of propriety or topic sensitivity. This fallback implementation is included and documented within the development version of heFFTe and will be included in the forthcoming full release version.
Fig. 7. Performance of the FFTW and stock backends for a fixed-sized signal over multiple MPI ranks, single-precision.

There are many definitive avenues for the growth and acceleration of this backend; extending support to ARM vectorized instructions would prepare the backend for the heterogeneity of high-performance computing. Other avenues for growth include testing other vectorizations for complex arithmetic and using more specialized algorithms for common, but specific, problem sizes, for example, accelerating the transform on prime-length signals. Further, the error should be reduced, which requires adjusting how twiddle factors are created in the stock implementation. Overall, this is an initial step towards allowing users of the heFFTe library further flexibility in how they use the library and what projects they can use it in.

Future work includes further optimizations and extensions to other architectures, e.g., GPUs from Nvidia, AMD, and Intel, as well as other algorithms. Of particular interest is to show that the same technology can be used to derive other Fourier-related transformations that are highly needed but not always available in vendor libraries, e.g., the discrete sine (DST) or cosine (DCT) transforms, as well as their extension to multiple dimensions and O(N log N) timing.

ACKNOWLEDGMENT

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.

REFERENCES

[1] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[2] R. N. Bracewell, The Fourier Transform and Its Applications. McGraw-Hill, 1999.
[3] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
[4] C. M. Rader, "Discrete Fourier transforms when the number of data samples is prime," Proceedings of the IEEE, vol. 56, no. 6, pp. 1107–1108, 1968.
[5] L. Bluestein, "A linear filtering approach to the computation of discrete Fourier transform," IEEE Transactions on Audio and Electroacoustics, vol. 18, no. 4, pp. 451–455, 1970.
[6] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005, special issue on "Program Generation, Optimization, and Platform Adaptation."
[7] "cuFFT library," 2018. [Online]. Available: http://docs.nvidia.com/cuda/cufft
[8] "rocFFT library," 2021. [Online]. Available: https://github.com/ROCmSoftwarePlatform/rocFFT
[9] Intel, "Intel Math Kernel Library," http://software.intel.com/en-us/articles/intel-mkl/. [Online]. Available: https://software.intel.com/mkl/features/fft
[10] "heFFTe library," 2020. [Online]. Available: https://bitbucket.org/icl/heffte
[11] A. Ayala, S. Tomov, A. Haidar, and J. Dongarra, "heFFTe: Highly Efficient FFT for Exascale," in ICCS 2020, Lecture Notes in Computer Science, 2020.
[12] "AMD64 architecture programmer's manual, volume 3: General-purpose and system instructions," Oct. 2020. [Online]. Available: https://www.amd.com/system/files/TechDocs/40332.pdf
[13] "Intel SSE and AVX intrinsics," 2021. [Online]. Available: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/
[14] R. Espasa, M. Valero, and J. E. Smith, "Vector architectures: past, present and future," in Proceedings of the 12th International Conference on Supercomputing, 1998, pp. 425–432.
[15] M. K. Stoyanov and USDOE, "HALA: Handy Accelerated Linear Algebra," Nov. 2019. [Online]. Available: https://www.osti.gov//servlets/purl/1630728
[16] D. McFarlin, F. Franchetti, and M. Püschel, "Automatic generation of vectorized fast Fourier transform libraries for the Larrabee and AVX instruction set extension," in High Performance Extreme Computing (HPEC), 2009.
[17] H. Shaiek, S. Tomov, A. Ayala, A. Haidar, and J. Dongarra, "GPUDirect MPI communications and optimizations to accelerate FFTs on exascale systems," University of Tennessee, Knoxville, Extended Abstract icl-ut-19-06, 2019.
[18] "Parallel 2d and 3d complex FFTs," 2018. Available at http://www.cs.sandia.gov/ sjplimp/download.html
[19] S. Plimpton, A. Kohlmeyer, P. Coffman, and P. Blood, "fftMPI, a library for performing 2d and 3d FFTs in parallel," Sandia National Lab. (SNL-NM), Albuquerque, NM (United States), Tech. Rep., 2018.
[20] J. L. Träff and A. Rougier, "MPI collectives and datatypes for hierarchical all-to-all communication," in Proceedings of the 21st European MPI Users' Group Meeting, 2014, pp. 27–32.
