
Received 15 December 2022, accepted 30 January 2023, date of publication 3 February 2023, date of current version 8 February 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3242240

VkFFT - A Performant, Cross-Platform and Open-Source GPU FFT Library

DMITRII TOLMACHEV, (Graduate Student Member, IEEE)
Institute of Geophysics, ETH Zürich, 8092 Zürich, Switzerland
e-mail: [email protected]
This work was supported in part by the ETH Research Commission under Grant ETH-03 21-1, in part by the European Research Council
through the Horizon 2020 Programme under Grant 833848-UEMHP (PI A. Jackson), and in part by the Platform for Advanced Scientific
Computing (PASC) initiative of ETH Zürich through the Award of Project AQUA-D (PI A. Jackson).

ABSTRACT The Fast Fourier Transform is an essential algorithm of modern computational science. The highly parallel structure of the FFT allows for its efficient implementation on graphics processing units (GPUs), which are now widely used for general-purpose computing. This paper presents VkFFT, an efficient GPU-accelerated multidimensional Fast Fourier Transform library for Vulkan/CUDA/HIP/OpenCL/Level Zero and Metal projects. VkFFT aims to provide the community with a cross-platform open-source alternative to vendor-specific solutions while achieving comparable or better performance. This paper also discusses the optimizations implemented in VkFFT and verifies its performance and precision on modern high-performance computing GPUs. VkFFT is released under an MIT license.

INDEX TERMS FFT, GPU, parallel computing, Vulkan, CUDA, HIP, OpenCL, Level Zero, Metal.

The associate editor coordinating the review of this manuscript and approving it for publication was Tomas F. Pena.

I. INTRODUCTION
The recent advances in graphics processing units (GPUs) in terms of their computational power, data transfer speed and available memory size make them extremely appealing for general-purpose use, expanding their original rendering and visualization target tasks. The single instruction - multiple threads (SIMT) design approach used in GPUs has proven extremely effective in data-level parallel tasks. In this approach, threads are grouped in warps, which are then operated by a scheduler unit so that all threads from a warp perform the same operation on thread-local registers. A common number of threads in a warp is 32 or 64 for modern GPU architectures [1].

One such highly parallel task is a multidimensional Fast Fourier Transform (FFT) [2]. The FFT algorithm is widely used in signal-processing applications to extract the frequency properties of a given input signal. Another major use case of the FFT is to calculate big convolutions in convolutional neural networks (CNNs) and image analysis workloads using the convolution theorem. The FFT is also used as a base algorithm for all other polynomial transforms, like the Discrete Cosine Transform (DCT) or the Spherical Harmonics Transform (SHT). The major speedup is achieved through the reduction of the complexity from O(N^2) to O(N log N), which results in a substantial decrease in computational effort [3].

There are multiple application programming interfaces (APIs) available for GPU computing and corresponding FFT implementations. The most known and used for GPU compute tasks, the Nvidia Compute Unified Device Architecture (CUDA) platform, has been in development for over a decade [4]. It has a sophisticated ecosystem and is refined to achieve the best results on Nvidia GPUs. Nvidia's Math library collection has an FFT implementation, cuFFT, which has high performance and is widely used for building commercial and research applications [5]. However, the Nvidia CUDA platform can only be used on Nvidia GPUs, which limits the portability of the code. Another drawback of using Nvidia's Math library collection is that it is closed-source and cannot be optimized at the code level for specific tasks.

AMD has its own API, the Heterogeneous Interface for Portability (HIP), and a related software stack, called Radeon Open Compute (ROCm), which is modeled to be as close to CUDA


as possible, so developers can port their code more easily [6]. However, it is less mature than CUDA, resulting in worse functionality and performance. For example, their GPU FFT library, rocFFT, is significantly slower than cuFFT (on a similar level of HPC GPUs).

For open-source and cross-platform computations, the Khronos Group maintains the OpenCL framework (originally developed by Apple), which aims to unify programming on heterogeneous platforms, including GPUs, CPUs and FPGAs [7]. It also has its own FFT implementation, clFFT, which stopped development in 2016. OpenCL is undermined by inferior driver support, as all the vendors are mainly interested in the development of their own APIs.

Intel has also released the Level Zero API for their upcoming GPUs and is also developing a GPU version of their CPU Math Kernel Library, oneMKL, aimed at their GPUs [8]. Apple is developing its own proprietary API called Metal for its operating systems that target GPUs, like Apple's M-series systems on chip (SoC) [9]. Apple's Accelerate Framework at the moment of writing this paper only had a CPU implementation of the FFT algorithm, with no GPU support plans announced.

The Vulkan API is a low-overhead, cross-platform 3D graphics and computing API, initially released in 2016 with major updates in 2018 and 2020 [10], [11]. Vulkan allows for better and lower-level control of the GPU than its predecessor, OpenGL. Vulkan is backed by the big gaming industry (unlike OpenCL), which results in good driver support from all vendors - Vulkan code can be launched on Nvidia, AMD, Intel and mobile GPUs (and on a compatibility layer). Vulkan is released as open-source, so everybody can contribute to its development. However, a good math library collection in Vulkan has not yet been created.

This paper proposes an implementation of the FFT algorithms, VkFFT, that can work with all the aforementioned programming interfaces. The VkFFT source code can be found on GitHub: https://github.com/DTolm/VkFFT.

The paper structure is as follows. Section II introduces background knowledge on the Discrete Fourier Transform and the algorithms implemented in VkFFT. Section III describes the GPU architecture and what needs to be considered when designing performant GPU libraries. Section IV contains the VkFFT structure details and describes a novel approach to GPU runtime kernel optimization. Section V is dedicated to algorithms and optimizations implemented in VkFFT. Section VI gives the benchmark results and performance analysis of VkFFT compared to cuFFT and rocFFT. Section VII contains the precision verification details of VkFFT, also comparing it to cuFFT and rocFFT. Section VIII is dedicated to the paper's conclusion.

II. FAST FOURIER TRANSFORM ALGORITHMS OVERVIEW
A. DISCRETE FOURIER TRANSFORMS AND COOLEY-TUKEY ALGORITHM
The Fourier transform of a sequence is called the Discrete Fourier Transform (DFT). It is defined by the following formula:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N}nk} = \mathrm{DFT}_N(x_n, k),$$

where x_n is the input sequence, N is the length of the input sequence and k ∈ [0, N − 1], k ∈ Z is the output index, corresponding to the frequency in Fourier space. Correspondingly, the inverse DFT is defined as:

$$x_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k e^{+\frac{2\pi i}{N}nk} = \mathrm{IDFT}_N(X_k, n).$$

The fastest known method of obtaining the DFT of a sequence of arbitrary dimensionality is the Fast Fourier Transform algorithm. The most commonly used FFT strategy is called the Cooley-Tukey algorithm [2]. It is based on a divide-and-conquer strategy that breaks the DFT into a sequence of smaller-size DFTs. The idea behind it is to reformulate the DFT of a sequence with composite size N = N1 · N2 as a combination of smaller-sized DFTs:
1) Perform a digit-reversal permutation to reorder data. If performed before stage one, the scheme is called Decimation in Time (DIT); if after all stages, Decimation in Frequency (DIF).
2) Perform N1 DFTs of size N2. Here we assume N1 to be smaller than N2.
3) Perform O(N) multiplications by twiddle factors - complex roots of unity defined by the radix. This stage is needed to combine the results of the smaller DFTs, so we can get the DFT of the input sequence. It can also be decomposed and merged with stage four.
4) Perform N2 DFTs of size N1. Here N1 is a small factor and is designated as the radix of the current transformation.

The N1-FFT stages and a twiddle multiplication require a small number of arithmetic operations (if N1 is sufficiently small, we can write the DFT algorithm explicitly) and can be performed a large number of times (if the size of the input sequence N is large). These two stages will be called a step of the FFT in this paper. Applying these steps recursively to the N2-DFT and further will result in the decomposition of the size-N DFT as a combination of smaller DFTs. For the simplest scenario, where N is a power of two and N1 = 2, this decomposition will result in the final algorithm containing only N/2 length-2 DFTs (called radix-2 butterflies) and twiddle multiplications performed log2(N) times, plus a single digit-reversal permutation to reorder data. This results in the total complexity of O(N log N). A similar approach can be applied to other radix sizes and their combinations. To compute an inverse FFT, we have to calculate twiddle factors with negative complex roots of unity and normalize the result; otherwise the algorithm remains unchanged. Multidimensional FFTs can be performed for each axis separately as a set of 1D transforms.
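To make the four stages concrete, the scheme for N1 = 2 (DIT) can be condensed into a few lines of reference code. This is an illustrative host-side sketch in CUDA-compilable C++, not VkFFT's implementation, which generates unrolled GPU kernels instead:

```cuda
// Minimal radix-2 decimation-in-time Cooley-Tukey recursion (host-side,
// illustrative only). Real GPU FFTs batch the butterflies across threads and
// precompute the twiddle factors, as described in the text.
#include <complex>
#include <cmath>

static void fft_radix2(const std::complex<double>* x, int N, int stride,
                       std::complex<double>* out) {
    if (N == 1) { out[0] = x[0]; return; }
    // Stage "N1 DFTs of size N2" with N1 = 2: even and odd subsequences.
    fft_radix2(x, N / 2, 2 * stride, out);
    fft_radix2(x + stride, N / 2, 2 * stride, out + N / 2);
    for (int k = 0; k < N / 2; k++) {
        // Twiddle factor: complex root of unity combining the two halves.
        std::complex<double> w = std::polar(1.0, -2.0 * M_PI * k / N);
        std::complex<double> e = out[k], o = w * out[k + N / 2];
        out[k] = e + o;          // radix-2 butterfly
        out[k + N / 2] = e - o;
    }
}
```

Calling fft_radix2(in, N, 1, out) for a power-of-two N performs the full decomposition; the stride argument plays the role of the digit-reversal reordering.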
B. STOCKHAM ALGORITHM
In the Cooley-Tukey algorithm, the digit-reversal is performed once - before or after performing all steps of the


FIGURE 1. Rader's FFT algorithm step-by-step description. This figure shows how an FFT of length 11 can be computed as a convolution of length 10. The convolution is performed with the help of the convolution theorem (light green), which can be done efficiently with the Stockham algorithm for primes from 2 to 13.

FIGURE 2. Bluestein's FFT algorithm step-by-step description. This figure shows how an FFT of length N can be computed as a convolution of length N* ≥ 2N − 1. The convolution is performed with the help of the convolution theorem (light green), which can be done efficiently with the Stockham algorithm for primes from 2 to 13.

recursive algorithm. Another option is to have intermediate digit-reversals after each small radix FFT. This approach is called the Stockham version of the Cooley-Tukey algorithm and is commonly used in GPU FFT libraries due to its better cache locality at each step [12].

C. THE FOUR-STEP FFT ALGORITHM
The Cooley-Tukey algorithm described above can also be used to represent one big 1D FFT sequence as a 2D FFT of smaller sizes [13]. This method is called the Four-step FFT and is used when the full FFT sequence is too big to be stored in the L1 cache of the GPU, or to split the workload between multiple GPUs. Firstly, we compute FFTs along the columns. Then we multiply the sequence by twiddle factors and perform FFTs along the rows. The digit-reversal stage can be seen as an N2 × N1 transposition in this case and is usually done after the row FFTs to preserve locality.
usually done after the row FFTs to preserve locality. of Rader’s algorithm has another limitation - we rely on
being able to compute a P − 1 length sequence FFT with a
D. RADER’s AND BLUESTEIN’s FFT ALGORITHMS regular Stockham algorithm approach. This is not the case for
The parallelization of Cooley-Tukey and Stockham algo- Sophie-Germain safe primes (numbers like 47, 59, 83) and
rithms is usually done with a single thread (either CPU or sequences divisible by them.
GPU) responsible for a single N1 radix kernel. Then we can For these numbers, another general algorithm can be used -
have N2 = N ÷ N1 threads working at the same time. Bluestein’s algorithm (Fig. 2). This algorithm works for arbi-
To achieve good occupancy on GPUs, N2 has to be big (at trary sequence size N through reformulation of the DFT
least the number of available cores). However, this is often definition:
not the case when N1 is a big prime (for example, when n2 k2 (n − k)2
the full sequence length N is a prime). Another issue is nk = + −
2 2 2
that the generation of optimized radix kernels for big primes N −1
is a non-trivial task in itself. Luckily, there exist two FFT πi 2 X πi 2 πi 2
Xk = e− N k · (xn e− N n )e N (n−k)
algorithms that solve this issue: Rader’s FFT algorithm and
n=0
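The reorder step and the convolution kernel above can be precomputed on the host with a few lines. This is an illustrative sketch, not VkFFT's generator; it assumes P is prime and g is a known generator, which a library has to determine when building the plan:

```cuda
// Builds the Rader permutation a_l = x[g^l mod P] and the convolution kernel
// b_l = exp(-2*pi*i*g^l/P) for a prime P with multiplicative generator g.
#include <complex>
#include <cmath>

static void rader_setup(const std::complex<double>* x, int P, int g,
                        std::complex<double>* a, std::complex<double>* b) {
    long long pw = 1; // g^0 mod P
    for (int l = 0; l < P - 1; l++) {
        a[l] = x[pw];                                         // reorder step
        b[l] = std::polar(1.0, -2.0 * M_PI * (double)pw / P); // Rader kernel
        pw = (pw * g) % P;                                    // next power of g
    }
}
```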
Bluestein’s FFT algorithm [14], [15].
Rader’s FFT algorithm computes the FFT of a prime-sized To get the DFT of an input sequence, we then have to calculate
length P as a P − 1 length cyclic convolution (Fig. 1). The the convolution between the following two sequences (N ∗
main idea behind the algorithm is that residuals [1, P − 1] FFT to IFFT steps in Fig. 2):
πi 2
form a multiplicative group of integers modulo P with a gen- an = xn e− N n
erator g. Below, all generator powers are calculated modulo πi 2
P. Rader then reformulates the FFT in question (reorder steps bn = e N (n−k)
in Fig. 1) as: If we pad them both with zeros to a sequence size higher
P−1
X than 2N − 1 (FFT of which can be done with a Stockham
X0 = xl algorithm) we can remove the circular part of the convolu-
l=0 tion and compute the convolution by using the convolution


theorem. The b_n sequence and its DFT can be precomputed. This algorithm covers all the remaining primes and composite sequences. As Bluestein's algorithm performs convolutions of length N* ≥ 2N − 1, as opposed to Rader's N − 1, it is expected to have lower performance. While usually true, this is at times not the case, as Bluestein's algorithm's length can be chosen to have only small primes, like 2 and 3, allowing for better parallelization.
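A host-side sketch of the chirp construction (illustrative, not VkFFT code; Npad is any radix-decomposable length ≥ 2N − 1, chosen as discussed later in the paper):

```cuda
// Bluestein chirp sequences: a_n = x_n*exp(-pi*i*n^2/N) (zero-padded) and the
// circular kernel built from b_n = exp(+pi*i*n^2/N), mirrored so that the
// cyclic convolution of length Npad reproduces the linear convolution terms.
#include <complex>
#include <cmath>

static void bluestein_setup(const std::complex<double>* x, int N, int Npad,
                            std::complex<double>* a, std::complex<double>* b) {
    for (int n = 0; n < Npad; n++) { a[n] = 0.0; b[n] = 0.0; }
    for (int n = 0; n < N; n++) {
        double phase = M_PI * (double)n * (double)n / N;
        a[n] = x[n] * std::polar(1.0, -phase);   // chirp-modulated input
        b[n] = std::polar(1.0, phase);           // convolution kernel
        if (n > 0) b[Npad - n] = b[n];           // wrap for negative indices
    }
    // Then: FFT(a), multiply by the precomputed FFT(b), IFFT, and multiply
    // the first N outputs by exp(-pi*i*k^2/N) to obtain X_k.
}
```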
E. REAL TO COMPLEX FFTs
If there is prior knowledge of the input structure, some optimizations to the regular algorithms can be made. Real-to-complex (R2C) and complex-to-real (C2R) transforms can be explained as complex-to-complex (C2C) transforms with the imaginary part set to zero [16]. They exploit the Hermitian symmetry of the result: X_k = X*_{N−k}. This results in the reduction of the required memory to store the complex result - we may only store floor(N_x/2) + 1 complex numbers instead of N_x. The computational complexity can also be reduced: two real sequences can be packed as one complex sequence, or an even real sequence can be calculated as a complex sequence of half the length. Overall, these two optimizations result in a 2x speedup over the regular C2C transformation.
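For reference, the "two real sequences packed as one complex sequence" trick unpacks with the Hermitian symmetry alone (a standard identity, stated here for completeness rather than as VkFFT-specific notation): if z_n = x_n + i y_n and Z = DFT_N(z), then

$$X_k = \frac{1}{2}\left(Z_k + Z^{*}_{N-k}\right), \qquad Y_k = \frac{1}{2i}\left(Z_k - Z^{*}_{N-k}\right),$$

with indices taken modulo N, recovering the DFTs of both real sequences from a single complex transform of the same length.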
F. REAL TO REAL FFTs
Real-to-real transforms in VkFFT are implemented in the form of Discrete Cosine Transforms of types I, II, III and IV. Their definitions and transforms match the FFTW implementation [17], [18], [19], [20]:

1) DCT-I: $X_k = x_0 + (-1)^k x_{N-1} + 2\sum_{n=1}^{N-2} x_n \cos\left(\frac{\pi nk}{N-1}\right)$, the inverse of DCT-I (itself)
2) DCT-II: $X_k = 2\sum_{n=0}^{N-1} x_n \cos\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right)$, the inverse of DCT-III
3) DCT-III: $X_k = x_0 + 2\sum_{n=1}^{N-1} x_n \cos\left(\frac{\pi}{N}n\left(k+\frac{1}{2}\right)\right)$, the inverse of DCT-II
4) DCT-IV: $X_k = 2\sum_{n=0}^{N-1} x_n \cos\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)\left(k+\frac{1}{2}\right)\right)$, the inverse of DCT-IV (itself)

R2R transforms are performed by redefining them as C2C transforms (the internal C2C sequence length can be different from the input R2R sequence length).
III. GPU ARCHITECTURE IN-DEPTH ANALYSIS
Before discussing VkFFT's GPU-specific design choices, it is important to specify which types of hardware are accessible by the GPU and what limitations they have. We will focus on the two most important types of circuitry: compute-related and memory-related.

A. ON-CHIP COMPUTE CIRCUITRY
Different vendors have different GPU compute unit designs; however, as all of them are based on the SIMT architecture, there are plenty of similarities. The main outcome of this design approach is that there is only one scheduler and dispatch unit that manages a relatively large group of cores. This saves space and allows the manufacturers to pack thousands of cores inside a single GPU. We will take Nvidia's compute unit (streaming multiprocessor) as an example hereon [21], [22], [23]. We will omit the discussion of all graphics-related circuitry in this section.

We start with a short description of the scheduler and the dispatch unit. Their job is to provide work assignments in the form of warps to the hardware cores. They can manage multiple kernels on a single GPU and take care of all available compute unit resources.

The source of compute power inside GPUs is their numerous cores. They have a pipelined structure: dispatch port, operand collector, arithmetic unit and result queue [24]. The pipelined structure means that the core can queue new instructions while previous ones are still active or being outputted. The arithmetic unit is often specialized - for integer, single-precision or double-precision floating point operations. Different operations require different numbers of clock cycles to complete. If a warp encounters a bank conflict, it will be stalled, increasing the number of cycles needed to maintain the convergence of the lanes inside a warp.

Often compute units contain specialized circuitry for some specific operations that is useful but too big to fit in each core. Examples of this are special function units (SFUs) and tensor cores. The first allow for fast calculation of single-precision transcendental functions and the second can perform matrix operations [25], [26], [27].

The last type of non-memory circuitry present in compute units is the load/store unit. Their main job is handling memory transfers to cores from the shared memory, caches and global memory. They are designed to feed multiple cores at the same time (according to the SIMT architecture), which results in strict memory coalescing requirements. Nvidia GPUs have 8 load/store units that can process data in 32-byte chunks. The load/store units open up another section of discussion: the different types of memory available on a GPU.

B. GPU MEMORY TYPES
The main principle of the memory hierarchy is: the closer something is located to the cores, the faster the core's access to it will be, and the smaller its size. The overall ranking of memory speeds can be seen in Fig. 3 [23].

FIGURE 3. GPU memory hierarchy. This figure shows the idealized relative scaling of speeds of the GPU memory. The values of SM/L1/L2 and global memory are from the V100 GPU; the value of L3 bandwidth is from the AMD 6900XT GPU (RDNA 2).

1) REGISTER FILE
The fastest available memory on the GPU is located on-chip. It has been limited to 256KB for multiple generations of GPUs, as this size corresponds to the best achievable by


the current technology level as a balance between register file latency (which increases with the size) and the number of resources that warps can access (which also increases with the size) [23]. Depending on the architecture of the GPU, the register file is split into multiple banks, each with a different number of ports capable of handling a single 32-bit memory request per clock cycle. It is possible to exchange data between registers of threads from a single warp via specialized subgroup/warp operations; otherwise, registers are thread-local and can only be accessed by the corresponding thread. If threads try to use more registers than the register file can provide, register spilling occurs and data is cached to the global memory at a much lower bandwidth.

2) L0 INSTRUCTION CACHE
This is the cache that stores instructions to be issued by the scheduler and the dispatch unit. Usually, there is no need for interaction from the developer side with this cache, other than avoiding large conditional code jumps inside the kernel - in this case some of the code may not be prefetched to the cache, resulting in additional stalls [23].

3) SHARED MEMORY/L1 CACHE
The design of a combination of shared memory and L1 cache into a single entity with a configurable partition has proven to be effective and has been used by Nvidia (and other vendors) since the Volta architecture. This data cache also resides on-chip and offers high bandwidth and low latency: 28 cycles for an L1 cache hit and 19 cycles for shared memory without bank conflicts on Nvidia's V100, according to [23]. Unlike the register file, all of this memory can be seen and accessed by all threads in a block, which is useful for thread communication purposes. This is achieved by having a larger number of banks so that each of them can process memory requests in parallel - the latest Nvidia GPUs have 32 banks with a 4-byte bank width. If the number of banks is equal to the number of threads and each thread accesses a different bank (or the same word from one bank), the data are transferred in one request and the bandwidth is similar to the register file bandwidth. In the case of a bank conflict, where multiple threads require different data from a single bank, requests become serialized, which can lead to a big bandwidth decrease, up to the case when all threads access one bank. Shared memory has to be explicitly synchronized to avoid race conditions. In Vulkan, static shared memory allocations are limited to 48KB per thread block on Nvidia GPUs, 64KB for AMD GPUs and 32KB-64KB for Intel GPUs [23], [26], [28]. The CUDA and HIP APIs, however, allow dynamic shared memory allocations, increasing this size up to 96KB for V100, 192KB for A100 and 256KB for H100 GPUs, which is extremely helpful for processing more data in a single upload to the chip (at the cost of a lower number of active warps per compute unit/streaming multiprocessor). VkFFT is optimized to use all shared memory available per thread block. The L1 cache of most vendors uses a 32-byte cache line with a 128-byte update granularity, so that every thread from a 32-sized warp gets one 32-bit word per request [23].

4) L2 CACHE (AND L3 CACHE)
This is an intermediate cache between the L1 cache and global memory. It has a bigger size than the L1 cache (usually 1-5MB; 6MB on V100, 40MB on A100, 50MB on H100) and faster speeds and lower latency than global memory - 193 cycles per hit on V100 [23], [25]. It is located on-chip but is not compute unit specific - all compute units are connected to it. The L2 cache is used as an intermediate buffer: data requested or written by the GPU goes to global memory through the L2 cache, until new information is written over it. On recent Nvidia and AMD GPUs, the L1-L2 cache line is 64 bytes wide. On a cache miss, each sector can be filled with a data transaction from global memory, served in 32-byte granularity - this allows coalescing requirements to be less strict and also eases the strided access performance impact, as the L2 cache will still be able to combine separate requests to fully utilize on-chip cache lines with 128-byte update granularity. On Intel integrated GPUs, all transactions are performed with 64-byte granularity [28]. It is worth noting that the modern GPU generation, such as Nvidia's Ampere and AMD's RDNA2, made a move towards bigger-than-before intermediate cache sizes (40MB L2 cache for the Nvidia DGX A100 and 128MB L3 cache for the AMD Radeon RX 6000 series) and allows for better control of them.

5) GRAPHICS CARD GLOBAL MEMORY
Many GPUs have access to big amounts of high-bandwidth dedicated memory; however, it is located off-chip and poses a big latency hit (> 200 cycles per request). Typical values of bandwidth that global memory provides are 1.5TB/s for HBM2 memory, 500GB/s for GDDR6 and 1TB/s for GDDR6X


memory (specification values) [23]. Integrated GPUs often use memory reserved from the system's random access memory (RAM) and are dependent on its bandwidth - the latest Apple M1 Pro systems use LPDDR5 technology and can achieve up to 200GB/s of bandwidth per chip. Data transactions from global memory are served in 32/64-byte granularity, so to fully utilize bandwidth it is important to avoid strided accesses from threads to global memory. If all threads from a warp request data from one global memory page and these requests correspond to multiple consecutive 32/64 bytes in memory, each of these transfers can be performed in one transaction and then combined into a single 128-byte transaction in the L1 cache line update. This technique is called memory coalescing and it is essential for memory-bound problems. An important note is that accesses to different VRAM pages often cannot be combined by the L2 cache into a single 128-byte transaction in the L1 cache line. In this case, memory accesses have to be requested with a 128-byte granularity from the beginning.
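A toy kernel pair illustrating the difference (a generic CUDA sketch, not VkFFT code): the first version lets consecutive threads read consecutive elements, so accesses coalesce into full transactions; the second makes each thread read with a stride and breaks coalescing.

```cuda
// Coalesced: thread t of a warp reads element base+t, so a 32-thread warp
// touches 128 consecutive bytes - one combined transaction per warp.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighboring threads read elements `stride` apart, so each thread
// lands in a different 32-byte sector and the warp issues many transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long j = (long long)i * stride;
    if (j < n) out[i] = in[j];
}
```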
Another source of memory problems related to global memory is L2 cache port serialization, and it is especially relevant to AMD GPUs [29]. If the memory accesses have very large distances in global memory between them (more than 256KB-64MB), even if they are locally coalesced, they will use only a limited number of cache ports and have lower total bandwidth. There are multiple ways to combat port serialization, all of which try to increase the locality of data accesses as much as possible. First, it can be reduced by increasing the number of bytes coalesced. Secondly, avoiding reordering in the Four-step FFT algorithm also achieves similar results. Overall, it has not yet been solved fully in VkFFT.

On Nvidia, distant coalesced accesses can result in a 2x reduction of L2 bandwidth if they are not 32-byte aligned - for example, in the case of odd-length Four-step FFTs. In this case, the memory controller often requires two instructions per memory request.

6) SYSTEM MEMORY
This is the computer's primary storage, which is connected through a PCI-E lane to a discrete GPU, or through a memory controller if the GPU is integrated. The PCI-E bandwidth is usually at least an order of magnitude lower than the dedicated memory bandwidth, so it is best to have as few requests to system memory during execution as possible.
C. GPU ALGORITHM BOTTLENECKS
There are multiple bottlenecks that can be encountered in GPU programming which correspond directly to the raw computational power of the GPU and the data transfer speeds. The first type is the compute-bound bottleneck, when a GPU is limited by how many operations it can perform. Usually, in this case, the bus load is low and GPU core utilization is high. The main approaches to alleviating this bottleneck would be to reduce the number of performed operations via algorithm refinement, get a more powerful GPU, or switch to a multiple GPU setup. However, recent HPC GPUs have tens of thousands of cores and usually can perform 10-100 mathematical operations per single global memory request.

This brings us to the second type of GPU problem: global memory bandwidth-bound problems. They happen when a GPU is limited by how much information it can transfer from its main VRAM pool per second. In this case, the L2-global bus load will stay high (often close to 100% of the theoretical peak) throughout the whole execution time, and the time taken only by data transfers will constitute the largest portion of the overall time spent. Algorithm refinement can still be helpful in this case; however, it has to be performed from the standpoint of reducing memory transfers, even at the cost of increasing the number of operations. A more powerful GPU will only gain performance scaling with the global memory bandwidth in this case, often independent of the number of compute units.

Moving up the bandwidth hierarchy, it is worth noting another intermediate type of problem: shared memory bandwidth-bound bottlenecks. They occur when an algorithm is optimized to split the workload efficiently between compute units, not requiring much communication between them, but requiring large numbers of data transfers between threads within a compute unit. These algorithms are often much more complex than the ones in the previous two categories, so the best solution there is to improve shared memory communication patterns. They also scale almost linearly with the number of compute units, so getting a more powerful GPU (the same solution as for compute-bound problems) works in this case as well.

Some FFT-related algorithms that illustrate the three mentioned bottlenecks are given below:
1) Pure compute-bound problems: polynomial expansion sin/cos calculations, FP64 calculations on systems with a low FP64:FP32 core ratio.
2) Pure global memory bandwidth-bound problems: Stockham FFT with small primes.
3) Shared memory bandwidth-bound problems: Rader's algorithm (especially its direct multiplication version), Bluestein's algorithm.

Section V will go into detail on how VkFFT tackles these problems.

IV. THE STRUCTURE OF VkFFT
This section will describe the implementation of the VkFFT library, focusing on the runtime kernel generation platform it is based on and how it allows combining and optimizing all of the implemented GPU algorithms.

A. GENERAL VkFFT DESCRIPTION
The VkFFT library is released under an MIT license as a header-only interface library written in C. It generates device-optimized GPU code at run-time, with an option to reuse the code.


VkFFT adopts the usual multidimensional memory layout: data is stored in the following order (sorted by increasing strides): the width, the height, the depth, the coordinate (the number of feature maps) and the batch number.

Similarly to other GPU FFT libraries, VkFFT uses the Stockham algorithm to avoid a separate digit-reversal stage. VkFFT uses a radix-decomposition approach for sequences decomposable as a multiplication of an arbitrary number of the following primes: 2/3/5/7/11/13. These sequences have comparable performance to that of powers of 2.

VkFFT uses the convolution theorem version of Rader's FFT algorithm for primes from 17 up to the maximum shared memory length (~10000). VkFFT uses the direct multiplication Rader FFT algorithm for small Sophie Germain safe primes: 47, 59 and 83. Both versions are inlined and can be viewed as an extension to radix kernels computed with multiple threads per prime. Rader's algorithm works without additional memory transfers, except for the look-up tables (LUT) containing Rader's kernel values.

Bluestein's FFT algorithm is used for all other sequences. It is optimized to have as few memory transfers as possible by using the zero-padding and merged convolution support of VkFFT. VkFFT also allows choosing which sequence to pad to in Bluestein's algorithm.
VkFFT supports complex-to-complex (C2C), real-to-complex (R2C), complex-to-real (C2R) transformations and real-to-real (R2R) Discrete Cosine Transformations of types I, II, III and IV.

VkFFT at the moment of this paper's creation is a single-GPU implementation with the following current FFT-size limits in (x, y, z) dimensions, respectively: C2C or even-length C2R/R2C - (2^32, 2^32, 2^32); odd-length C2R/R2C - (2^12, 2^32, 2^32); R2R - (2^12, 2^12, 2^12). These values depend on the amount of shared memory of the device.

VkFFT supports single, double and half precision (the latter only used for data storage, with computations performed in single precision). VkFFT has two modes of operation: it can calculate the sines and cosines used in twiddle factors directly on the GPU (using special function units (SFUs), if the GPU has them, or using a polynomial approximation of transcendental functions), or use values precomputed on the CPU and stored in the LUT. The CPU precomputation can be done in higher precision, for example FP128. The LUT approach allows GPUs to get better precision (it is the default mode of operation in FP64) and, for GPUs with a low double-precision core count, to reduce the amount of performed arithmetic operations.
Real-to-complex, complex-to-real and real-to-real transforms are optimized in the multidimensional case by packing two consecutive sequences as the real and imaginary parts of a complex sequence. The packing and unpacking do not require additional memory transfers and reduce the number of performed transforms by a factor of two. If an FFT cannot be performed in one upload to the chip, the Four-step FFT algorithm is used to split the sequence into a combination of smaller FFTs. VkFFT performs the FFT in-place, with an option to use an additional buffer to do an out-of-place transform or perform the transposition in the Four-step FFT algorithm.

VkFFT has the option to optimize generated kernels for convolution calculation and system zero-padding with enhanced performance. VkFFT supports 1 × 1, 2 × 2 and 3 × 3 matrix convolutions with a symmetric or nonsymmetric kernel, and multiple feature/batch convolutions: one input, multiple kernels. VkFFT supports native zero-padding to model open systems. One can specify the range of sequences filled with zeros and the direction in which zero-padding is applied (at the read or write stage).

VkFFT works on Nvidia, AMD, Intel and Apple GPUs. VkFFT supports all GPUs, ranging from low-power mobile GPUs up to HPC GPUs. VkFFT works on Windows, Linux and macOS.

VkFFT supports Vulkan, CUDA, HIP, OpenCL, Level Zero and Metal APIs as backends.
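As a usage reference, the following is a minimal sketch against the public header. The field and function names (VkFFTConfiguration, initializeVkFFT, VkFFTAppend, deleteVkFFT) follow the samples in the VkFFT repository, but context creation, error handling and stream control for the CUDA backend are simplified here and should be checked against the repository samples:

```cuda
// Minimal single-axis C2C plan with VkFFT's CUDA backend (sketch).
#define VKFFT_BACKEND 1        // 1 selects the CUDA backend
#include "vkFFT.h"

VkFFTResult run_forward_fft(CUdevice* device, void** gpuBuffer, uint64_t n) {
    VkFFTConfiguration configuration = {};
    VkFFTApplication app = {};
    configuration.FFTdim = 1;                 // 1D transform
    configuration.size[0] = n;                // sequence length
    configuration.device = device;            // CUDA device handle
    uint64_t bufferSize = 2ull * sizeof(float) * n; // interleaved complex
    configuration.bufferSize = &bufferSize;
    VkFFTResult res = initializeVkFFT(&app, configuration); // Plan+Code stages
    if (res != VKFFT_SUCCESS) return res;
    VkFFTLaunchParams launchParams = {};
    launchParams.buffer = gpuBuffer;          // device memory with the data
    res = VkFFTAppend(&app, -1, &launchParams); // -1: forward, 1: inverse
    deleteVkFFT(&app);
    return res;
}
```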
B. VkFFT KERNEL GENERATION PLATFORM
VkFFT has a hierarchical structure design: Application -> Plan -> Code. This allows the creation of code optimizations for the target device architecture at runtime. Below, a more detailed description of the VkFFT platform structure is given, starting from the level that is closest to the user application stage.

1) APPLICATION STAGE
At this stage, the VkFFT platform performs all interactions with the user and resource management. The user interaction covers:
1) Parsing the input configuration given by the user, where all FFT system parameters and hints to the generator are given.
2) Application initialization, which consists of calls to the functions from the next Plan stage, corresponding to the provided configuration.
3) Application update, which can be used to update some of the parameters after plan creation calls, for example providing different input/output buffers.
4) Application dispatch, which appends the corresponding plan dispatch command to the user's queue of GPU actions.
5) Application deletion, which frees resources allocated by VkFFT.
6) Binaries caching, which allows reusing code generated with the same input configuration.

2) PLAN STAGE
This is the internal configuration stage that constructs the FFT plan - the intermediate representation of the FFT problem. This can be seen as a preparation and post-handling step to code generation. Overall, the processes included in this stage are:
1) The Plan stage is called by the application. It is provided with a local copy of the user configuration (with some


values being automatically configured, if they were not specified).
2) Decision making - the main algorithm selection step. This step completely defines the structure of the generated kernels, the algorithms used, and the number of threads, registers and shared memory. This step also determines the sequence of primes used by the Stockham FFT algorithm.
3) Resource allocation - LUT, intermediate transposition buffers.
4) Plan initialization, which consists of calls to the functions from the next Code generation stage.
5) Kernel compilation or loading of provided binaries.
6) Plan update handles, called by application update handles.
7) Plan dispatch handles, called by application dispatch handles.
8) Plan deletion, which is also called during application deletion.

3) CODE GENERATION STAGE
This is the lowest stage in the VkFFT platform. Its main goal is to generate a string that will hold GPU code for a particular API, which can later be compiled and used. The code generation stage is further subdivided into three intermediate levels to facilitate the reuse of code:
1) Level 2 kernels - a clear description of the problem via a sequence of calls to lower levels, kernel layout configuration.
2) Level 1 kernels - simple routines: matrix-vector multiplication, FFT, pre- and post-processing, R2C/R2R mappings.
3) Level 0 kernels - memory management, basic math functions inlining, API-dependent definitions.
V. VkFFT ALGORITHMS AND MEMORY TRANSFER OPTIMIZATIONS
The FFT algorithm on most modern GPUs is a heavily memory-bound problem. Due to this fact, the VkFFT library has been designed to have as few global memory to on-chip memory data transfers as possible and is optimized to have efficient shared memory usage. In the best-case scenario, the memory layout should follow these rules:
• No CPU-GPU transfers during execution, except for asynchronous downloads from the GPU.
• Minimize GPU dedicated memory-L2-L1 communication.
• Maximize on-chip memory usage, but avoid register spilling and shared memory bank conflicts, and maintain occupancy - the ratio of the average number of active warps on the compute unit to the maximum number of active warps supported by the compute unit. Low occupancy may result in stalling of either compute or memory resources.

A. VkFFT GLOBAL MEMORY TRANSFERS LAYOUT
One of the main challenges in implementing a GPU version of an FFT library is managing memory coalescing. Strided accesses occur in the multidimensional case when accessing data for axes with non-unit stride, and in the Four-step FFT algorithm when accessing a column set of sequences. The typical approach implemented in many other FFT libraries would be to perform a transposition of a matrix, aligning the FFT sequence to a unit-stride direction. The transposition routine achieves non-strided access by operating on a rectangular kernel, splitting the input multidimensional system into tiles and performing the transposition in shared memory, which does not have a strided access problem if the tile is padded so that every column resides in a different bank. Each of these transpositions, however, requires an additional upload and download of the system from the global memory.

VkFFT incorporates a different way of memory coalescing which does not require a separate transposition and can be used in other algorithms operating on multidimensional data as well. The main idea is to implement a tiled upload approach, similar to the one used in matrix transposition, and group the consecutive sequences spanning along the non-unit stride axis, so that each global memory transaction is filled from neighboring elements with unit stride. To fill a 32-byte data transaction, we have to group at least four sequences in single and two sequences in double precision. This limits the maximum sequence size that can be stored in the shared memory more than the non-strided access pattern does. As the memory uploads are always performed along the unit stride axis, it is guaranteed that reads and writes to any position along the non-unit strided axis will be fully coalesced. This property also allows performing a transposition afterward to return data to the correct order required by the Four-step FFT algorithm. This transposition is merged with the store phase of the last stage of the Four-step algorithm and does not need additional memory transfers. However, it requires an additional memory buffer with the size of the input system, as these writes to global memory are not done in place. An example of a task that does not require a transposition is a convolution performed via the convolution theorem, which will be discussed later.
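The classic tile-in-shared-memory pattern referenced above looks roughly as follows (a generic CUDA sketch of the textbook technique, not VkFFT's generated code); the +1 padding places consecutive rows of a column in different banks:

```cuda
#define TILE 32

// Transpose with coalesced global reads and writes: a 32x32 tile is staged in
// shared memory; the extra column (+1) avoids bank conflicts when the tile is
// read back transposed. Launch with dim3(TILE, TILE) threads per block.
__global__ void transpose_tiled(const float* in, float* out,
                                int width, int height) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;  // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```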
If we assume that the thread block can use 32KB of shared memory (for bigger values the sequence sizes will scale accordingly), then, in the case of unit-strided axis access, up until the sequence size of 4096 single-precision complex floats (exactly 32KB of data) the full sequence can be stored in shared memory, so an FFT can be performed in one upload. For bigger sequences, the Four-step FFT algorithm follows the same logic as above, as it uses a representation of a 1D sequence as a multidimensional matrix, with the one difference that one stage of it has a unit stride. During this stage, if we omit the transposition at the end, FFTs can fill the full 32KB of shared memory, as the memory accesses are non-strided. However, if a transposition is required, it limits this sequence size to 1024 to coalesce column writes after the last FFT step


along the rows. VkFFT currently supports Four-step splits in up to three uploads, so for a system with 32KB of shared memory the maximum sequence that can be done will be 2^30. This length is further limited if the number is decomposable as a multiplication of big primes, like 11^7. In this case, if we split it as 11^7 = 11^3 · 11^2 · 11^2, one of the factors (11^3 = 1331) will be out of the 1024 range. These cases can be handled by coalescing less data or by implementing more-than-three-upload schemes, which will be done in a future release.

One issue that is also worth noting is related to the distant coalesced accesses described before. It occurs in the three-upload scheme during the transposition write part (if we omit the transposition, the performance is restored). VkFFT performs a single transposition in this case, swapping the logical axis of upload 0 and the axis of upload 2 as the last write operation. VkFFT attempts to solve it by coalescing at least 128 bytes of data; however, this only partially helps with the L2-cache additional instructions problem. It is possible that a different transposition pattern is required to solve this issue.

VkFFT allows for multiple buffers for FFT reads/writes, where the FFT system is stored in multiple distinct memory chunks. This is helpful for systems with a limit on the memory allowed per single allocation (usually 4GB) and allows for better reuse of buffers. An example use case of this feature is that VkFFT can reuse multiple small buffers left over after temporary calculations in other parts of the code as a temporary transposition buffer for the Four-step FFT algorithm. It also opens up the possibility of a multi-GPU implementation in the future.
B. VkFFT SHARED MEMORY OPTIMIZATIONS
Efficient use of shared memory is an essential part of having optimal FFT performance. If the global memory transfer count is reduced to the minimum, shared memory bandwidth will be the next limiting factor. Shared memory in VkFFT is used as a communication buffer, where a copy of the FFT system from registers is loaded/stored after each step of the FFT algorithm.

To optimize shared memory usage it is important to avoid bank conflicts. The first stages of the non-strided Stockham DIT algorithm for small sequences often cause problems with this - the writes to shared memory there are also non-strided. VkFFT solves this issue by performing multiple non-strided FFTs with a single thread block and by performing a transposition in shared memory before calculating the FFT. This way, the number of combined sequences will be the lower bound on the number of memory banks used. VkFFT also uses padding (adding free space) between sequences or inside sequences in shared memory if it determines that the access pattern can improve by doing so.

For large sequences that have no occupancy problems, it is beneficial to merge the small radix kernels (like 2, 3 and others) used by the Stockham algorithm. This reduces the possible number of threads per block, increases the number of registers per thread and reduces the number of shared memory communications. VkFFT, at the moment of publication, employs additional kernels for composite lengths of 4, 6, 8, 9, 10, 12, 14, 15, 16 and 32.

During the read/write stages, VkFFT prioritizes optimization for coalesced access patterns to global memory. However, when possible, VkFFT will optimize data transfers directly to/from registers, omitting the shared memory transfers.
it is beneficial to merge small radix kernels (like 2, 3 and direct multiplication of a system FFT and a kernel FFT. This
others) used by the Stockham algorithm. This reduces the allows the performing of convolutions with the same com-
possible number of threads per block, increases the number plexity as FFT, instead of initial O(n2 ). The number of mem-
of registers per thread and reduces the number of shared ory transfers can be reduced in this case if we combine the

VOLUME 11, 2023 12047


D. Tolmachev: VkFFT-A Performant, Cross-Platform and Open-Source GPU FFT Library

last stage of the FFT, the kernel multiplication and the first stage of the inverse FFT in one program, and do not perform the system download from the chip after the FFT, the system upload and download for the multiplication, and the system upload for the inverse FFT - a total of four memory transfers. Each of these transfers is used to transfer the full system at the global memory bandwidth, so it is possible to estimate how much less memory is transferred after this optimization. For example, a 1D batched FFT without the Four-step FFT (one system upload/download per FFT) will drop from three uploads/downloads to only one (plus the kernel memory transfer to the GPU). In the case of a 1D batched FFT with a Four-step FFT, we can omit the transposition step in the algorithm, as the inverse FFT will return data to the original layout. This allows us to not allocate an additional buffer for this transposition and reduces global memory usage by a factor of two. It also allows for better locality and helps to reduce L2 cache serialization, as the last stage of the Four-step algorithm will write output with unit stride.

F. VkFFT ZERO-PADDING OPTIMIZATIONS
The concept of calculating a convolution as a multiplication in the frequency domain (the convolution theorem) has one important property: it yields a circular convolution, which means that it assumes the input data to be periodic. While this holds true for the modeling of periodic systems, it can produce incorrect results on open systems, uninfluenced by anything outside the system. For this specific case, the zero-padding technique is used - we pad the cells along each dimension to double the size. All the extra values are then filled with zeros. While not changing the circular structure of the performed convolution, this trick sets all the influence from the periodic images to zero. The memory layout can be significantly improved in this case. Firstly, as we know which axes are padded with zeros, we can upload only half of them to the GPU and logically set the other half to zero on-chip. Secondly, if we have multidimensional zero-padding, there exist sequences full of zeros, which can be omitted from the calculation altogether, as their FFT is a sequence full of zeros. This allows us to get up to a 2x speed increase for multidimensional FFTs. The decrease in transferred memory also improves all caches' hit rates. VkFFT also allows for frequency zero-padding and a specific selection of the zero-padding range placement in the system. This allows zero-padding high frequencies, which is useful for upscaling applications.

VI. VkFFT BENCHMARKS
To measure the performance of VkFFT, we compare it to Nvidia's cuFFT and AMD's rocFFT libraries, which are vendor solutions developed mainly for HPC use cases. We use Nvidia A100 and AMD MI250 GPUs for the tests, which are the latest publicly released HPC GPUs available at the time of the creation of this paper. Both GPUs were measured to use approximately 250W of power. Nvidia's A100 GPU has 40GB of HBM2 memory and 192KB of shared memory, while AMD MI250's GPU has 64GB of HBM2 memory (we use a single-chip version of it) and 64KB of shared memory [25], [26]. The GPUs use CUDA 11.7 and ROCm 5.2, respectively. We use VkFFT version 1.2.31.

The first test will perform an FFT of multiple batched 1D sequences of all lengths in the 2 to 4096 range, followed by a corresponding inverse FFT. The number of sequences is chosen based on the sequence length to have a constant system size of 512MB-1GB in memory, so all compute units of the GPU are utilized. To mitigate the dispatch call overhead of submitting compute tasks to the GPU and remove random noise, the test performs the FFT and inverse FFT 100 times in a row per dispatch and then averages the time taken. Each test is performed three times to verify the consistency of the results. For memory-bound problems, a good performance representation exists in the form of a theoretical algorithm bandwidth, defined as:

$$\text{Algorithm bandwidth} = \frac{2 \cdot \text{System size [GB]}}{\text{Single transform time [s]}}$$

The peak global memory bandwidth of the A100 and MI250 (obtained by a simple large-data memory copy test) is 1.3TB/s.

A. DOUBLE-PRECISION, ALL SIZES UP TO 4096
The resulting pattern seen in the benchmark plots of Fig. 4 and Fig. 5 will be similar to the ones seen in all other benchmarks. It is closely related to which algorithm is required for a particular FFT length. VkFFT incorporates three main FFT algorithms that cover the whole range of numbers: Stockham autosort, Rader's algorithm (two versions) and Bluestein's algorithm. For cuFFT and rocFFT, the coloring is an educated guess made by analysis of profiling results.

The Stockham autosort algorithm is designated as radix(2-13) on the performance plots (red and black colors). In double precision (DP), all three libraries use it for sequences representable as a multiplication of an arbitrary number of primes up to 13. VkFFT is designed to have as few global memory transfers as possible, so it performs all these sequences in a single upload to the chip, thus achieving close to the peak GPU bandwidth performance for both A100 (Fig. 4) and MI250 (Fig. 5). This is mainly possible due to the fact that VkFFT generates kernels at runtime and does not ship binaries for all sequences. cuFFT, on the other hand, provides only precompiled PTX files and switches to multiple uploads for some of the sequences, which can be detected by some black triangles way below the 1.3TB/s peak bandwidth. rocFFT is developing a runtime kernel generator as well; however, there are still many sequences that fall back to a less optimized implementation, which can be seen by the abundance of black triangle results below 300GB/s (Fig. 5).

For sequences that are decomposable as a multiplication of primes up to 4096 (if the P−1 FFT can be done with the Stockham autosort algorithm), VkFFT uses the FFT-convolution version of Rader's algorithm, shown as cyan circles. Each Rader-FFT algorithm prime requires 2x more shared memory transfers than radix 2-13 kernels, plus an additional global memory
12048 VOLUME 11, 2023


D. Tolmachev: VkFFT-A Performant, Cross-Platform and Open-Source GPU FFT Library

FIGURE 4. Benchmark of the Nvidia A100 GPU with VkFFT (circles) and cuFFT (triangles) in batched 1D double-precision FFT+IFFT computations. Different colors represent the different algorithms used for a particular sequence size: red and black are radix decomposition, magenta and green are Bluestein's algorithm, cyan and orange are Rader's algorithm.

FIGURE 5. Benchmark of the AMD MI250 GPU with VkFFT (circles) and rocFFT (triangles) in batched 1D double-precision FFT+IFFT computations. Different colors represent the different algorithms used for a particular sequence size: red and black are radix decomposition, magenta and green are Bluestein's algorithm, cyan is Rader's algorithm.

LUT access for Rader's kernel (which becomes noticeable for big primes). For some small primes - 47, 59 and 83 - which have non-radix-2-13 primes in their Rader decomposition (47 − 1 = 46, which is divisible by 23; 23 is called a Sophie Germain prime and 47 is called a safe prime), it is still beneficial to use the direct multiplication version of Rader's algorithm. However, its performance is inferior to Bluestein's algorithm for primes after 83 due to its high number of shared memory data transfers - this algorithm is shared memory bandwidth-limited. The cuFFT library only has the direct


FIGURE 6. Benchmark of Nvidia A100 GPU with VkFFT (circles) and cuFFT (triangles) in batched 1D single-precision FFT+IFFT computations. Different colors represent different algorithms used for this particular sequence size: red and black are radix decomposition, magenta and green are Bluestein's algorithm, cyan is Rader's algorithm.
The cuFFT library only has the direct multiplication version of Rader's algorithm, implemented for primes up to 127, shown as orange triangles. This library also often does not merge the prime FFT codes into a single upload kernel - this can be deduced from the horizontal orange lines aggregating at approximately 650GB/s, 400GB/s and later 300GB/s - exactly 1:2, 1:3 and 1:4 of the peak bandwidth of 1.3TB/s. rocFFT does not have Rader's algorithm implemented (Fig. 5).

For all other sequences divisible by Sophie Germain safe primes (or numbers non-decomposable by the previous two algorithms) in VkFFT, for primes after 128 in cuFFT and for primes after 17 in rocFFT (approximately), Bluestein's algorithm is used. It is shown as magenta circles for VkFFT and green triangles for the other libraries in Fig. 4 and Fig. 5. In VkFFT, the sequence length of at least 2N − 1 to pad to has to be chosen as a balance between a highly decomposable size (like a power of 2) and the amount of memory transferred. For example, 1031 is a prime number non-decomposable with Rader's algorithm (1030 is divisible by 103, which is not a radix 2-13 prime), so we have to pad it to at least 2061. However, the next power of 2 is 4096, which is almost another 2x increase in size. Extensive testing of all radix 2-13 decomposable numbers from 2061 to 4096 shows that the best performance is achieved by padding to 2187 = 3^7. This prior testing, the convolution support in VkFFT (convolutions are optimized to have as few memory transfers as possible) and the zero-padding optimizations make VkFFT's Bluestein's algorithm faster than both the cuFFT and rocFFT implementations. It is also worth noting that the 64KB of shared memory in MI250 can only support FP64 complex sequences up to 4096, so, taking Bluestein's padding into account, all sequences that have to pad to a number bigger than 4096 have to be done in two uploads in VkFFT - this can be seen as a drop in performance after 2048 for the magenta circles.

B. SINGLE-PRECISION, ALL SIZES UP TO 4096
In single precision (Fig. 6 and Fig. 7), VkFFT uses the same algorithm configuration as in double precision (radix+Rader+Bluestein). The main difference comes from the usage of the GPU's special function units for the calculation of sines and cosines - so less memory is transferred for bigger sequences. For MI250 (Fig. 7), all sequences from the range (including Bluestein's padded ones) can fit into 64KB of shared memory and are done in a single upload - so there is no performance drop at 2048 (it happens after 4096). VkFFT also uses the packed math instructions available in MI250 - this GPU can perform two FP32 instructions at the same time. For some reason, cuFFT does not use Rader's algorithm in single precision and switches to Bluestein's algorithm for primes after 17 (Fig. 6). Rader's algorithm implementation in VkFFT works just as well in FP32 as in FP64.

C. DOUBLE-PRECISION, RADIX 2-7, ALL SIZES UP TO 2^30
To test the performance of VkFFT on the whole supported range of transforms, we select all sequences decomposable as a multiplication of an arbitrary number of 2s, 3s, 5s and 7s (Fig. 8 and Fig. 9). The main consideration behind this choice is that these are the most commonly used sizes and that testing all 2^30 numbers is unfeasible for a developing library.
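The padding-size search described above can be illustrated with a short sketch (this shows the selection space, not the VkFFT heuristic itself): it enumerates the sizes of at least 2N − 1 that factor entirely into radix 2-13 primes; per the text, the final choice (2187 for N = 1031) comes from benchmarking such candidates rather than from simply taking the smallest one:

    #include <stdio.h>

    /* True if n factors completely into the radix primes 2..13. */
    static int is_radix_2_13(long long n) {
        const int primes[] = {2, 3, 5, 7, 11, 13};
        for (int i = 0; i < 6; i++)
            while (n % primes[i] == 0) n /= primes[i];
        return n == 1;
    }

    int main(void) {
        const long long N = 1031;    /* prime, not Rader-decomposable */
        long long c = 2 * N - 1;     /* Bluestein needs at least 2N-1 = 2061 */
        for (int shown = 0; shown < 8; c++)
            if (is_radix_2_13(c)) { printf("candidate: %lld\n", c); shown++; }
        return 0;
    }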
FIGURE 7. Benchmark of AMD MI250 GPU with VkFFT (circles) and rocFFT (triangles) in batched 1D single-precision FFT+IFFT computations. Different colors represent different algorithms used for this particular sequence size: red and black are radix decomposition, magenta and green are Bluestein's algorithm, cyan is Rader's algorithm.
FIGURE 8. Benchmark of Nvidia A100 GPU with VkFFT (red circles) and cuFFT (black triangles) in batched 1D double-precision FFT+IFFT computations for big sequence lengths decomposable as a multiplication of primes up to 7. The FFT length is on a logarithmic scale.
For big sequences, all uploads have to be coalesced - the sequence is represented as a 2D or 3D array. Assuming a 32-byte cache line length, this means that we have to upload at least two 16-byte complex numbers at a time. For the non-strided axis (the last one) there exist two options: if we need to restore the correct FFT layout, we have to perform the coalesced transposition there. This means that it is essentially also treated as a strided axis, as the final write will be strided.
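The coalescing requirement above is simple arithmetic over the transaction size; a tiny sketch (the 32-byte and 128-byte granularities are the ones that come up in this section):

    #include <stdio.h>

    /* How many sequences must be laid out side by side so that one memory
       transaction of the given granularity is fully used. */
    int main(void) {
        const int granularity[] = {32, 128};            /* bytes */
        const int fp64_complex = 16, fp32_complex = 8;  /* bytes per element */
        for (int i = 0; i < 2; i++)
            printf("%3d-byte transactions: %d FP64 or %d FP32 complex sequences\n",
                   granularity[i], granularity[i] / fp64_complex,
                   granularity[i] / fp32_complex);
        return 0;
    }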
FIGURE 9. Benchmark of AMD MI250 GPU with VkFFT (red circles) and rocFFT (black triangles) in batched 1D double-precision FFT+IFFT computations for big sequence lengths decomposable as a multiplication of primes up to 7. The FFT length is on a logarithmic scale.
However, if we do not have to restore the correct layout (for example, in a convolution calculation the inverse FFT will restore the correct layout anyway), the transposition can be omitted and we can process bigger sequences in fewer memory transfers. This is used in Bluestein's FFT implementation of VkFFT for big sequences.

The switch to a two-upload scheme happens at approximately 10500 for A100 (Fig. 8) and at 4096 for MI250 (Fig. 9) due to the shared memory limitations. The switch to the three-upload scheme happens at 2^22 for A100 and at 2^19 for MI250. The four-step algorithm requires a single twiddle factor multiplication for each of the uploads except the first. For GPUs with a high double-precision core count, it turns out to be beneficial to use a slow polynomial approximation of sines and cosines and pay the price of additional FMA instructions rather than transfer additional data from the LUT. For consumer-level Nvidia GPUs, this is not the case, so VkFFT has both options available.

For the three-upload scheme, multiple other problems have been encountered, all closely related to the final transposition (they are not present if the transposition to restore the layout is not required). The first issue is related to the size of the systems becoming quite big (1GB and more): the L2 cache is often not able to combine four 32-byte transactions to fill a 128-byte cache line. This can happen if the final global memory requests target different memory banks. So, VkFFT coalesces to 128 bytes in the three-upload scheme, if possible. The second issue is also related to L2 cache performance: if the stride is odd, the access pattern cannot be aligned to 128-byte granularity, so the L2 cache has to serve twice the number of memory requests, thus limiting its maximum bandwidth by a factor of up to 2x. A possible solution is to reimplement the single transposition as one with multiple intermediate transpositions, but this has not been tested in VkFFT yet. The last issue mainly concerns AMD GPUs: for coalesced accesses with big strides, they can experience memory pin serialization, where all memory transactions to global memory happen through a single memory pin. This issue has been known since the Polaris generation of GPUs [29] and there is no simple workaround for it.

D. SINGLE-PRECISION, RADIX 2-7, ALL SIZES UP TO 2^30
Single-precision results (Fig. 10 and Fig. 11) resemble the double-precision results, as both A100 (Fig. 10) and MI250 (Fig. 11) have a high FP32 to FP64 core ratio. The main difference is that coalescing to 32 bytes requires four sequences in FP32 instead of two sequences in FP64. This moves the switch from a single upload to multiple uploads from 10500 to 21000 for A100 and from 4096 to 8192 for MI250. All the issues related to the final transposition are present here as well.

E. SINGLE-PRECISION, MULTIDIMENSIONAL PERFORMANCE, CUBES WITH ALL SIZES UP TO 1024
This test (Fig. 12 and Fig. 13) evaluates VkFFT performance on a wide range of multidimensional systems, from small ones that are not able to saturate a single compute unit (like 2^3) to excessively big ones filling the full memory of a single GPU (like 1024^3). The bandwidth is multiplied by 3 compared to the 1D cases, as it is expected that all libraries use separate uploads for each of the axes. This way it is easier to compare the results to the peak theoretical bandwidth of the GPU. Red circles correspond to VkFFT results and black triangles to competitor library results. The performance plots also show the zero-padding performance (cyan circles), which will be described in the next subsection.
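To make the four-step decomposition behind these multi-upload schemes concrete, here is a minimal host-side sketch; naive O(n^2) DFTs stand in for the generated GPU kernels, and the layout-restoring transposition is written out explicitly as the last step (illustrative only):

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    /* Four-step FFT for N = N1*N2: (1) length-N1 DFTs over columns,
       (2) twiddle multiplication by exp(-2*pi*i*k1*j2/N), (3) length-N2 DFTs
       over rows, (4) transposition to restore the natural output order. */
    static void dft(double complex *x, int n, int stride) {
        double complex tmp[64];
        for (int k = 0; k < n; k++) {
            double complex s = 0;
            for (int j = 0; j < n; j++)
                s += x[j * stride] * cexp(-2.0 * I * M_PI * j * k / n);
            tmp[k] = s;
        }
        for (int k = 0; k < n; k++) x[k * stride] = tmp[k];
    }

    int main(void) {
        enum { N1 = 4, N2 = 8, N = N1 * N2 };
        double complex x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = cos(2.0 * M_PI * 3.0 * i / N);

        /* View x as an N1 x N2 row-major matrix: x[j1*N2 + j2]. */
        for (int j2 = 0; j2 < N2; j2++) dft(&x[j2], N1, N2);        /* step 1 */
        for (int k1 = 0; k1 < N1; k1++)                             /* step 2 */
            for (int j2 = 0; j2 < N2; j2++)
                x[k1 * N2 + j2] *= cexp(-2.0 * I * M_PI * k1 * j2 / N);
        for (int k1 = 0; k1 < N1; k1++) dft(&x[k1 * N2], N2, 1);    /* step 3 */
        for (int k1 = 0; k1 < N1; k1++)                             /* step 4 */
            for (int k2 = 0; k2 < N2; k2++)
                y[k2 * N1 + k1] = x[k1 * N2 + k2];

        for (int k = 0; k < N; k++)              /* cosine input: spikes at k = 3, 29 */
            printf("X[%2d] = %6.2f %+6.2fi\n", k, creal(y[k]), cimag(y[k]));
        return 0;
    }

Only the steps after the first carry the extra twiddle multiplication, which is where the LUT-versus-polynomial trade-off described above comes in.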
FIGURE 10. Benchmark of Nvidia A100 GPU with VkFFT (red circles) and cuFFT (black triangles) in batched 1D single-precision FFT+IFFT computations for big sequence lengths decomposable as a multiplication of primes up to 7. The FFT length is on a logarithmic scale.
FIGURE 11. Benchmark of AMD MI250 GPU with VkFFT (red circles) and rocFFT (black triangles) in batched 1D single-precision FFT+IFFT computations for big sequence lengths decomposable as a multiplication of primes up to 7. The FFT length is on a logarithmic scale.
and Bluestein’s algorithms in the multidimensional case as units are idle or have not enough workload. A100 and MI250
well and all performance considerations from the previous performance in this range is mostly limited by the dispatch
sections also hold true to this case. latencies.
System sizes taking up to 256KB (cubes up to 25 ) experi- System sizes of 512KB-8MB (cubes up to 27 ) experience
ence a high L1 hit rate but have the lowest amount of available a high L2 hit rate and have a large number of available warps,
warps, which results in low GPU utilization as many compute active compute units and occupancy compared to smaller
FIGURE 12. Benchmark of Nvidia A100 GPU with VkFFT (red circles) and cuFFT (black triangles) in 3D single-precision FFT+IFFT cube computations. This plot also shows the zero-padding optimization performance of VkFFT (cyan circles).
FIGURE 13. Benchmark of AMD MI250 GPU with VkFFT (red circles) and rocFFT (black triangles) in 3D single-precision FFT+IFFT cube computations. This plot also shows the zero-padding optimization performance of VkFFT (cyan circles).
The nature of the test (it performs multiple FFTs and inverse FFTs) utilizes this high L2 hit rate, resulting in higher than peak global memory bandwidth. The L2 cache of the A100 is much faster than the L2 cache of the MI250, achieving 2.4TB/s on a 128^3 system.

System sizes bigger than the L2 cache (cubes up to 2^10) have a performance determined mainly by the global memory bandwidth. The memory pin serialization of AMD GPUs is most noticeable here for the sizes divisible by high powers of 2.
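These size classes follow directly from the single-precision complex footprint of a cube; a small sketch (the cache capacities in the comment are assumptions based on vendor documentation, not measured here):

    #include <stdio.h>

    /* FP32 complex footprint of an N^3 cube, to relate cube sizes to the
       cache levels discussed above (assumed: 40MB L2 on A100, 8MB L2 per
       die on MI250). */
    int main(void) {
        for (int n = 8; n <= 1024; n *= 2) {
            double bytes = (double)n * n * n * 8.0; /* 8 bytes per FP32 complex */
            printf("%5d^3 cube: %10.2f MB\n", n, bytes / (1024.0 * 1024.0));
        }
        return 0;
    }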
FIGURE 14. Benchmark of DCT-I with VkFFT (red circles for AMD MI250, green triangles for Nvidia A100) compared to FFTW (black squares for AMD EPYC 7742) in batched 1D double-precision forward+inverse transform computations.
FIGURE 15. Benchmark of DCT-II/III with VkFFT (red circles for AMD MI250, green triangles for Nvidia A100) compared to FFTW (black squares for AMD EPYC 7742) in batched 1D double-precision forward+inverse transform computations.
F. ZERO-PADDING PERFORMANCE
Fig. 12 and Fig. 13 also demonstrate the zero-padding optimization performance in the multidimensional case (cyan circles). In the 3D case, the reduced memory transfers and the increased cache hit rate of zero-padding result in a 100% performance increase (on system sizes not affected by low occupancy, as the additional branch instructions executed there can make the code slower). The performance increase is most noticeable on system sizes bigger than the L2 cache, which are mainly determined by the global memory bandwidth.

G. DISCRETE COSINE TRANSFORMS (R2R) PERFORMANCE
The performance plots of Fig. 14-16 present double-precision performance results of DCTs of types I-IV. The test configuration is the same as for the C2C case in double precision.
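As an illustration of how the zero-padding evaluated in subsection F is enabled, below is a configuration sketch. It assumes the plan-creation fields documented in recent VkFFT releases (performZeropadding, fft_zeropad_left, fft_zeropad_right); the backend-specific setup of the device, queues and buffers is omitted, so treat this as a sketch rather than a complete program:

    #include "vkFFT.h"

    /* Sketch (assumed field names from the VkFFT documentation): configure a
       3D plan that treats the upper half of every axis as zeros, so those
       sequences are never read or computed. */
    void configure_zeropadding(VkFFTConfiguration *configuration) {
        configuration->FFTdim = 3;
        configuration->size[0] = 256;
        configuration->size[1] = 256;
        configuration->size[2] = 256;
        for (int i = 0; i < 3; i++) {
            configuration->performZeropadding[i] = 1;
            /* Zeros occupy [size/2, size): only the lower half is transferred. */
            configuration->fft_zeropad_left[i] = configuration->size[i] / 2;
            configuration->fft_zeropad_right[i] = configuration->size[i];
        }
        /* A frequencyZeroPadding flag selects zero-padding in frequency space
           instead (useful, e.g., for upscaling), per the library documentation. */
    }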
FIGURE 16. Benchmark of DCT-IV with VkFFT (red circles for AMD MI250, green triangles for Nvidia A100) compared to FFTW (black squares for AMD EPYC 7742) in batched 1D double-precision forward+inverse transform computations.
FIGURE 17. FP64 precision comparison of VkFFT (black squares for AMD MI250, cyan triangles for Nvidia A100), cuFFT (green diamonds for Nvidia A100) and rocFFT (red circles for AMD MI250) against FP128 version of FFTW for a single forward FFT [17].
We compare the performance of the AMD EPYC 7742 (64-core) CPU running multithreaded FFTW against the Nvidia A100 and AMD MI250 GPUs running VkFFT, as no other GPU vendor library supports DCTs [17]. The high bandwidth of GPU memory allows VkFFT to greatly outperform the CPU implementation in FFTW.

VII. VkFFT PRECISION
VkFFT precision is verified by comparing its results with the FP128 version of FFTW. We test all FFT lengths from the [2, 100000] range. We perform the tests in double (Fig. 17) and single (Fig. 18) precision on random input data from the [−1; 1] range.
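A minimal sketch of the error metric assumed in this comparison (a standard relative L2 norm against the high-precision reference; the exact definition used for the plots may differ):

    #include <math.h>
    #include <stdio.h>

    /* Relative L2 error of a tested transform against a reference result
       (plain double arrays here; real and imaginary parts can be interleaved). */
    static double rel_l2_error(const double *test, const double *ref, int n) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            double d = test[i] - ref[i];
            num += d * d;
            den += ref[i] * ref[i];
        }
        return sqrt(num / den);
    }

    int main(void) {
        const double ref[4]  = {1.0, 0.5, -0.25, 0.125};
        const double test[4] = {1.0 + 1e-16, 0.5, -0.25, 0.125 - 1e-16};
        printf("relative L2 error: %g\n", rel_l2_error(test, ref, 4));
        return 0;
    }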
FIGURE 18. FP32 precision comparison of VkFFT (black squares for AMD MI250, cyan triangles for Nvidia A100), cuFFT (green diamonds for Nvidia A100) and rocFFT (red circles for AMD MI250) against FP128 version of FFTW for a single forward FFT.
The test results are compared to the corresponding results of Nvidia's cuFFT and AMD's rocFFT libraries.

For both precisions, all tested libraries exhibit logarithmic error scaling. The main source of error is the imprecise twiddle factor computation - the sines and cosines used by the FFT algorithms. For FP64 (Fig. 17), they are calculated on the CPU either in FP128 or in FP64 and stored in look-up tables. With FP128 precomputation, VkFFT is more precise than cuFFT and rocFFT - its L2 error stays below 10^-15 on the full test range.

For FP32 (Fig. 18), twiddle factors can be calculated on the fly in FP32 or precomputed in FP64/FP32. With FP32 twiddle factors, VkFFT is slightly less precise in Bluestein's and Rader's algorithms. If needed, this can be solved with FP64 precomputation.

VIII. CONCLUSION
This paper presents VkFFT - a cross-platform, open-source and performant FFT library for GPU accelerators. Several GPU memory management and optimization techniques are described in detail. A number of novel solutions to the problems limiting GPU performance are presented and tested in this paper. By applying the described memory layouts and the developed solutions to the FFT problem, VkFFT is able to show better performance than already established GPU FFT libraries, while being scalable and tunable to a much wider range of the GPUs present on the market. Full open-source access to the source code of VkFFT allows for the implementation of user-specific optimizations, as demonstrated by the example of native zero-padding, in which case the FFT experiences up to a 2x speed increase (Fig. 12 and Fig. 13). The results of this paper also provide a platform for multiple-API, cross-platform GPU code generation.

ACKNOWLEDGMENT
The author would like to thank the Swiss National Supercomputing Centre (CSCS) for the provision of computing resources under allocation s1111.

REFERENCES
[1] J. W. Cooley, P. A. W. Lewis, and P. D. Welch, "The fast Fourier transform and its applications," IEEE Trans. Educ., vol. E-12, no. 1, pp. 27–34, Mar. 1969, doi: 10.1109/TE.1969.4320436.
[2] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, no. 90, pp. 297–301, Apr. 1965.
[3] B. Lloyd, C. Boyd, and N. Govindaraju, "Fast computation of general Fourier transforms on GPUs," Microsoft Corp., Redmond, WA, USA, Tech. Rep. MSR-TR-2008-62, Apr. 2008. [Online]. Available: https://www.microsoft.com/en-us/research/publication/fast-computation-of-general-fourier-transforms-on-gpus/
[4] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?" Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008, doi: 10.1145/1365490.1365500.
[5] NVIDIA. (2020). cuFFT Library. Accessed: Dec. 14, 2022. [Online]. Available: https://docs.nvidia.com/cuda/cufft
[6] AMD. (2016). HIP: C++ Heterogeneous-Compute Interface for Portability. Accessed: Dec. 14, 2022. [Online]. Available: https://github.com/ROCm-Developer-Tools/HIP
[7] Khronos Group. (2009). OpenCL Overview. Accessed: Dec. 14, 2022. [Online]. Available: https://www.khronos.org/opencl/
[8] Intel. (2016). The oneAPI Level Zero API. Accessed: Dec. 14, 2022. [Online]. Available: https://github.com/oneapi-src/level-zero
[9] Apple. (2014). Metal API. Accessed: Dec. 14, 2022. [Online]. Available: https://developer.apple.com/metal/
[10] Khronos Group. (2016). Vulkan Overview. Accessed: Dec. 14, 2022. [Online]. Available: https://www.khronos.org/vulkan
[11] Khronos Group. (2019). The Industry Open Standard Intermediate Language for Parallel Compute and Graphics. Accessed: Dec. 14, 2022. [Online]. Available: https://www.khronos.org/spir
[12] W. Cochran, J. Cooley, D. Favin, H. Helms, R. Kaenel, W. Lang, G. Maling, D. Nelson, C. Rader, and P. Welch, "What is the fast Fourier transform?" Proc. IEEE, vol. 55, no. 10, pp. 1664–1674, Oct. 1967, doi: 10.1109/PROC.1967.5957.
[13] D. H. Bailey, "FFTs in external or hierarchical memory," in Proc. ACM/IEEE Conf. Supercomputing, Aug. 1989, pp. 234–242, doi: 10.1145/76263.76288.
[14] C. M. Rader, "Discrete Fourier transforms when the number of data samples is prime," Proc. IEEE, vol. 56, no. 6, pp. 1107–1108, Jun. 1968, doi: 10.1109/PROC.1968.6477.
[15] L. Bluestein, "A linear filtering approach to the computation of discrete Fourier transform," IEEE Trans. Audio Electroacoust., vol. AU-18, no. 4, pp. 451–455, Dec. 1970, doi: 10.1109/TAU.1970.1162132.
[16] H. Sorensen, D. Jones, M. Heideman, and C. Burrus, "Real-valued fast Fourier transform algorithms," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, no. 6, pp. 849–863, Jun. 1987, doi: 10.1109/TASSP.1987.1165220.
[17] M. Frigo and S. Johnson, "The design and implementation of FFTW3," Proc. IEEE, vol. 93, no. 2, pp. 216–231, Feb. 2005, doi: 10.1109/JPROC.2004.840301.
[18] J. Makhoul, "A fast cosine transform in one and two dimensions," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, no. 1, pp. 27–34, Feb. 1980, doi: 10.1109/TASSP.1980.1163351.
[19] Z. Wang, "On computing the discrete Fourier and cosine transforms," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33, no. 5, pp. 1341–1344, Oct. 1985, doi: 10.1109/TASSP.1985.1164710.
[20] S.-S. Chan and K.-K. Ho, "Fast algorithms for computing the discrete cosine transform," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 3, pp. 185–190, Mar. 1992, doi: 10.1109/82.127302.
[21] F. Sanglard. (2020). A History of Nvidia Stream Multiprocessor. Accessed: Jul. 2, 2020. [Online]. Available: https://www.fabiensanglard.net/cuda/index.html
[22] X. Mei and X. Chu, "Dissecting GPU memory hierarchy through microbenchmarking," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 1, pp. 72–86, Jan. 2017, doi: 10.1109/TPDS.2016.2549523.
[23] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, "Dissecting the NVIDIA Volta GPU architecture via microbenchmarking," 2018, arXiv:1804.06826.
[24] J. Liu, E. Lindholm, M. Y. Siu, B. W. Coon, and S. F. Oberman, "Operand collector architecture," U.S. Patent 7 834 881 B2, Nov. 16, 2010.
[25] NVIDIA. (2020). NVIDIA Ampere GA102 GPU Architecture Whitepaper. Accessed: Dec. 14, 2022. [Online]. Available: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
[26] AMD. (2021). AMD CDNA 2 Architecture Whitepaper. Accessed: Dec. 14, 2022. [Online]. Available: https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf
[27] AMD. (2019). AMD RDNA Architecture Whitepaper. Accessed: Dec. 14, 2022. [Online]. Available: https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
[28] Intel. (2015). The Compute Architecture of Intel Processor Graphics Gen9. Accessed: Dec. 14, 2022. [Online]. Available: https://www.intel.com/content/dam/develop/external/us/en/documents/the-compute-architecture-of-intel-processor-graphics-gen9-v1d0.pdf
[29] AMD. (2012). OpenCL Performance and Optimization. Accessed: Dec. 14, 2022. [Online]. Available: https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencloptimization.html

DMITRII TOLMACHEV (Graduate Student Member, IEEE) was born in Yekaterinburg, Russia, in 1996. He received the B.S. degree in applied mathematics and physics from the Moscow Institute of Physics and Technology, Moscow, Russia, in 2018, and the M.S. degree in simulation sciences from Rheinisch-Westfälische Technische Hochschule Aachen, Germany, in 2020. He is currently pursuing the Ph.D. degree with the Institute of Geophysics, ETH Zürich, Switzerland. His research interest includes parallel programming in scientific applications.