
Received 15 December 2022, accepted 30 January 2023, date of publication 3 February 2023, date of current version 8 February 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3242240

VkFFT - A Performant, Cross-Platform and Open-Source GPU FFT Library

DMITRII TOLMACHEV, (Graduate Student Member, IEEE)
Institute of Geophysics, ETH Zürich, 8092 Zürich, Switzerland
e-mail: [email protected]
This work was supported in part by the ETH Research Commission under Grant ETH-03 21-1, in part by the European Research Council
through the Horizon 2020 Programme under Grant 833848-UEMHP (PI A. Jackson), and in part by the Platform for Advanced Scientific
Computing (PASC) initiative of ETH Zürich through the Award of Project AQUA-D (PI A. Jackson).

ABSTRACT The Fast Fourier Transform is an essential algorithm of modern computational science. The highly parallel structure of the FFT allows for its efficient implementation on graphics processing units (GPUs), which are now widely used for general-purpose computing. This paper presents VkFFT, an efficient GPU-accelerated multidimensional Fast Fourier Transform library for Vulkan/CUDA/HIP/OpenCL/Level Zero and Metal projects. VkFFT aims to provide the community with a cross-platform open-source alternative to vendor-specific solutions while achieving comparable or better performance. This paper also discusses the optimizations implemented in VkFFT and verifies its performance and precision on modern high-performance computing GPUs. VkFFT is released under an MIT license.

INDEX TERMS FFT, GPU, parallel computing, Vulkan, CUDA, HIP, OpenCL, Level Zero, Metal.

The associate editor coordinating the review of this manuscript and approving it for publication was Tomas F. Pena.

I. INTRODUCTION
The recent advances in graphics processing units (GPUs) in terms of their computational power, data transfer speed and available memory size make them extremely appealing for general-purpose use, expanding their original rendering and visualization target tasks. The single instruction - multiple threads (SIMT) design approach used in GPUs has proven extremely effective in data-level parallel tasks. In this approach, threads are grouped in warps, which are then operated by a scheduler unit so that all threads from a warp perform the same operation on thread-local registers. A common number of threads in a warp is 32 or 64 for modern GPU architectures [1].

One such highly parallel task is a multidimensional Fast Fourier Transform (FFT) [2]. The FFT algorithm is widely used in signal-processing applications to extract the frequency properties of a given input signal. Another major use case of the FFT is to calculate big convolutions in convolutional neural networks (CNNs) and image analysis workloads using the convolution theorem. The FFT is also used as a base algorithm for all other polynomial transforms, like the Discrete Cosine Transform (DCT) or the Spherical Harmonics Transform (SHT). The major speedup is achieved through the reduction of the complexity from O(N^2) to O(N log N), which results in a substantial decrease in computational effort [3].

There are multiple application programming interfaces (APIs) available for GPU computing and corresponding FFT implementations. The most known and used for GPU compute tasks, the Nvidia Compute Unified Device Architecture (CUDA) platform, has been in development for over a decade [4]. It has a sophisticated ecosystem and is refined to achieve the best results on Nvidia GPUs. Nvidia's Math library collection has an FFT implementation, cuFFT, which has high performance and is widely used for building commercial and research applications [5]. However, the Nvidia CUDA platform can only be used on Nvidia GPUs, which limits the portability of the code. Another drawback of using Nvidia's Math library collection is that it is closed-source and cannot be optimized at the code level for specific tasks.

AMD has its own API, the Heterogeneous Interface for Portability (HIP), and a related software stack, called Radeon Open Compute (ROCm), which is modeled to be as close to CUDA


as possible, so developers can port their code more easily [6]. However, it is less mature than CUDA, resulting in worse functionality and performance. For example, their GPU FFT library, rocFFT, is significantly slower than cuFFT (on a similar level of HPC GPUs).

For open-source and cross-platform computations, the Khronos Group maintains the OpenCL framework (originally developed by Apple), which aims to unify programming on heterogeneous platforms, including GPUs, CPUs and FPGAs [7]. It also has its own FFT implementation, clFFT, which stopped development in 2016. OpenCL is undermined by inferior driver support, as all the vendors are mainly interested in the development of their own APIs.

Intel has also released the Level Zero API for their upcoming GPUs and is also developing a GPU version of their CPU Math Kernel Library, oneMKL, aimed at their GPUs [8]. Apple is developing its own proprietary API called Metal for its operating systems that target GPUs, like Apple's M-series systems on chip (SoC) [9]. Apple's Accelerate Framework at the moment of writing this paper only had a CPU implementation of the FFT algorithm, with no GPU support plans announced.

The Vulkan API is a low-overhead, cross-platform 3D graphics and computing API, initially released in 2016 with major updates in 2018 and 2020 [10], [11]. Vulkan allows for better and lower-level control of the GPU than its predecessor, OpenGL. Vulkan is backed by the big gaming industry (unlike OpenCL), which results in good driver support from all vendors - Vulkan code can be launched on Nvidia, AMD, Intel and mobile GPUs (and on a compatibility layer). Vulkan is released as open-source, so everybody can contribute to its development. However, a good math library collection in Vulkan has not yet been created.

This paper proposes an implementation of the FFT algorithms, VkFFT, that can work with all the aforementioned programming interfaces. The VkFFT source code can be found on GitHub: https://github.com/DTolm/VkFFT.

The paper structure is as follows. Section II introduces background knowledge on the Discrete Fourier Transform and the algorithms implemented in VkFFT. Section III describes the GPU architecture and what needs to be considered when designing performant GPU libraries. Section IV contains the VkFFT structure details and describes a novel approach to GPU runtime kernel optimization. Section V is dedicated to algorithms and optimizations implemented in VkFFT. Section VI gives the benchmark results and performance analysis of VkFFT compared to cuFFT and rocFFT. Section VII contains the precision verification details of VkFFT, also comparing it to cuFFT and rocFFT. Section VIII is dedicated to the paper's conclusion.

II. FAST FOURIER TRANSFORM ALGORITHMS OVERVIEW
A. DISCRETE FOURIER TRANSFORMS AND COOLEY-TUKEY ALGORITHM
The Fourier transform of a sequence is called the Discrete Fourier Transform (DFT). It is defined by the following formula:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N}nk} = \mathrm{DFT}_N(x_n, k),$$

where x_n is the input sequence, N is the length of the input sequence and k ∈ [0, N − 1], k ∈ Z is the output index, corresponding to the frequency in Fourier space. Correspondingly, the inverse DFT is defined as:

$$x_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k e^{+\frac{2\pi i}{N}nk} = \mathrm{IDFT}_N(X_k, n).$$

The fastest known method of obtaining the DFT of a sequence of arbitrary dimensionality is the Fast Fourier Transform algorithm. The most commonly used FFT strategy is called the Cooley-Tukey algorithm [2]. It is based on a divide-and-conquer strategy that breaks the DFT into a sequence of smaller-size DFTs. The idea behind it is to reformulate the DFT of a sequence with composite size N = N1 · N2 as a combination of smaller-sized DFTs:
1) Perform a digit-reversal permutation to reorder data. If performed before stage one, the scheme is called Decimation in Time (DIT); if after all stages, Decimation in Frequency (DIF).
2) Perform N1 DFTs of size N2. Here we assume N1 to be smaller than N2.
3) Perform O(N) multiplications by twiddle factors - complex roots of unity defined by the radix. This stage is needed to combine the results of the smaller DFTs, so we can get the DFT of the input sequence. It can also be decomposed and merged with stage four.
4) Perform N2 DFTs of size N1. Here N1 is a small factor and is designated as the radix of the current transformation.

The N1-FFT stages and a twiddle multiplication require a small number of arithmetic operations (if N1 is sufficiently small, we can write the DFT algorithm explicitly) and can be performed a large number of times (if the size of the input sequence N is large). These two stages will be called a step of the FFT in this paper. Applying these steps recursively to the N2-DFT and further will result in the decomposition of the size-N DFT as a combination of smaller DFTs. For the simplest scenario, where N is a power of two and N1 = 2, this decomposition will result in the final algorithm containing only N/2 length-2 DFTs (called radix-2 butterflies) and twiddle multiplications performed log2(N) times, plus a single digit-reversal permutation to reorder data. This results in the total complexity of O(N log N). A similar approach can be applied to other radix sizes and their combinations. To compute an inverse FFT, we have to calculate twiddle factors with negative complex roots of unity and normalize the result; otherwise the algorithm remains unchanged. Multidimensional FFTs can be performed for each axis separately as a set of 1D transforms.
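To make the four stages concrete, the scheme for N1 = 2 (DIT) can be condensed into a few lines of reference code. This is an illustrative host-side sketch in CUDA-compilable C++, not VkFFT's implementation, which generates unrolled GPU kernels instead:

```cuda
// Minimal radix-2 decimation-in-time Cooley-Tukey recursion (host-side,
// illustrative only). Real GPU FFTs batch the butterflies across threads and
// precompute the twiddle factors, as described in the text.
#include <complex>
#include <cmath>

static void fft_radix2(const std::complex<double>* x, int N, int stride,
                       std::complex<double>* out) {
    if (N == 1) { out[0] = x[0]; return; }
    // Stage "N1 DFTs of size N2" with N1 = 2: even and odd subsequences.
    fft_radix2(x, N / 2, 2 * stride, out);
    fft_radix2(x + stride, N / 2, 2 * stride, out + N / 2);
    for (int k = 0; k < N / 2; k++) {
        // Twiddle factor: complex root of unity combining the two halves.
        std::complex<double> w = std::polar(1.0, -2.0 * M_PI * k / N);
        std::complex<double> e = out[k], o = w * out[k + N / 2];
        out[k] = e + o;          // radix-2 butterfly
        out[k + N / 2] = e - o;
    }
}
```

Calling fft_radix2(in, N, 1, out) for a power-of-two N performs the full decomposition; the stride argument plays the role of the digit-reversal reordering.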
B. STOCKHAM ALGORITHM
In the Cooley-Tukey algorithm, the digit-reversal is performed once - before or after performing all steps of the


FIGURE 1. Rader's FFT algorithm step-by-step description. This figure shows how an FFT of length 11 can be computed as a convolution of length 10. The convolution is performed with the help of the convolution theorem (light green), which can be done efficiently with the Stockham algorithm for primes from 2 to 13.

FIGURE 2. Bluestein's FFT algorithm step-by-step description. This figure shows how an FFT of length N can be computed as a convolution of length N* ≥ 2N − 1. The convolution is performed with the help of the convolution theorem (light green), which can be done efficiently with the Stockham algorithm for primes from 2 to 13.

recursive algorithm. Another option is to have intermediate digit-reversals after each small radix FFT. This approach is called the Stockham version of the Cooley-Tukey algorithm and is commonly used in GPU FFT libraries due to its better cache locality at each step [12].

C. THE FOUR-STEP FFT ALGORITHM
The Cooley-Tukey algorithm described above can also be used to represent one big 1D FFT sequence as a 2D FFT of smaller sizes [13]. This method is called the Four-step FFT and is used when the full FFT sequence is too big to be stored in the L1 cache of the GPU, or to split the workload between multiple GPUs. Firstly, we compute FFTs along the columns. Then we multiply the sequence by twiddle factors and perform FFTs along the rows. The digit-reversal stage can be seen as an N2 × N1 transposition in this case and is usually done after the row FFTs to preserve locality.
usually done after the row FFTs to preserve locality. of Rader’s algorithm has another limitation - we rely on
being able to compute a P − 1 length sequence FFT with a
D. RADER’s AND BLUESTEIN’s FFT ALGORITHMS regular Stockham algorithm approach. This is not the case for
The parallelization of Cooley-Tukey and Stockham algo- Sophie-Germain safe primes (numbers like 47, 59, 83) and
rithms is usually done with a single thread (either CPU or sequences divisible by them.
GPU) responsible for a single N1 radix kernel. Then we can For these numbers, another general algorithm can be used -
have N2 = N ÷ N1 threads working at the same time. Bluestein’s algorithm (Fig. 2). This algorithm works for arbi-
To achieve good occupancy on GPUs, N2 has to be big (at trary sequence size N through reformulation of the DFT
least the number of available cores). However, this is often definition:
not the case when N1 is a big prime (for example, when n2 k2 (n − k)2
the full sequence length N is a prime). Another issue is nk = + −
2 2 2
that the generation of optimized radix kernels for big primes N −1
is a non-trivial task in itself. Luckily, there exist two FFT πi 2 X πi 2 πi 2
Xk = e− N k · (xn e− N n )e N (n−k)
algorithms that solve this issue: Rader’s FFT algorithm and
n=0
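The reorder step and the convolution kernel above can be precomputed on the host with a few lines. This is an illustrative sketch, not VkFFT's generator; it assumes P is prime and g is a known generator, which a library has to determine when building the plan:

```cuda
// Builds the Rader permutation a_l = x[g^l mod P] and the convolution kernel
// b_l = exp(-2*pi*i*g^l/P) for a prime P with multiplicative generator g.
#include <complex>
#include <cmath>

static void rader_setup(const std::complex<double>* x, int P, int g,
                        std::complex<double>* a, std::complex<double>* b) {
    long long pw = 1; // g^0 mod P
    for (int l = 0; l < P - 1; l++) {
        a[l] = x[pw];                                         // reorder step
        b[l] = std::polar(1.0, -2.0 * M_PI * (double)pw / P); // Rader kernel
        pw = (pw * g) % P;                                    // next power of g
    }
}
```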
Bluestein’s FFT algorithm [14], [15].
Rader’s FFT algorithm computes the FFT of a prime-sized To get the DFT of an input sequence, we then have to calculate
length P as a P − 1 length cyclic convolution (Fig. 1). The the convolution between the following two sequences (N ∗
main idea behind the algorithm is that residuals [1, P − 1] FFT to IFFT steps in Fig. 2):
πi 2
form a multiplicative group of integers modulo P with a gen- an = xn e− N n
erator g. Below, all generator powers are calculated modulo πi 2
P. Rader then reformulates the FFT in question (reorder steps bn = e N (n−k)
in Fig. 1) as: If we pad them both with zeros to a sequence size higher
P−1
X than 2N − 1 (FFT of which can be done with a Stockham
X0 = xl algorithm) we can remove the circular part of the convolu-
l=0 tion and compute the convolution by using the convolution


theorem. The b_n sequence and its DFT can be precomputed. This algorithm covers all the remaining primes and composite sequences. As Bluestein's algorithm performs convolutions of length N* ≥ 2N − 1, as opposed to Rader's N − 1, it is expected to have lower performance. While usually true, this is at times not the case, as Bluestein's algorithm's length can be chosen to have only small primes, like 2 and 3, allowing for better parallelization.
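A host-side sketch of the chirp construction (illustrative, not VkFFT code; Npad is any radix-decomposable length ≥ 2N − 1, chosen as discussed later in the paper):

```cuda
// Bluestein chirp sequences: a_n = x_n*exp(-pi*i*n^2/N) (zero-padded) and the
// circular kernel built from b_n = exp(+pi*i*n^2/N), mirrored so that the
// cyclic convolution of length Npad reproduces the linear convolution terms.
#include <complex>
#include <cmath>

static void bluestein_setup(const std::complex<double>* x, int N, int Npad,
                            std::complex<double>* a, std::complex<double>* b) {
    for (int n = 0; n < Npad; n++) { a[n] = 0.0; b[n] = 0.0; }
    for (int n = 0; n < N; n++) {
        double phase = M_PI * (double)n * (double)n / N;
        a[n] = x[n] * std::polar(1.0, -phase);   // chirp-modulated input
        b[n] = std::polar(1.0, phase);           // convolution kernel
        if (n > 0) b[Npad - n] = b[n];           // wrap for negative indices
    }
    // Then: FFT(a), multiply by the precomputed FFT(b), IFFT, and multiply
    // the first N outputs by exp(-pi*i*k^2/N) to obtain X_k.
}
```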
E. REAL TO COMPLEX FFTs
If there is prior knowledge of the input structure, some optimizations to the regular algorithms can be made. Real-to-complex (R2C) and complex-to-real (C2R) transforms can be explained as complex-to-complex (C2C) transforms with the imaginary part set to zero [16]. They exploit the Hermitian symmetry of the result: X_k = X*_{N−k}. This results in the reduction of the required memory to store the complex result - we may only store floor(N_x/2) + 1 complex numbers instead of N_x. The computational complexity can also be reduced: two real sequences can be packed as one complex sequence, or an even real sequence can be calculated as a complex sequence of half the length. Overall, these two optimizations result in a 2x speedup over the regular C2C transformation.
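For reference, the "two real sequences packed as one complex sequence" trick unpacks with the Hermitian symmetry alone (a standard identity, stated here for completeness rather than as VkFFT-specific notation): if z_n = x_n + i y_n and Z = DFT_N(z), then

$$X_k = \frac{1}{2}\left(Z_k + Z^{*}_{N-k}\right), \qquad Y_k = \frac{1}{2i}\left(Z_k - Z^{*}_{N-k}\right),$$

with indices taken modulo N, recovering the DFTs of both real sequences from a single complex transform of the same length.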
F. REAL TO REAL FFTs
Real-to-real transforms in VkFFT are implemented in the form of Discrete Cosine Transforms of types I, II, III and IV. Their definitions and transforms match the FFTW implementation [17], [18], [19], [20]:

1) DCT-I: $X_k = x_0 + (-1)^k x_{N-1} + 2\sum_{n=1}^{N-2} x_n \cos\left(\frac{\pi nk}{N-1}\right)$, the inverse of DCT-I (itself)
2) DCT-II: $X_k = 2\sum_{n=0}^{N-1} x_n \cos\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right)$, the inverse of DCT-III
3) DCT-III: $X_k = x_0 + 2\sum_{n=1}^{N-1} x_n \cos\left(\frac{\pi}{N}n\left(k+\frac{1}{2}\right)\right)$, the inverse of DCT-II
4) DCT-IV: $X_k = 2\sum_{n=0}^{N-1} x_n \cos\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)\left(k+\frac{1}{2}\right)\right)$, the inverse of DCT-IV (itself)

R2R transforms are performed by redefining them as C2C transforms (the internal C2C sequence length can be different from the input R2R sequence length).
III. GPU ARCHITECTURE IN-DEPTH ANALYSIS
Before discussing VkFFT's GPU-specific design choices, it is important to specify which types of hardware are accessible by the GPU and what limitations they have. We will focus on the two most important types of circuitry: compute-related and memory-related.

A. ON-CHIP COMPUTE CIRCUITRY
Different vendors have different GPU compute unit designs; however, as all of them are based on the SIMT architecture, there are plenty of similarities. The main outcome of this design approach is that there is only one scheduler and dispatch unit that manages a relatively large group of cores. This saves space and allows the manufacturers to pack thousands of cores inside a single GPU. We will take Nvidia's compute unit (streaming multiprocessor) as an example hereon [21], [22], [23]. We will omit the discussion of all graphics-related circuitry in this section.

We start with a short description of the scheduler and the dispatch unit. Their job is to provide work assignments in the form of warps to the hardware cores. They can manage multiple kernels on a single GPU and take care of all available compute unit resources.

The source of compute power inside GPUs is their numerous cores. They have a pipelined structure: dispatch port, operand collector, arithmetic unit and result queue [24]. The pipelined structure means that the core can queue new instructions while previous ones are still active or being outputted. The arithmetic unit is often specialized - for integer, single-precision or double-precision floating point operations. Different operations require different numbers of clock cycles to complete. If a warp encounters a bank conflict, it will be stalled, increasing the number of cycles needed to maintain the convergence of the lanes inside a warp.

Often compute units contain specialized circuitry for some specific operations that is useful but too big to fit in each core. Examples of this are special function units (SFUs) and tensor cores. The first allow for fast calculation of single-precision transcendental functions and the second can perform matrix operations [25], [26], [27].

The last type of non-memory circuitry present in compute units is the load/store unit. Their main job is handling memory transfers to cores from the shared memory, caches and global memory. They are designed to feed multiple cores at the same time (according to the SIMT architecture), which results in strict memory coalescing requirements. Nvidia GPUs have 8 load/store units that can process data in 32-byte chunks. The load/store units open up another section of discussion: the different types of memory available on a GPU.

B. GPU MEMORY TYPES
The main principle of the memory hierarchy is: the closer something is located to the cores, the faster the core's access to it will be, and the smaller its size. The overall ranking of memory speeds can be seen in Fig. 3 [23].

FIGURE 3. GPU memory hierarchy. This figure shows the idealized relative scaling of speeds of the GPU memory. The values of SM/L1/L2 and global memory are from the V100 GPU; the value of L3 bandwidth is from the AMD 6900XT GPU (RDNA 2).

1) REGISTER FILE
The fastest available memory on the GPU is located on-chip. It has been limited to 256KB for multiple generations of GPUs, as this size corresponds to the best achievable by


the current technology level as a balance between register file latency (which increases with the size) and the number of resources that warps can access (which also increases with the size) [23]. Depending on the architecture of the GPU, the register file is split into multiple banks, each with a different number of ports capable of handling a single 32-bit memory request per clock cycle. It is possible to exchange data between registers of threads from a single warp via specialized subgroup/warp operations; otherwise, registers are thread-local and can only be accessed by the corresponding thread. If threads try to use more registers than the register file can provide, register spilling occurs and data is cached to the global memory at a much lower bandwidth.

2) L0 INSTRUCTION CACHE
This is the cache that stores instructions to be issued by the scheduler and the dispatch unit. Usually, there is no need for interaction from the developer side with this cache, other than avoiding large conditional code jumps inside the kernel - in this case some of the code may not be prefetched to the cache, resulting in additional stalls [23].

3) SHARED MEMORY/L1 CACHE
The design of a combination of shared memory and L1 cache into a single entity with a configurable partition has proven to be effective and has been used by Nvidia (and other vendors) since the Volta architecture. This data cache also resides on-chip and offers high bandwidth and low latency: 28 cycles for an L1 cache hit and 19 cycles for shared memory without bank conflicts on Nvidia's V100, according to [23]. Unlike the register file, all of this memory can be seen and accessed by all threads in a block, which is useful for thread communication purposes. This is achieved by having a larger number of banks so that each of them can process memory requests in parallel - the latest Nvidia GPUs have 32 banks with a 4-byte bank width. If the number of banks is equal to the number of threads and each thread accesses a different bank (or the same word from one bank), the data are transferred in one request and the bandwidth is similar to the register file bandwidth. In the case of a bank conflict, where multiple threads require different data from a single bank, requests become serialized, which can lead to a big bandwidth decrease, up to the case when all threads access one bank. Shared memory has to be explicitly synchronized to avoid race conditions. In Vulkan, static shared memory allocations are limited to 48KB per thread block on Nvidia GPUs, 64KB for AMD GPUs and 32KB-64KB for Intel GPUs [23], [26], [28]. The CUDA and HIP APIs, however, allow dynamic shared memory allocations, increasing this size up to 96KB for V100, 192KB for A100 and 256KB for H100 GPUs, which is extremely helpful for processing more data in a single upload to the chip (at the cost of a lower number of active warps per compute unit/streaming multiprocessor). VkFFT is optimized to use all shared memory available per thread block. The L1 cache of most vendors uses a 32-byte cache line with a 128-byte update granularity, so that every thread from a 32-sized warp gets one 32-bit word per request [23].

4) L2 CACHE (AND L3 CACHE)
This is an intermediate cache between the L1 cache and global memory. It has a bigger size than the L1 cache (usually 1-5MB; 6MB on V100, 40MB on A100, 50MB on H100) and faster speeds and lower latency than global memory - 193 cycles per hit on V100 [23], [25]. It is located on-chip but is not compute unit specific - all compute units are connected to it. The L2 cache is used as an intermediate buffer: data requested or written by the GPU goes to global memory through the L2 cache, until new information is written over it. On recent Nvidia and AMD GPUs, the L1-L2 cache line is 64 bytes wide. On a cache miss, each sector can be filled with a data transaction from global memory, served in 32-byte granularity - this allows coalescing requirements to be less strict and also eases the strided access performance impact, as the L2 cache will still be able to combine separate requests to fully utilize on-chip cache lines with 128-byte update granularity. On Intel integrated GPUs, all transactions are performed with 64-byte granularity [28]. It is worth noting that the modern GPU generation, such as Nvidia's Ampere and AMD's RDNA2, made a move towards bigger-than-before intermediate cache sizes (40MB L2 cache for the Nvidia DGX A100 and 128MB L3 cache for the AMD Radeon RX 6000 series) and allows for better control of them.

5) GRAPHICS CARD GLOBAL MEMORY
Many GPUs have access to big amounts of high-bandwidth dedicated memory; however, it is located off-chip and poses a big latency hit (> 200 cycles per request). Typical values of bandwidth that global memory provides are 1.5TB/s for HBM2 memory, 500GB/s for GDDR6 and 1TB/s for GDDR6X


memory (specification values) [23]. Integrated GPUs often use memory reserved from the system's random access memory (RAM) and are dependent on its bandwidth - the latest Apple M1 Pro systems use LPDDR5 technology and can achieve up to 200GB/s of bandwidth per chip. Data transactions from global memory are served in 32/64-byte granularity, so to fully utilize bandwidth it is important to avoid strided accesses from threads to global memory. If all threads from a warp request data from one global memory page and these requests correspond to multiple consecutive 32/64 bytes in memory, each of these transfers can be performed in one transaction and then combined into a single 128-byte transaction in the L1 cache line update. This technique is called memory coalescing and it is essential for memory-bound problems. An important note is that accesses to different VRAM pages often cannot be combined by the L2 cache into a single 128-byte transaction in the L1 cache line. In this case, memory accesses have to be requested with a 128-byte granularity from the beginning.
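A toy kernel pair illustrating the difference (a generic CUDA sketch, not VkFFT code): the first version lets consecutive threads read consecutive elements, so accesses coalesce into full transactions; the second makes each thread read with a stride and breaks coalescing.

```cuda
// Coalesced: thread t of a warp reads element base+t, so a 32-thread warp
// touches 128 consecutive bytes - one combined transaction per warp.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighboring threads read elements `stride` apart, so each thread
// lands in a different 32-byte sector and the warp issues many transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long j = (long long)i * stride;
    if (j < n) out[i] = in[j];
}
```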
Another source of memory problems related to global memory is L2 cache port serialization, and it is especially relevant to AMD GPUs [29]. If the memory accesses have very large distances in global memory between them (more than 256KB-64MB), even if they are locally coalesced, they will use only a limited number of cache ports and have lower total bandwidth. There are multiple ways to combat port serialization, all of which try to increase the locality of data accesses as much as possible. First, it can be reduced by increasing the number of bytes coalesced. Secondly, avoiding reordering in the Four-step FFT algorithm also achieves similar results. Overall, it has not yet been solved fully in VkFFT.

On Nvidia, distant coalesced accesses can result in a 2x reduction of L2 bandwidth if they are not 32-byte aligned - for example, in the case of odd-length Four-step FFTs. In this case, the memory controller often requires two instructions per memory request.

6) SYSTEM MEMORY
This is the computer's primary storage, which is connected through a PCI-E lane to a discrete GPU, or through a memory controller if the GPU is integrated. The PCI-E bandwidth is usually at least an order of magnitude lower than the dedicated memory bandwidth, so it is best to have as few requests to system memory during execution as possible.
C. GPU ALGORITHM BOTTLENECKS
There are multiple bottlenecks that can be encountered in GPU programming which correspond directly to the raw computational power of the GPU and the data transfer speeds. The first type is the compute-bound bottleneck, when a GPU is limited by how many operations it can perform. Usually, in this case, the bus load is low and GPU core utilization is high. The main approaches to alleviating this bottleneck would be to reduce the number of performed operations via algorithm refinement, get a more powerful GPU, or switch to a multiple GPU setup. However, recent HPC GPUs have tens of thousands of cores and usually can perform 10-100 mathematical operations per single global memory request.

This brings us to the second type of GPU problem: global memory bandwidth-bound problems. They happen when a GPU is limited by how much information it can transfer from its main VRAM pool per second. In this case, the L2-global bus load will stay high (often close to 100% of the theoretical peak) throughout the whole execution time, and the time taken only by data transfers will constitute the largest portion of the overall time spent. Algorithm refinement can still be helpful in this case; however, it has to be performed from the standpoint of reducing memory transfers, even at the cost of increasing the number of operations. A more powerful GPU will only gain performance scaling with the global memory bandwidth in this case, often independent of the number of compute units.

Moving up the bandwidth hierarchy, it is worth noting another intermediate type of problem: shared memory bandwidth-bound bottlenecks. They occur when an algorithm is optimized to split the workload efficiently between compute units, not requiring much communication between them, but requiring large numbers of data transfers between threads within a compute unit. These algorithms are often much more complex than the ones in the previous two categories, so the best solution there is to improve shared memory communication patterns. They also scale almost linearly with the number of compute units, so getting a more powerful GPU (the same solution as for compute-bound problems) works in this case as well.

Some FFT-related algorithms that illustrate the three mentioned bottlenecks are given below:
1) Pure compute-bound problems: polynomial expansion sin/cos calculations, FP64 calculations on systems with a low FP64:FP32 core ratio.
2) Pure global memory bandwidth-bound problems: Stockham FFT with small primes.
3) Shared memory bandwidth-bound problems: Rader's algorithm (especially its direct multiplication version), Bluestein's algorithm.

Section V will go into detail on how VkFFT tackles these problems.

IV. THE STRUCTURE OF VkFFT
This section will describe the implementation of the VkFFT library, focusing on the runtime kernel generation platform it is based on and how it allows combining and optimizing all of the implemented GPU algorithms.

A. GENERAL VkFFT DESCRIPTION
The VkFFT library is released under an MIT license as a header-only interface library written in C. It generates device-optimized GPU code at run-time, with an option to reuse the code.


VkFFT adopts the usual multidimensional memory layout: data is stored in the following order (sorted by increasing strides): the width, the height, the depth, the coordinate (the number of feature maps) and the batch number.

Similarly to other GPU FFT libraries, VkFFT uses the Stockham algorithm to avoid a separate digit-reversal stage. VkFFT uses a radix-decomposition approach for sequences decomposable as a multiplication of an arbitrary number of the following primes: 2/3/5/7/11/13. These sequences have comparable performance to that of powers of 2.

VkFFT uses the convolution theorem version of Rader's FFT algorithm for primes from 17 up to the maximum shared memory length (~10000). VkFFT uses the direct multiplication Rader FFT algorithm for small Sophie Germain safe primes: 47, 59 and 83. Both versions are inlined and can be viewed as an extension to radix kernels computed with multiple threads per prime. Rader's algorithm works without additional memory transfers, except for the look-up tables (LUT) containing Rader's kernel values.

Bluestein's FFT algorithm is used for all other sequences. It is optimized to have as few memory transfers as possible by using the zero-padding and merged convolution support of VkFFT. VkFFT also allows choosing which sequence to pad to in Bluestein's algorithm.
VkFFT supports complex-to-complex (C2C), real-to-complex (R2C), complex-to-real (C2R) transformations and real-to-real (R2R) Discrete Cosine Transformations of types I, II, III and IV.

VkFFT at the moment of this paper's creation is a single-GPU implementation with the following current FFT-size limits in (x, y, z) dimensions, respectively: C2C or even-length C2R/R2C - (2^32, 2^32, 2^32); odd-length C2R/R2C - (2^12, 2^32, 2^32); R2R - (2^12, 2^12, 2^12). These values depend on the amount of shared memory of the device.

VkFFT supports single, double and half precision (the latter only used for data storage, with computations performed in single precision). VkFFT has two modes of operation: it can calculate the sines and cosines used in twiddle factors directly on the GPU (using special function units (SFUs), if the GPU has them, or using a polynomial approximation of transcendental functions), or use values precomputed on the CPU and stored in the LUT. The CPU precomputation can be done in higher precision, for example FP128. The LUT approach allows GPUs to get better precision (it is the default mode of operation in FP64) and, for GPUs with a low double-precision core count, to reduce the amount of performed arithmetic operations.
Real-to-complex, complex-to-real and real-to-real transforms are optimized in the multidimensional case by packing two consecutive sequences as the real and imaginary parts of a complex sequence. The packing and unpacking do not require additional memory transfers and reduce the number of performed transforms by a factor of two. If an FFT cannot be performed in one upload to the chip, the Four-step FFT algorithm is used to split the sequence into a combination of smaller FFTs. VkFFT performs the FFT in-place, with an option to use an additional buffer to do an out-of-place transform or perform the transposition in the Four-step FFT algorithm.

VkFFT has the option to optimize generated kernels for convolution calculation and system zero-padding with enhanced performance. VkFFT supports 1 × 1, 2 × 2 and 3 × 3 matrix convolutions with a symmetric or nonsymmetric kernel, and multiple feature/batch convolutions: one input, multiple kernels. VkFFT supports native zero-padding to model open systems. One can specify the range of sequences filled with zeros and the direction in which zero-padding is applied (at the read or write stage).

VkFFT works on Nvidia, AMD, Intel and Apple GPUs. VkFFT supports all GPUs, ranging from low-power mobile GPUs up to HPC GPUs. VkFFT works on Windows, Linux and macOS.

VkFFT supports Vulkan, CUDA, HIP, OpenCL, Level Zero and Metal APIs as backends.
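As a usage reference, the following is a minimal sketch against the public header. The field and function names (VkFFTConfiguration, initializeVkFFT, VkFFTAppend, deleteVkFFT) follow the samples in the VkFFT repository, but context creation, error handling and stream control for the CUDA backend are simplified here and should be checked against the repository samples:

```cuda
// Minimal single-axis C2C plan with VkFFT's CUDA backend (sketch).
#define VKFFT_BACKEND 1        // 1 selects the CUDA backend
#include "vkFFT.h"

VkFFTResult run_forward_fft(CUdevice* device, void** gpuBuffer, uint64_t n) {
    VkFFTConfiguration configuration = {};
    VkFFTApplication app = {};
    configuration.FFTdim = 1;                 // 1D transform
    configuration.size[0] = n;                // sequence length
    configuration.device = device;            // CUDA device handle
    uint64_t bufferSize = 2ull * sizeof(float) * n; // interleaved complex
    configuration.bufferSize = &bufferSize;
    VkFFTResult res = initializeVkFFT(&app, configuration); // Plan+Code stages
    if (res != VKFFT_SUCCESS) return res;
    VkFFTLaunchParams launchParams = {};
    launchParams.buffer = gpuBuffer;          // device memory with the data
    res = VkFFTAppend(&app, -1, &launchParams); // -1: forward, 1: inverse
    deleteVkFFT(&app);
    return res;
}
```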
B. VkFFT KERNEL GENERATION PLATFORM
VkFFT has a hierarchical structure design: Application -> Plan -> Code. This allows the creation of code optimizations for the target device architecture at runtime. Below, a more detailed description of the VkFFT platform structure is given, starting from the level that is closest to the user application stage.

1) APPLICATION STAGE
At this stage, the VkFFT platform performs all interactions with the user and resource management. The user interaction covers:
1) Parsing the input configuration given by the user, where all FFT system parameters and hints to the generator are given.
2) Application initialization, which consists of calls to the functions from the next Plan stage, corresponding to the provided configuration.
3) Application update, which can be used to update some of the parameters after plan creation calls, for example providing different input/output buffers.
4) Application dispatch, which appends the corresponding plan dispatch command to the user's queue of GPU actions.
5) Application deletion, which frees resources allocated by VkFFT.
6) Binaries caching, which allows reusing code generated with the same input configuration.

2) PLAN STAGE
This is the internal configuration stage that constructs the FFT plan - the intermediate representation of the FFT problem. This can be seen as a preparation and post-handling step to code generation. Overall, the processes included in this stage are:
1) The Plan stage is called by the application. It is provided with a local copy of the user configuration (with some


values being automatically configured, if they were not specified).
2) Decision making - the main algorithm selection step. This step completely defines the structure of the generated kernels, the algorithms used, and the number of threads, registers and shared memory. This step also determines the sequence of primes used by the Stockham FFT algorithm.
3) Resource allocation - LUT, intermediate transposition buffers.
4) Plan initialization, which consists of calls to the functions from the next Code generation stage.
5) Kernel compilation or loading of provided binaries.
6) Plan update handles, called by application update handles.
7) Plan dispatch handles, called by application dispatch handles.
8) Plan deletion, which is also called during application deletion.

3) CODE GENERATION STAGE
This is the lowest stage in the VkFFT platform. Its main goal is to generate a string that will hold GPU code for a particular API, which can later be compiled and used. The code generation stage is further subdivided into three intermediate levels to facilitate the reuse of code:
1) Level 2 kernels - a clear description of the problem via a sequence of calls to lower levels, kernel layout configuration.
2) Level 1 kernels - simple routines: matrix-vector multiplication, FFT, pre- and post-processing, R2C/R2R mappings.
3) Level 0 kernels - memory management, basic math functions inlining, API-dependent definitions.
V. VkFFT ALGORITHMS AND MEMORY TRANSFER OPTIMIZATIONS
The FFT algorithm on most modern GPUs is a heavily memory-bound problem. Due to this fact, the VkFFT library has been designed to have as few global memory to on-chip memory data transfers as possible and is optimized to have efficient shared memory usage. In the best-case scenario, the memory layout should follow these rules:
• No CPU-GPU transfers during execution, except for asynchronous downloads from the GPU.
• Minimize GPU dedicated memory-L2-L1 communication.
• Maximize on-chip memory usage, but avoid register spilling and shared memory bank conflicts, and maintain occupancy - the ratio of the average number of active warps on the compute unit to the maximum number of active warps supported by the compute unit. Low occupancy may result in stalling of either compute or memory resources.

A. VkFFT GLOBAL MEMORY TRANSFERS LAYOUT
One of the main challenges in implementing a GPU version of an FFT library is managing memory coalescing. Strided accesses occur in the multidimensional case when accessing data for axes with non-unit stride, and in the Four-step FFT algorithm when accessing a column set of sequences. The typical approach implemented in many other FFT libraries would be to perform a transposition of a matrix, aligning the FFT sequence to a unit-stride direction. The transposition routine achieves non-strided access by operating on a rectangular kernel, splitting the input multidimensional system into tiles and performing the transposition in shared memory, which does not have a strided access problem if the tile is padded so that every column resides in a different bank. Each of these transpositions, however, requires an additional upload and download of the system from the global memory.

VkFFT incorporates a different way of memory coalescing which does not require a separate transposition and can be used in other algorithms operating on multidimensional data as well. The main idea is to implement a tiled upload approach, similar to the one used in matrix transposition, and group the consecutive sequences spanning along the non-unit stride axis, so that each global memory transaction is filled from neighboring elements with unit stride. To fill a 32-byte data transaction, we have to group at least four sequences in single and two sequences in double precision. This limits the maximum sequence size that can be stored in the shared memory more than the non-strided access pattern does. As the memory uploads are always performed along the unit stride axis, it is guaranteed that reads and writes to any position along the non-unit strided axis will be fully coalesced. This property also allows performing a transposition afterward to return data to the correct order required by the Four-step FFT algorithm. This transposition is merged with the store phase of the last stage of the Four-step algorithm and does not need additional memory transfers. However, it requires an additional memory buffer with the size of the input system, as these writes to global memory are not done in place. An example of a task that does not require a transposition is a convolution performed via the convolution theorem, which will be discussed later.
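The classic tile-in-shared-memory pattern referenced above looks roughly as follows (a generic CUDA sketch of the textbook technique, not VkFFT's generated code); the +1 padding places consecutive rows of a column in different banks:

```cuda
#define TILE 32

// Transpose with coalesced global reads and writes: a 32x32 tile is staged in
// shared memory; the extra column (+1) avoids bank conflicts when the tile is
// read back transposed. Launch with dim3(TILE, TILE) threads per block.
__global__ void transpose_tiled(const float* in, float* out,
                                int width, int height) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;  // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```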
If we assume that the thread block can use 32KB of shared memory (for bigger values the sequence sizes will scale accordingly), then, in the case of unit-strided axis access, up until the sequence size of 4096 single-precision complex floats (exactly 32KB of data) the full sequence can be stored in shared memory, so an FFT can be performed in one upload. For bigger sequences, the Four-step FFT algorithm follows the same logic as above, as it uses a representation of a 1D sequence as a multidimensional matrix, with the one difference that one stage of it has a unit stride. During this stage, if we omit the transposition at the end, FFTs can fill the full 32KB of shared memory, as the memory accesses are non-strided. However, if a transposition is required, it limits this sequence size to 1024 to coalesce column writes after the last FFT step


along the rows. VkFFT currently supports Four-step splits in up to three uploads, so for a system with 32KB of shared memory the maximum sequence that can be done will be 2^30. This length is further limited if the number is decomposable as a multiplication of big primes, like 11^7. In this case, if we split it as 11^7 = 11^3 · 11^2 · 11^2, one of the factors (11^3 = 1331) will be out of the 1024 range. These cases can be handled by coalescing less data or by implementing more-than-three-upload schemes, which will be done in a future release.

One issue that is also worth noting is related to the distant coalesced accesses described before. It occurs in the three-upload scheme during the transposition write part (if we omit the transposition, the performance is restored). VkFFT performs a single transposition in this case, swapping the logical axis of upload 0 and the axis of upload 2 as the last write operation. VkFFT attempts to solve it by coalescing at least 128 bytes of data; however, this only partially helps with the L2-cache additional instructions problem. It is possible that a different transposition pattern is required to solve this issue.

VkFFT allows for multiple buffers for FFT reads/writes, where the FFT system is stored in multiple distinct memory chunks. This is helpful for systems with a limit on the memory allowed per single allocation (usually 4GB) and allows for better reuse of buffers. An example use case of this feature is that VkFFT can reuse multiple small buffers left over after temporary calculations in other parts of the code as a temporary transposition buffer for the Four-step FFT algorithm. It also opens up the possibility of a multi-GPU implementation in the future.
B. VkFFT SHARED MEMORY OPTIMIZATIONS
Efficient use of shared memory is an essential part of having optimal FFT performance. If the global memory transfer count is reduced to the minimum, shared memory bandwidth will be the next limiting factor. Shared memory in VkFFT is used as a communication buffer, where a copy of the FFT system from registers is loaded/stored after each step of the FFT algorithm.

To optimize shared memory usage it is important to avoid bank conflicts. The first stages of the non-strided Stockham DIT algorithm for small sequences often cause problems with this - the writes to shared memory there are also non-strided. VkFFT solves this issue by performing multiple non-strided FFTs with a single thread block and by performing a transposition in shared memory before calculating the FFT. This way, the number of combined sequences will be the lower bound on the number of memory banks used. VkFFT also uses padding (adding free space) between sequences or inside sequences in shared memory if it determines that the access pattern can improve by doing so.

For large sequences that have no occupancy problems, it is beneficial to merge the small radix kernels (like 2, 3 and others) used by the Stockham algorithm. This reduces the possible number of threads per block, increases the number of registers per thread and reduces the number of shared memory communications. VkFFT, at the moment of publication, employs additional kernels for composite lengths of 4, 6, 8, 9, 10, 12, 14, 15, 16 and 32.

During the read/write stages, VkFFT prioritizes optimization for coalesced access patterns to global memory. However, when possible, VkFFT will optimize data transfers directly to/from registers, omitting the shared memory transfers.
it is beneficial to merge small radix kernels (like 2, 3 and direct multiplication of a system FFT and a kernel FFT. This
others) used by the Stockham algorithm. This reduces the allows the performing of convolutions with the same com-
possible number of threads per block, increases the number plexity as FFT, instead of initial O(n2 ). The number of mem-
of registers per thread and reduces the number of shared ory transfers can be reduced in this case if we combine the

VOLUME 11, 2023 12047


D. Tolmachev: VkFFT-A Performant, Cross-Platform and Open-Source GPU FFT Library

last stage of the FFT, the kernel multiplication and the first stage of the inverse FFT in one program, and do not perform the system download from the chip after the FFT, the system upload and download for the multiplication, and the system upload for the inverse FFT - a total of four memory transfers. Each of these transfers is used to transfer the full system at the global memory bandwidth, so it is possible to estimate how much less memory is transferred after this optimization. For example, a 1D batched FFT without the Four-step FFT (one system upload/download per FFT) will drop from three uploads/downloads to only one (plus the kernel memory transfer to the GPU). In the case of a 1D batched FFT with a Four-step FFT, we can omit the transposition step in the algorithm, as the inverse FFT will return data to the original layout. This allows us to not allocate an additional buffer for this transposition and reduces global memory usage by a factor of two. It also allows for better locality and helps to reduce L2 cache serialization, as the last stage of the Four-step algorithm will write output with unit stride.

F. VkFFT ZERO-PADDING OPTIMIZATIONS
The concept of calculating a convolution as a multiplication in the frequency domain (the convolution theorem) has one important property: it yields a circular convolution, which means that it assumes the input data to be periodic. While this holds true for the modeling of periodic systems, it can produce incorrect results on open systems, uninfluenced by anything outside the system. For this specific case, the zero-padding technique is used - we pad the cells along each dimension to double the size. All the extra values are then filled with zeros. While not changing the circular structure of the performed convolution, this trick sets all the influence from the periodic images to zero. The memory layout can be significantly improved in this case. Firstly, as we know which axes are padded with zeros, we can upload only half of them to the GPU and logically set the other half to zero on-chip. Secondly, if we have multidimensional zero-padding, there exist sequences full of zeros, which can be omitted from the calculation altogether, as their FFT is a sequence full of zeros. This allows us to get up to a 2x speed increase for multidimensional FFTs. The decrease in transferred memory also improves all caches' hit rates. VkFFT also allows for frequency zero-padding and a specific selection of the zero-padding range placement in the system. This allows zero-padding high frequencies, which is useful for upscaling applications.

VI. VkFFT BENCHMARKS
To measure the performance of VkFFT, we compare it to Nvidia's cuFFT and AMD's rocFFT libraries, which are vendor solutions developed mainly for HPC use cases. We use Nvidia A100 and AMD MI250 GPUs for the tests, which are the latest publicly released HPC GPUs available at the time of the creation of this paper. Both GPUs were measured to use approximately 250W of power. Nvidia's A100 GPU has 40GB of HBM2 memory and 192KB of shared memory, while AMD MI250's GPU has 64GB of HBM2 memory (we use a single-chip version of it) and 64KB of shared memory [25], [26]. The GPUs use CUDA 11.7 and ROCm 5.2, respectively. We use VkFFT version 1.2.31.

The first test will perform an FFT of multiple batched 1D sequences of all lengths in the 2 to 4096 range, followed by a corresponding inverse FFT. The number of sequences is chosen based on the sequence length to have a constant system size of 512MB-1GB in memory, so all compute units of the GPU are utilized. To mitigate the dispatch call overhead of submitting compute tasks to the GPU and remove random noise, the test performs the FFT and inverse FFT 100 times in a row per dispatch and then averages the time taken. Each test is performed three times to verify the consistency of the results. For memory-bound problems, a good performance representation exists in the form of a theoretical algorithm bandwidth, defined as:

$$\text{Algorithm bandwidth} = \frac{2 \cdot \text{System size [GB]}}{\text{Single transform time [s]}}$$

The peak global memory bandwidth of the A100 and MI250 (obtained by a simple large-data memory copy test) is 1.3TB/s.

A. DOUBLE-PRECISION, ALL SIZES UP TO 4096
The resulting pattern seen in the benchmark plots of Fig. 4 and Fig. 5 will be similar to the ones seen in all other benchmarks. It is closely related to which algorithm is required for a particular FFT length. VkFFT incorporates three main FFT algorithms that cover the whole range of numbers: Stockham autosort, Rader's algorithm (two versions) and Bluestein's algorithm. For cuFFT and rocFFT, the coloring is an educated guess made by analysis of profiling results.

The Stockham autosort algorithm is designated as radix(2-13) on the performance plots (red and black colors). In double precision (DP), all three libraries use it for sequences representable as a multiplication of an arbitrary number of primes up to 13. VkFFT is designed to have as few global memory transfers as possible, so it performs all these sequences in a single upload to the chip, thus achieving close to the peak GPU bandwidth performance for both A100 (Fig. 4) and MI250 (Fig. 5). This is mainly possible due to the fact that VkFFT generates kernels at runtime and does not ship binaries for all sequences. cuFFT, on the other hand, provides only precompiled PTX files and switches to multiple uploads for some of the sequences, which can be detected by some black triangles way below the 1.3TB/s peak bandwidth. rocFFT is developing a runtime kernel generator as well; however, there are still many sequences that fall back to a less optimized implementation, which can be seen by the abundance of black triangle results below 300GB/s (Fig. 5).

For sequences that are decomposable as a multiplication of primes up to 4096 (if the P−1 FFT can be done with the Stockham autosort algorithm), VkFFT uses the FFT-convolution version of Rader's algorithm, shown as cyan circles. Each Rader-FFT algorithm prime requires 2x more shared memory transfers than radix 2-13 kernels, plus an additional global memory
12048 VOLUME 11, 2023


D. Tolmachev: VkFFT-A Performant, Cross-Platform and Open-Source GPU FFT Library

FIGURE 4. Benchmark of the Nvidia A100 GPU with VkFFT (circles) and cuFFT (triangles) in batched 1D double-precision FFT+IFFT computations. Different colors represent the different algorithms used for a particular sequence size: red and black are radix decomposition, magenta and green are Bluestein's algorithm, cyan and orange are Rader's algorithm.

FIGURE 5. Benchmark of the AMD MI250 GPU with VkFFT (circles) and rocFFT (triangles) in batched 1D double-precision FFT+IFFT computations. Different colors represent the different algorithms used for a particular sequence size: red and black are radix decomposition, magenta and green are Bluestein's algorithm, cyan is Rader's algorithm.

LUT access for Rader's kernel (which becomes noticeable for big primes). For some small primes - 47, 59 and 83 - which have non-radix-2-13 primes in their Rader decomposition (47 − 1 = 46, which is divisible by 23; 23 is called a Sophie Germain prime and 47 is called a safe prime), it is still beneficial to use the direct multiplication version of Rader's algorithm. However, its performance is inferior to Bluestein's algorithm for primes after 83 due to its high number of shared memory data transfers - this algorithm is shared memory bandwidth-limited. The cuFFT library only has the direct


FIGURE 6. Benchmark of Nvidia A100 GPU with VkFFT (circles) and cuFFT (triangles) in batched 1D single-precision FFT+IFFT computations. Different colors represent different algorithms used for this particular sequence size: red and black are radix decomposition, magenta and green are Bluestein's algorithm, cyan is Rader's algorithm.
The cuFFT library only has the direct multiplication version of Rader's algorithm, implemented for primes up to 127, shown as orange triangles. This library also often does not merge the prime FFT codes into a single upload kernel - this can be deduced from the horizontal orange lines aggregating at approximately 650GB/s, 400GB/s and later 300GB/s - exactly 1:2, 1:3 and 1:4 of the peak bandwidth of 1.3TB/s. rocFFT does not have Rader's algorithm implemented (Fig. 5).

For all other sequences divisible by Sophie Germain safe primes (or numbers non-decomposable by the previous two algorithms) in VkFFT, for primes after 128 in cuFFT and for primes after 17 in rocFFT (approximately), Bluestein's algorithm is used. It is shown as magenta circles for VkFFT and green triangles for the other libraries in Fig. 4 and Fig. 5. In VkFFT, the sequence length of at least 2N − 1 to pad to has to be chosen as a balance between a highly decomposable size (like a power of 2) and the amount of memory transferred. For example, 1031 is a prime number non-decomposable with Rader's algorithm (1030 is divisible by 103, which is not a radix 2-13 prime), so we have to pad it to at least 2061. However, the next power of 2 is 4096, which is almost another 2x increase in size. Extensive testing of all radix 2-13 decomposable numbers from 2061 to 4096 shows that the best performance is achieved by padding to 2187 = 3^7. This prior testing, the convolution support in VkFFT (convolutions are optimized to have as few memory transfers as possible) and the zero-padding optimizations make VkFFT's Bluestein's algorithm faster than both the cuFFT and rocFFT implementations. It is also worth noting that the 64KB of shared memory in MI250 can only support FP64 complex sequences up to 4096, so, taking Bluestein's padding into account, all sequences that have to pad to a number bigger than 4096 have to be done in two uploads in VkFFT - this can be seen as a drop in performance after 2048 for the magenta circles.

B. SINGLE-PRECISION, ALL SIZES UP TO 4096
In single precision (Fig. 6 and Fig. 7), VkFFT uses the same algorithm configuration as in double precision (radix+Rader+Bluestein). The main difference comes from the usage of the GPU's special function units for the calculation of sines and cosines - so less memory is transferred for bigger sequences. For MI250 (Fig. 7), all sequences from the range (including Bluestein's padded ones) can fit into 64KB of shared memory and are done in a single upload - so there is no performance drop at 2048 (it happens after 4096). VkFFT also uses the packed math instructions available in MI250 - this GPU can perform two FP32 instructions at the same time. For some reason, cuFFT does not use Rader's algorithm in single precision and switches to Bluestein's algorithm for primes after 17 (Fig. 6). Rader's algorithm implementation in VkFFT works just as well in FP32 as in FP64.

C. DOUBLE-PRECISION, RADIX 2-7, ALL SIZES UP TO 2^30
To test the performance of VkFFT on the whole supported range of transforms, we select all sequences decomposable as a multiplication of an arbitrary number of 2s, 3s, 5s and 7s (Fig. 8 and Fig. 9). The main consideration behind this choice is that these are the most commonly used sizes and that testing all 2^30 numbers is unfeasible for a developing library.
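The padding-size search described above can be illustrated with a short sketch (this shows the selection space, not the VkFFT heuristic itself): it enumerates the sizes of at least 2N − 1 that factor entirely into radix 2-13 primes; per the text, the final choice (2187 for N = 1031) comes from benchmarking such candidates rather than from simply taking the smallest one:

    #include <stdio.h>

    /* True if n factors completely into the radix primes 2..13. */
    static int is_radix_2_13(long long n) {
        const int primes[] = {2, 3, 5, 7, 11, 13};
        for (int i = 0; i < 6; i++)
            while (n % primes[i] == 0) n /= primes[i];
        return n == 1;
    }

    int main(void) {
        const long long N = 1031;    /* prime, not Rader-decomposable */
        long long c = 2 * N - 1;     /* Bluestein needs at least 2N-1 = 2061 */
        for (int shown = 0; shown < 8; c++)
            if (is_radix_2_13(c)) { printf("candidate: %lld\n", c); shown++; }
        return 0;
    }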
FIGURE 7. Benchmark of AMD MI250 GPU with VkFFT (circles) and rocFFT (triangles) in batched 1D single-precision FFT+IFFT computations. Different colors represent different algorithms used for this particular sequence size: red and black are radix decomposition, magenta and green are Bluestein's algorithm, cyan is Rader's algorithm.
FIGURE 8. Benchmark of Nvidia A100 GPU with VkFFT (red circles) and cuFFT (black triangles) in batched 1D double-precision FFT+IFFT computations for big sequence lengths decomposable as a multiplication of primes up to 7. The FFT length is on a logarithmic scale.
For big sequences, all uploads have to be coalesced - the sequence is represented as a 2D or 3D array. Assuming a 32-byte cache line length, this means that we have to upload at least two 16-byte complex numbers at a time. For the non-strided axis (the last one) there exist two options: if we need to restore the correct FFT layout, we have to perform the coalesced transposition there. This means that it is essentially also treated as a strided axis, as the final write will be strided.
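The coalescing requirement above is simple arithmetic over the transaction size; a tiny sketch (the 32-byte and 128-byte granularities are the ones that come up in this section):

    #include <stdio.h>

    /* How many sequences must be laid out side by side so that one memory
       transaction of the given granularity is fully used. */
    int main(void) {
        const int granularity[] = {32, 128};            /* bytes */
        const int fp64_complex = 16, fp32_complex = 8;  /* bytes per element */
        for (int i = 0; i < 2; i++)
            printf("%3d-byte transactions: %d FP64 or %d FP32 complex sequences\n",
                   granularity[i], granularity[i] / fp64_complex,
                   granularity[i] / fp32_complex);
        return 0;
    }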
FIGURE 9. Benchmark of AMD MI250 GPU with VkFFT (red circles) and rocFFT (black triangles) in batched 1D double-precision FFT+IFFT computations for big sequence lengths decomposable as a multiplication of primes up to 7. The FFT length is on a logarithmic scale.
However, if we do not have to restore the correct layout (for example, in a convolution calculation the inverse FFT will restore the correct layout anyway), the transposition can be omitted and we can process bigger sequences in fewer memory transfers. This is used in Bluestein's FFT implementation of VkFFT for big sequences.

The switch to a two-upload scheme happens at approximately 10500 for A100 (Fig. 8) and at 4096 for MI250 (Fig. 9) due to the shared memory limitations. The switch to the three-upload scheme happens at 2^22 for A100 and at 2^19 for MI250. The four-step algorithm requires a single twiddle factor multiplication for each of the uploads except the first. For GPUs with a high double-precision core count, it turns out to be beneficial to use a slow polynomial approximation of sines and cosines and pay the price of additional FMA instructions rather than transfer additional data from the LUT. For consumer-level Nvidia GPUs, this is not the case, so VkFFT has both options available.

For the three-upload scheme, multiple other problems have been encountered, all closely related to the final transposition (they are not present if the transposition to restore the layout is not required). The first issue is related to the size of the systems becoming quite big (1GB and more): the L2 cache is often not able to combine four 32-byte transactions to fill a 128-byte cache line. This can happen if the final global memory requests target different memory banks. So, VkFFT coalesces to 128 bytes in the three-upload scheme, if possible. The second issue is also related to L2 cache performance: if the stride is odd, the access pattern cannot be aligned to 128-byte granularity, so the L2 cache has to serve twice the number of memory requests, thus limiting its maximum bandwidth by a factor of up to 2x. A possible solution is to reimplement the single transposition as one with multiple intermediate transpositions, but this has not been tested in VkFFT yet. The last issue mainly concerns AMD GPUs: for coalesced accesses with big strides, they can experience memory pin serialization, where all memory transactions to global memory happen through a single memory pin. This issue has been known since the Polaris generation of GPUs [29] and there is no simple workaround for it.

D. SINGLE-PRECISION, RADIX 2-7, ALL SIZES UP TO 2^30
Single-precision results (Fig. 10 and Fig. 11) resemble the double-precision results, as both A100 (Fig. 10) and MI250 (Fig. 11) have a high FP32 to FP64 core ratio. The main difference is that coalescing to 32 bytes requires four sequences in FP32 instead of two sequences in FP64. This moves the switch from a single upload to multiple uploads from 10500 to 21000 for A100 and from 4096 to 8192 for MI250. All the issues related to the final transposition are present here as well.

E. SINGLE-PRECISION, MULTIDIMENSIONAL PERFORMANCE, CUBES WITH ALL SIZES UP TO 1024
This test (Fig. 12 and Fig. 13) evaluates VkFFT performance on a wide range of multidimensional systems, from small ones that are not able to saturate a single compute unit (like 2^3) to excessively big ones filling the full memory of a single GPU (like 1024^3). The bandwidth is multiplied by 3 compared to the 1D cases, as it is expected that all libraries use separate uploads for each of the axes. This way it is easier to compare the results to the peak theoretical bandwidth of the GPU. Red circles correspond to VkFFT results and black triangles to competitor library results. The performance plots also show the zero-padding performance (cyan circles), which will be described in the next subsection.
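To make the four-step decomposition behind these multi-upload schemes concrete, here is a minimal host-side sketch; naive O(n^2) DFTs stand in for the generated GPU kernels, and the layout-restoring transposition is written out explicitly as the last step (illustrative only):

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    /* Four-step FFT for N = N1*N2: (1) length-N1 DFTs over columns,
       (2) twiddle multiplication by exp(-2*pi*i*k1*j2/N), (3) length-N2 DFTs
       over rows, (4) transposition to restore the natural output order. */
    static void dft(double complex *x, int n, int stride) {
        double complex tmp[64];
        for (int k = 0; k < n; k++) {
            double complex s = 0;
            for (int j = 0; j < n; j++)
                s += x[j * stride] * cexp(-2.0 * I * M_PI * j * k / n);
            tmp[k] = s;
        }
        for (int k = 0; k < n; k++) x[k * stride] = tmp[k];
    }

    int main(void) {
        enum { N1 = 4, N2 = 8, N = N1 * N2 };
        double complex x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = cos(2.0 * M_PI * 3.0 * i / N);

        /* View x as an N1 x N2 row-major matrix: x[j1*N2 + j2]. */
        for (int j2 = 0; j2 < N2; j2++) dft(&x[j2], N1, N2);        /* step 1 */
        for (int k1 = 0; k1 < N1; k1++)                             /* step 2 */
            for (int j2 = 0; j2 < N2; j2++)
                x[k1 * N2 + j2] *= cexp(-2.0 * I * M_PI * k1 * j2 / N);
        for (int k1 = 0; k1 < N1; k1++) dft(&x[k1 * N2], N2, 1);    /* step 3 */
        for (int k1 = 0; k1 < N1; k1++)                             /* step 4 */
            for (int k2 = 0; k2 < N2; k2++)
                y[k2 * N1 + k1] = x[k1 * N2 + k2];

        for (int k = 0; k < N; k++)              /* cosine input: spikes at k = 3, 29 */
            printf("X[%2d] = %6.2f %+6.2fi\n", k, creal(y[k]), cimag(y[k]));
        return 0;
    }

Only the steps after the first carry the extra twiddle multiplication, which is where the LUT-versus-polynomial trade-off described above comes in.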
FIGURE 10. Benchmark of Nvidia A100 GPU with VkFFT (red circles) and cuFFT (black triangles) in batched 1D single-precision FFT+IFFT computations for big sequence lengths decomposable as a multiplication of primes up to 7. The FFT length is on a logarithmic scale.
FIGURE 11. Benchmark of AMD MI250 GPU with VkFFT (red circles) and rocFFT (black triangles) in batched 1D single-precision FFT+IFFT computations for big sequence lengths decomposable as a multiplication of primes up to 7. The FFT length is on a logarithmic scale.
and Bluestein’s algorithms in the multidimensional case as units are idle or have not enough workload. A100 and MI250
well and all performance considerations from the previous performance in this range is mostly limited by the dispatch
sections also hold true to this case. latencies.
System sizes taking up to 256KB (cubes up to 25 ) experi- System sizes of 512KB-8MB (cubes up to 27 ) experience
ence a high L1 hit rate but have the lowest amount of available a high L2 hit rate and have a large number of available warps,
warps, which results in low GPU utilization as many compute active compute units and occupancy compared to smaller
FIGURE 12. Benchmark of Nvidia A100 GPU with VkFFT (red circles) and cuFFT (black triangles) in 3D single-precision FFT+IFFT cube computations. This plot also shows the zero-padding optimization performance of VkFFT (cyan circles).
FIGURE 13. Benchmark of AMD MI250 GPU with VkFFT (red circles) and rocFFT (black triangles) in 3D single-precision FFT+IFFT cube computations. This plot also shows the zero-padding optimization performance of VkFFT (cyan circles).
The nature of the test (it performs multiple FFTs and inverse FFTs) utilizes this high L2 hit rate, resulting in higher than peak global memory bandwidth. The L2 cache of the A100 is much faster than the L2 cache of the MI250, achieving 2.4TB/s on a 128^3 system.

System sizes bigger than the L2 cache (cubes up to 2^10) have a performance determined mainly by the global memory bandwidth. The memory pin serialization of AMD GPUs is most noticeable here for the sizes divisible by high powers of 2.
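These size classes follow directly from the single-precision complex footprint of a cube; a small sketch (the cache capacities in the comment are assumptions based on vendor documentation, not measured here):

    #include <stdio.h>

    /* FP32 complex footprint of an N^3 cube, to relate cube sizes to the
       cache levels discussed above (assumed: 40MB L2 on A100, 8MB L2 per
       die on MI250). */
    int main(void) {
        for (int n = 8; n <= 1024; n *= 2) {
            double bytes = (double)n * n * n * 8.0; /* 8 bytes per FP32 complex */
            printf("%5d^3 cube: %10.2f MB\n", n, bytes / (1024.0 * 1024.0));
        }
        return 0;
    }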
FIGURE 14. Benchmark of DCT-I with VkFFT (red circles for AMD MI250, green triangles for Nvidia A100) compared to FFTW (black squares for AMD EPYC 7742) in batched 1D double-precision forward+inverse transform computations.
FIGURE 15. Benchmark of DCT-II/III with VkFFT (red circles for AMD MI250, green triangles for Nvidia A100) compared to FFTW (black squares for AMD EPYC 7742) in batched 1D double-precision forward+inverse transform computations.
F. ZERO-PADDING PERFORMANCE
Fig. 12 and Fig. 13 also demonstrate the zero-padding optimization performance in the multidimensional case (cyan circles). In the 3D case, the reduced memory transfers and the increased cache hit rate of zero-padding result in a 100% performance increase (on system sizes not affected by low occupancy, as the additional branch instructions executed there can make the code slower). The performance increase is most noticeable on system sizes bigger than the L2 cache, which are mainly determined by the global memory bandwidth.

G. DISCRETE COSINE TRANSFORMS (R2R) PERFORMANCE
The performance plots of Fig. 14-16 present double-precision performance results of DCTs of types I-IV. The test configuration is the same as for the C2C case in double precision.
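As an illustration of how the zero-padding evaluated in subsection F is enabled, below is a configuration sketch. It assumes the plan-creation fields documented in recent VkFFT releases (performZeropadding, fft_zeropad_left, fft_zeropad_right); the backend-specific setup of the device, queues and buffers is omitted, so treat this as a sketch rather than a complete program:

    #include "vkFFT.h"

    /* Sketch (assumed field names from the VkFFT documentation): configure a
       3D plan that treats the upper half of every axis as zeros, so those
       sequences are never read or computed. */
    void configure_zeropadding(VkFFTConfiguration *configuration) {
        configuration->FFTdim = 3;
        configuration->size[0] = 256;
        configuration->size[1] = 256;
        configuration->size[2] = 256;
        for (int i = 0; i < 3; i++) {
            configuration->performZeropadding[i] = 1;
            /* Zeros occupy [size/2, size): only the lower half is transferred. */
            configuration->fft_zeropad_left[i] = configuration->size[i] / 2;
            configuration->fft_zeropad_right[i] = configuration->size[i];
        }
        /* A frequencyZeroPadding flag selects zero-padding in frequency space
           instead (useful, e.g., for upscaling), per the library documentation. */
    }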
FIGURE 16. Benchmark of DCT-IV with VkFFT (red circles for AMD MI250, green triangles for Nvidia A100) compared to FFTW (black squares for AMD EPYC 7742) in batched 1D double-precision forward+inverse transform computations.
FIGURE 17. FP64 precision comparison of VkFFT (black squares for AMD MI250, cyan triangles for Nvidia A100), cuFFT (green diamonds for Nvidia A100) and rocFFT (red circles for AMD MI250) against FP128 version of FFTW for a single forward FFT [17].
We compare the performance of the AMD EPYC 7742 (64-core) CPU running multithreaded FFTW against the Nvidia A100 and AMD MI250 GPUs running VkFFT, as no other GPU vendor library supports DCTs [17]. The high bandwidth of GPU memory allows VkFFT to greatly outperform the CPU implementation in FFTW.

VII. VkFFT PRECISION
VkFFT precision is verified by comparing its results with the FP128 version of FFTW. We test all FFT lengths from the [2, 100000] range. We perform the tests in double (Fig. 17) and single (Fig. 18) precision on random input data from the [−1; 1] range.
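A minimal sketch of the error metric assumed in this comparison (a standard relative L2 norm against the high-precision reference; the exact definition used for the plots may differ):

    #include <math.h>
    #include <stdio.h>

    /* Relative L2 error of a tested transform against a reference result
       (plain double arrays here; real and imaginary parts can be interleaved). */
    static double rel_l2_error(const double *test, const double *ref, int n) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            double d = test[i] - ref[i];
            num += d * d;
            den += ref[i] * ref[i];
        }
        return sqrt(num / den);
    }

    int main(void) {
        const double ref[4]  = {1.0, 0.5, -0.25, 0.125};
        const double test[4] = {1.0 + 1e-16, 0.5, -0.25, 0.125 - 1e-16};
        printf("relative L2 error: %g\n", rel_l2_error(test, ref, 4));
        return 0;
    }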
FIGURE 18. FP32 precision comparison of VkFFT (black squares for AMD MI250, cyan triangles for Nvidia A100), cuFFT (green diamonds for Nvidia A100) and rocFFT (red circles for AMD MI250) against FP128 version of FFTW for a single forward FFT.
The test results are compared to the corresponding results of Nvidia's cuFFT and AMD's rocFFT libraries.

For both precisions, all tested libraries exhibit logarithmic error scaling. The main source of error is the imprecise twiddle factor computation - the sines and cosines used by the FFT algorithms. For FP64 (Fig. 17), they are calculated on the CPU either in FP128 or in FP64 and stored in look-up tables. With FP128 precomputation, VkFFT is more precise than cuFFT and rocFFT - its L2 error stays below 10^-15 on the full test range.

For FP32 (Fig. 18), twiddle factors can be calculated on the fly in FP32 or precomputed in FP64/FP32. With FP32 twiddle factors, VkFFT is slightly less precise in Bluestein's and Rader's algorithms. If needed, this can be solved with FP64 precomputation.

VIII. CONCLUSION
This paper presents VkFFT - a cross-platform, open-source and performant FFT library for GPU accelerators. Several GPU memory management and optimization techniques are described in detail. A number of novel solutions to the problems limiting GPU performance are presented and tested in this paper. By applying the described memory layouts and the developed solutions to the FFT problem, VkFFT is able to show better performance than already established GPU FFT libraries, while being scalable and tunable to a much wider range of the GPUs present on the market. Full open-source access to the source code of VkFFT allows for the implementation of user-specific optimizations, as demonstrated by the example of native zero-padding, in which case the FFT experiences up to a 2x speed increase (Fig. 12 and Fig. 13). The results of this paper also provide a platform for multiple-API, cross-platform GPU code generation.

ACKNOWLEDGMENT
The author would like to thank the Swiss National Supercomputing Centre (CSCS) for the provision of computing resources under allocation s1111.

REFERENCES
[1] J. W. Cooley, P. A. W. Lewis, and P. D. Welch, "The fast Fourier transform and its applications," IEEE Trans. Educ., vol. E-12, no. 1, pp. 27–34, Mar. 1969, doi: 10.1109/TE.1969.4320436.
[2] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, no. 90, pp. 297–301, Apr. 1965.
[3] B. Lloyd, C. Boyd, and N. Govindaraju, "Fast computation of general Fourier transforms on GPUs," Microsoft Corp., Redmond, WA, USA, Tech. Rep. MSR-TR-2008-62, Apr. 2008. [Online]. Available: https://www.microsoft.com/en-us/research/publication/fast-computation-of-general-fourier-transforms-on-gpus/
[4] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?" Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008, doi: 10.1145/1365490.1365500.
[5] NVIDIA. (2020). cuFFT Library. Accessed: Dec. 14, 2022. [Online]. Available: https://docs.nvidia.com/cuda/cufft
[6] AMD. (2016). HIP: C++ Heterogeneous-Compute Interface for Portability. Accessed: Dec. 14, 2022. [Online]. Available: https://github.com/ROCm-Developer-Tools/HIP
[7] Khronos Group. (2009). OpenCL Overview. Accessed: Dec. 14, 2022. [Online]. Available: https://www.khronos.org/opencl/
[8] Intel. (2016). The oneAPI Level Zero API. Accessed: Dec. 14, 2022. [Online]. Available: https://github.com/oneapi-src/level-zero
[9] Apple. (2014). Metal API. Accessed: Dec. 14, 2022. [Online]. Available: https://developer.apple.com/metal/
[10] Khronos Group. (2016). Vulkan Overview. Accessed: Dec. 14, 2022. [Online]. Available: https://www.khronos.org/vulkan
[11] Khronos Group. (2019). The Industry Open Standard Intermediate Language for Parallel Compute and Graphics. Accessed: Dec. 14, 2022. [Online]. Available: https://www.khronos.org/spir
[12] W. Cochran, J. Cooley, D. Favin, H. Helms, R. Kaenel, W. Lang, G. Maling, D. Nelson, C. Rader, and P. Welch, "What is the fast Fourier transform?" Proc. IEEE, vol. 55, no. 10, pp. 1664–1674, Oct. 1967, doi: 10.1109/PROC.1967.5957.
[13] D. H. Bailey, "FFTs in external or hierarchical memory," in Proc. ACM/IEEE Conf. Supercomputing, Aug. 1989, pp. 234–242, doi: 10.1145/76263.76288.
[14] C. M. Rader, "Discrete Fourier transforms when the number of data samples is prime," Proc. IEEE, vol. 56, no. 6, pp. 1107–1108, Jun. 1968, doi: 10.1109/PROC.1968.6477.
[15] L. Bluestein, "A linear filtering approach to the computation of discrete Fourier transform," IEEE Trans. Audio Electroacoust., vol. AU-18, no. 4, pp. 451–455, Dec. 1970, doi: 10.1109/TAU.1970.1162132.
[16] H. Sorensen, D. Jones, M. Heideman, and C. Burrus, "Real-valued fast Fourier transform algorithms," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, no. 6, pp. 849–863, Jun. 1987, doi: 10.1109/TASSP.1987.1165220.
[17] M. Frigo and S. Johnson, "The design and implementation of FFTW3," Proc. IEEE, vol. 93, no. 2, pp. 216–231, Feb. 2005, doi: 10.1109/JPROC.2004.840301.
[18] J. Makhoul, "A fast cosine transform in one and two dimensions," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, no. 1, pp. 27–34, Feb. 1980, doi: 10.1109/TASSP.1980.1163351.
[19] Z. Wang, "On computing the discrete Fourier and cosine transforms," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33, no. 5, pp. 1341–1344, Oct. 1985, doi: 10.1109/TASSP.1985.1164710.
[20] S.-S. Chan and K.-K. Ho, "Fast algorithms for computing the discrete cosine transform," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 3, pp. 185–190, Mar. 1992, doi: 10.1109/82.127302.
[21] F. Sanglard. (2020). A History of Nvidia Stream Multiprocessor. Accessed: Jul. 2, 2020. [Online]. Available: https://www.fabiensanglard.net/cuda/index.html
[22] X. Mei and X. Chu, "Dissecting GPU memory hierarchy through microbenchmarking," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 1, pp. 72–86, Jan. 2017, doi: 10.1109/TPDS.2016.2549523.
[23] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, "Dissecting the NVIDIA Volta GPU architecture via microbenchmarking," 2018, arXiv:1804.06826.
[24] J. Liu, E. Lindholm, M. Y. Siu, B. W. Coon, and S. F. Oberman, "Operand collector architecture," U.S. Patent 7 834 881 B2, Nov. 16, 2010.
[25] NVIDIA. (2020). NVIDIA Ampere GA102 GPU Architecture Whitepaper. Accessed: Dec. 14, 2022. [Online]. Available: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
[26] AMD. (2021). AMD CDNA 2 Architecture Whitepaper. Accessed: Dec. 14, 2022. [Online]. Available: https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf
[27] AMD. (2019). AMD RDNA Architecture Whitepaper. Accessed: Dec. 14, 2022. [Online]. Available: https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
[28] Intel. (2015). The Compute Architecture of Intel Processor Graphics Gen9. Accessed: Dec. 14, 2022. [Online]. Available: https://www.intel.com/content/dam/develop/external/us/en/documents/the-compute-architecture-of-intel-processor-graphics-gen9-v1d0.pdf
[29] AMD. (2012). OpenCL Performance and Optimization. Accessed: Dec. 14, 2022. [Online]. Available: https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencloptimization.html

DMITRII TOLMACHEV (Graduate Student Member, IEEE) was born in Yekaterinburg, Russia, in 1996. He received the B.S. degree in applied mathematics and physics from the Moscow Institute of Physics and Technology, Moscow, Russia, in 2018, and the M.S. degree in simulation sciences from Rheinisch-Westfälische Technische Hochschule Aachen, Germany, in 2020. He is currently pursuing the Ph.D. degree with the Institute of Geophysics, ETH Zürich, Switzerland. His research interest includes parallel programming in scientific applications.