
GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory

arXiv:1910.01972v2 [cs.MS] 10 Apr 2020

Karel Adámek¹, Sofia Dimoudi², Mike Giles³, and Wesley Armour*¹

¹Oxford e-Research Centre, Department of Engineering Sciences, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, United Kingdom
²Centre for Advanced Instrumentation, Durham University, South Road, Durham, DH1 3LE, United Kingdom
³Mathematical Institute, University of Oxford, Andrew Wiles Building, Radcliffe Observatory Quarter (550), Woodstock Road, Oxford, OX2 6GG, United Kingdom

April 13, 2020

Abstract
We present an implementation of the overlap-and-save method, a method for the con-
volution of very long signals with short response functions, which is tailored to GPUs. We
have implemented several FFT algorithms (using the CUDA programming language) which
exploit GPU shared memory, allowing for GPU-accelerated convolution. We compare our implementation with an implementation of the overlap-and-save algorithm utilizing the NVIDIA FFT library (cuFFT). We demonstrate that by using a shared-memory-based FFT we can achieve significant speed-ups for certain problem sizes and lower the memory requirements of the overlap-and-save method on GPUs.
Keywords — fast convolution, CUDA, GPU, overlap-and-save, FFT

1 Introduction
Convolution is one of the most fundamental signal filtering techniques, widely used in signal processing to aid discovery in many areas of the natural sciences. It is a linear operation involving an input signal s of length Ns and a response function (or filter) h of length M. There are two principal approaches to linear filtering, whose suitability depends on the length of the response function h.
When the filter (h) is short it might be beneficial to calculate the convolution in the time-domain using the formula for discrete convolution

    y[n] = h[k] ⋆ s[n] = Σ_{k=0}^{M−1} s[n − k] h[k] ,    (1)

where y[n] are elements of the filtered signal, and brackets [ ] denote quantities that are discrete (sampled). The complexity of time-domain convolution is O(Ns²).
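As an illustration, eq. (1) can be written directly as a pair of loops. The following NumPy sketch (illustrative only, not the paper's CUDA code) computes the full linear convolution and can be checked against np.convolve:

```python
import numpy as np

def direct_convolution(s, h):
    """Time-domain (linear) convolution following eq. (1):
    y[n] = sum_{k=0}^{M-1} s[n-k] h[k]."""
    Ns, M = len(s), len(h)
    y = np.zeros(Ns + M - 1)
    for n in range(Ns + M - 1):
        for k in range(M):
            if 0 <= n - k < Ns:       # only valid (in-range) signal samples
                y[n] += s[n - k] * h[k]
    return y

s = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, -1.0])
print(direct_convolution(s, h))       # matches np.convolve(s, h)
```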
If we have a longer filter it might be better to invoke the convolution theorem and calculate the convolution in the frequency-domain using a Fourier transformation. The convolution theorem states that [14]

    h[k] ⋆ s[n] = FT⁻¹(H[m] · S[m]) ,    (2)

*E-mail address: [email protected]

where H = FT(h) and S = FT(s) are the Fourier pairs of h and s, and FT and FT⁻¹ are the discrete Fourier transform and its inverse, respectively. By using the Fourier transform in the convolution calculation we are performing circular convolution (as opposed to the linear convolution of eq. 1), which introduces an aliasing effect, where samples at the edges¹ of the input signal are added together, rendering them useless for convolution. Therefore we have to pad both the filter and the input signal with zeros (called zero padding) to the same size of at least Ns + M − 1 samples (0 ≤ m < Ns + M − 1).
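The aliasing described above can be demonstrated with NumPy (a sketch, not the paper's code): without zero padding, the FFT-based product yields a circular convolution whose edge samples differ from the linear result, while padding both operands to Ns + M − 1 recovers the linear convolution exactly.

```python
import numpy as np

s = np.random.default_rng(0).standard_normal(16)
h = np.array([0.25, 0.5, 0.25])
Ns, M = len(s), len(h)

# Without padding: circular convolution, the edge samples are aliased.
circular = np.fft.ifft(np.fft.fft(s) * np.fft.fft(h, Ns)).real

# With zero padding to Ns + M - 1: linear convolution.
N = Ns + M - 1
linear = np.fft.ifft(np.fft.fft(s, N) * np.fft.fft(h, N)).real

print(np.allclose(linear, np.convolve(s, h)))   # True: padded result is linear
```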
The convolution theorem allows us to replace convolution in the time-domain by point-wise multiplication in the frequency-domain. This, however, would not be computationally feasible without the Fast Fourier Transform (FFT) algorithm, which decreases the cost of the discrete Fourier transform to O(Ns log₂(Ns)). Using the FFT algorithm and the convolution theorem to perform convolutions is often called fast convolution.
Determining when to use time-domain convolution as opposed to frequency-domain convolution depends on many factors, including the character of the problem being solved, the implementation, the hardware used, etc.
As mentioned above, frequency-domain convolution requires that the input signal and the
filter are both of the same length. To calculate the convolution of a long input signal in the
frequency-domain, we have to perform long FFTs on both. This can be very inefficient in terms of
computations and memory storage, particularly if we are applying multiple filters. Two commonly used algorithms to overcome these shortcomings are the overlap-and-save (OLS) and overlap-and-add (OLA) [19] methods.
The overlap-and-save (or overlap-and-add) method is a hybrid which combines the advantages of time-domain convolution with those of frequency-domain convolution. It allows us to break the input signal into segments
of length N and use fast convolution independently on each segment. The two methods differ in
the way they deal with aliased samples and how the output is constructed. The overlap-and-save
method discards the aliased samples from each segment and saves only the correct part of the
segment to an appropriate place in the output signal. The overlap-and-add method adds together
aliased samples from the neighboring segments to create the correct output. Therefore a parallel
implementation of the overlap-and-add method requires exclusive access to the areas of memory
that contain the aliased output signal.
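To illustrate the overlap-and-add variant discussed above, the following NumPy sketch (illustrative CPU code, not the paper's GPU implementation) convolves each segment of L input samples via length-N FFTs and adds the M − 1 aliased tail samples of each block into its neighbour's region:

```python
import numpy as np

def overlap_add(s, h, N):
    """Overlap-and-add sketch: segments of L = N - M + 1 input samples,
    each linearly convolved via length-N FFTs; the M - 1 tail samples of
    each block are *added* into the region of the following block."""
    M = len(h)
    L = N - M + 1
    H = np.fft.fft(h, N)
    y = np.zeros(len(s) + M - 1)
    for start in range(0, len(s), L):
        seg = s[start:start + L]
        block = np.fft.ifft(np.fft.fft(seg, N) * H).real
        n = min(N, len(y) - start)
        y[start:start + n] += block[:n]   # overlapping adds resolve aliasing
    return y

rng = np.random.default_rng(1)
s = rng.standard_normal(100)
h = rng.standard_normal(8)
print(np.allclose(overlap_add(s, h, 32), np.convolve(s, h)))   # True
```

The `y[...] += block[...]` line is exactly the step that, in a parallel implementation, would require exclusive access to the overlapping output regions.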
The fast convolution, which is performed on each segment, has four steps: forward FFT of
a segment; point-wise complex multiplication of the filter and the segment in frequency-domain;
inverse FFT of the convolved segment; and rejection of the edges. These steps are traditionally
performed using libraries or custom code, with the input and output stored in the GPU device memory² for each step. This is a limiting factor when considering the convolution of the segment
as a whole.
The novelty of this work and its focus is to enable fast convolution by exploiting the fastest
areas of GPU memory, registers and shared memory. To do this we needed to write FFT codes
that will operate directly on data stored in shared memory (NVIDIA library functions do not do
this). Using these codes we are able to perform the convolution and the associated forward and
inverse FFT on data held in the fastest areas of GPU memory and hence accelerate the convolution
algorithm on GPUs. Specifically, we can eliminate expensive access to the device (global) memory
which is otherwise required. With this goal in mind, we have implemented a basic version of
the Cooley-Tukey FFT algorithm [11] for complex-to-complex FFTs and a basic version of the
Stockham FFT algorithm [3] for real-to-complex and complex-to-real FFTs. We have implemented
¹ This depends on the character of the filter used. Filters that use only future samples will be aliased with the end of the segment, filters that use past samples will be aliased with the beginning of the segment, while a time-centred filter introduces aliasing at both ends. The number of aliased samples is equal to the unpadded length M of the filter.
² Device memory (sometimes called main memory or global memory) has the lowest memory bandwidth on the GPU and as such takes the most time to access.

these FFT algorithms so that they can execute on data held in shared memory³. The purpose of this work is to demonstrate the viability of our approach of moving operations into GPU kernels using device-ready algorithms. The choice of the optimal FFT algorithm and the implementation of optimized and efficient FFT algorithms on GPUs are beyond the scope of this work but will serve as a focus of our future work.
We have chosen to focus only on the overlap-and-save method rather than on the overlap-and-
add method because the overlap-and-add method would require a synchronization step between
segments due to a race condition that would occur when neighbouring segments try to write their
computed data to the output signal stored in GPU device memory.
The work presented in this paper was developed for NVIDIA GPUs, therefore we have used
the CUDA language extension for our work. The investigation of OpenCL or any other framework
is outside the scope of this work. This work has been used to enable real-time processing of
time-domain radio astronomy data [2, 4, 5].
Our GPU implementation of the overlap-and-save method with a basic user interface is available on GitHub⁴. The user interface we provide allows the user to test the functionality of our implementation. A more detailed description is provided on our GitHub wiki.

2 Related work
A comprehensive study of convolution algorithms on CPUs, GPUs and FPGAs was conducted by Fowers et al. [8]. They compared convolution algorithms by their computational cost, energy efficiency and execution time for a range of input signal sizes and filter lengths. Their
investigation shows that the time-domain convolution is faster for either short filters or short in-
put signals. For longer input signals and longer filters, it is beneficial to use the overlap-and-save
method. The performance of the NVIDIA cuDNN library, in the context of convolutional neural
networks, was investigated by Jordà et al. [12]. The authors present different algorithms used
by the cuDNN to calculate two-dimensional convolution. Although this is for two-dimensional
convolutions it shows the advantage of frequency-domain convolution for larger filters and input
signals.
Both the overlap-and-save (OLS) and FFT algorithms are well known and extensively researched, with broad coverage in the literature. Both the OLS and OLA methods have been implemented on GPUs [6, 13]. The theory of these methods is also actively developed; see for example [7, 16, 25] and references therein.
The FFT algorithm and its implementation on GPUs is equally well researched and extensive
publications can be found on the subject, for example [9, 10, 15, 22, 23, 26]. Govindaraju et al.
[9] focused on providing a set of FFT routines which would be applicable to a wide range of input
signal lengths. The authors have used the Stockham algorithm to avoid reordering of the elements
which is required when the Cooley-Tukey algorithm is used. Gutierrez et al. [10] deal with longer FFTs from the host perspective, with an emphasis on long input signals. They implemented the decimation-in-time Cooley-Tukey algorithm, where part of the FFT is performed in shared memory. Moreland and Angel [15] described the implementation of a two-dimensional real-to-real FFT algorithm for image processing. More on FFTs in general can be found in [21].
There are also a number of GPU FFT source codes available [22, 23, 26]. However, these FFT codes were not suited to our needs for integration into the overlap-and-save method. The primary reason for this is that these FFT codes were not designed as device callable functions.
The FFT by Volkov and Kazian [23] stores larger FFTs (16 elements or more) using thread registers. Our implementation of convolution uses registers to store the values of the signal segment and the current filter value; further register utilization would lead to code slowdown.
The FFT code by Vasilache et al. [22] focuses on FFT lengths that are too small for our
intentions. The FFT length considered in the article is N < 256. We require our implementation
³ Shared memory is a small but fast area of GPU memory and can be treated as a user-managed cache.
⁴ https://github.com/KAdamek/GPU_Overlap-and-save_convolution

to work with the largest filters permitted by either shared memory⁵ or the number of active threads per thread-block. For example, a filter of 512 elements requires an FFT length of at least 1024 elements.
Lastly the FFT code by Yang and Zhou [26] was written for the Fermi generation of GPUs
and has not been updated for more modern GPU architectures.
Our FFT implementation differs from the previously published works because it is designed to use shared memory only and to be called from within the GPU kernel itself. Therefore it deals only with short FFT lengths, due to the size limitation of the shared memory (currently N ≤ 4096, where N is a power of two). Moreover, our implementation of the Cooley-Tukey FFT algorithm cannot be used as a standalone FFT routine as it lacks the element reordering step, which is not required for the calculation of the convolution.

3 Implementation
We present our implementation of the overlap-and-save (OLS) method for NVIDIA GPUs using
the CUDA programming language which uses a shared memory implementation of standard FFT
algorithms to calculate one-dimensional convolutions. Our implementation of the OLS method can
calculate complex-to-complex⁶ (C2C) and real-to-real (R2R) convolutions. These implementations
are compared to an implementation of (direct) convolution that uses the NVIDIA cuDNN library
and also to an implementation of the OLS method which uses the NVIDIA cuFFT library to
perform the FFT parts of the OLS algorithm on the GPU.
In this section we describe all implementations used in this article starting with the NVIDIA
cuDNN library [17] implementation of convolution. Next, we describe the overlap-and-save method
and its implementation using the NVIDIA cuFFT library [18] (cuFFT OLS) which contains highly
optimized and GPU ported FFT algorithms. Our implementation of the OLS method with shared
memory FFT (SM-OLS) is described last.

3.1 Convolution via NVIDIA cuDNN


The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of deep
neural networks primitives. The cuDNN library offers (among many other routines) forward
convolution, which we have used as a comparison.
Our cuDNN convolution implementation is real-to-real. The cuDNN library uses a range of different algorithms based on the task and the size of the input. We have left the cuDNN library to choose the most suitable convolution algorithm for our test case by using the flag CUDNN_CONVOLUTION_FWD_PREFER_FASTEST. Our tests are performed with one-dimensional data with a single channel⁷, therefore we have used the CUDNN_TENSOR_NCHW data layout. Since we cannot be sure how many operations are performed by the cuDNN library, we have not calculated the number of FLOPs for the cuDNN convolution implementation in our comparisons; instead we use the number of processed elements per second.

3.2 Overlap-and-save Method


We will first describe the common steps of the OLS method which are performed by all imple-
mentations. These steps apply to both C2C and R2R convolutions since both are performed in
the Fourier domain which is complex.
A flow diagram of the overlap-and-save algorithm is shown in figure 1 and the method is represented pictorially in figure 2. We begin by separating the input signal of size S into Nseg independent segments; all subsequent operations are then applied independently to each and every
⁵ The size of shared memory ultimately limits the size of a signal segment that can be processed in our method.
⁶ Depending on the post-processing step this might be the complex-to-real convolution as well.
⁷ Channels in the context of the cuDNN library are equivalent to the number of elements per structure in array-of-structures vs structure-of-arrays data layouts. Since we have a simple data layout, we have used the equivalent of structure-of-arrays.

segment. Next a forward FFT is applied to each segment. What follows is the frequency domain
convolution of the segment A with every filter f from Nfil filters, that is complex multiplication of
the segment with one or more filters. After that, we apply an inverse FFT to the results and then
discard the aliased edges of each block, recombining the samples from all blocks into the output.
Optionally, we can apply some post-processing to the resulting output. In essence, this operation
transforms a blocked circular convolution into one which is linear and continuous.
In the overlap-and-save technique, (shown in figure 2) the length of the segment, that is the
FFT length, N must be chosen such that it minimizes the fraction of discarded samples compared
to the segment length. The number of discarded samples depends on the filter length M being applied to the signal and is equal to M − 1. Thus the number of correct (unaliased)
samples in the segment is L = N − M + 1. A higher fraction of discarded samples increases the
overall number of segments required by the OLS method. To ensure good performance of the FFT
algorithm on a signal segment, we limit the segment length N to powers of two. The lengths of the segments from which we combine the convolved signal can be different
for each implementation. The cuFFT-OLS performs better with longer segments while SM-OLS
performs better with a shorter segment length. The convolved signal is not affected by the choice
of the segment size.
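The segmenting scheme above (L = N − M + 1 valid samples per length-N block, with the M − 1 aliased samples discarded) can be sketched in NumPy as follows. This is an illustrative CPU reference under the assumption of a causal filter, not the paper's CUDA implementation:

```python
import numpy as np

def overlap_save(s, h, N):
    """Overlap-and-save sketch for a causal filter: consecutive length-N
    blocks overlap by M - 1 samples; after the inverse FFT the first M - 1
    (aliased) samples of each block are discarded, keeping L = N - M + 1
    valid output samples per block."""
    M = len(h)
    L = N - M + 1
    H = np.fft.fft(h, N)
    # Prepend M - 1 zeros so the first block's discarded samples are dummies.
    x = np.concatenate([np.zeros(M - 1), s])
    out = []
    for start in range(0, len(s), L):
        block = x[start:start + N]
        block = np.pad(block, (0, N - len(block)))   # last block may be short
        conv = np.fft.ifft(np.fft.fft(block) * H).real
        out.append(conv[M - 1:])                     # discard aliased samples
    return np.concatenate(out)[:len(s)]

rng = np.random.default_rng(2)
s = rng.standard_normal(1000)
h = rng.standard_normal(17)
print(np.allclose(overlap_save(s, h, 64), np.convolve(s, h)[:len(s)]))  # True
```

Note that, unlike overlap-and-add, each block writes a disjoint region of the output, which is what makes the method attractive for parallel GPU execution.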

[Figure 1 flow diagram, transcribed:]
Start: Input – data to be convolved, set of filters
→ Forward FFT of a padded filter (for all filters)
Custom FFT convolution kernel (overlap-save convolution algorithm):
→ Forward FFT of a padded segment of the input
→ Point-wise multiplication of the FFT'd segment with all filters
→ Backward FFT of the convolved segment
→ Remove aliased samples of the segment and save the segment in the appropriate place in the output plane
Optional: Post-processing
End: Output – convolved output plane

Figure 1: Flow diagram of the overlap-and-save method: Our input is a signal which is to be
convolved with a set of filters. The first step is to Fourier transform the padded filters. These will
be used for convolution with each segment. In the next step, we separate the input signal into independent overlapping segments; the total overlap length for each segment is equal to the filter length. These segments need to be Fourier transformed. The third step is convolution in the form
of complex point-wise multiplication. The convolved segment is then inverse Fourier transformed.
In the last step we remove the aliased part of each segment and merge the clean parts to produce
a continuous output. Optionally we can perform a post-processing step at the end.

3.3 OLS method using cuFFT library


Using the cuFFT library, we have implemented one-dimensional convolution via the OLS method (cuFFT-OLS) for two variants of input data: complex-to-complex and real-to-real convolutions. The pseudo-code for both variants of cuFFT-OLS is shown in algorithm 1.
[Figure 2 diagram: the filter h (length M) is padded and Fourier transformed to H; the input signal is split into overlapping segments a, b, c, … of length N with L valid samples each; every segment is Fourier transformed (A, B, C, …), multiplied point-wise with H, inverse transformed (FT⁻¹{A·H}, FT⁻¹{B·H}, FT⁻¹{C·H}), the aliased bins are discarded, and the remaining parts Out(a), Out(b), Out(c) are assembled into the output signal.]

Figure 2: Overlap-and-save method: The input signal, of length Ns , is separated into overlapping
segments (A,B,C,...), where the amount of overlap is given by the filter length M . These segments
are then processed independently, where FT denotes Fourier transformation. At the end, the aliased samples of each segment, equal in number to the filter length, are discarded. This example uses a time-centred filter, which aliases both ends of the segment.

These variants only differ in the type of FFT used for the forward and the backward Fourier transforms; both use the cuFFT library to perform the FFT routines.
The most efficient way to implement cuFFT-OLS is to utilize a feature of the cuFFT library
called callbacks. The cuFFT callbacks allow the user per-element access to the data which is loaded or stored by the cuFFT routine, and allow the user to perform pre- or post-processing of the data without any additional GPU kernels.
We can use callbacks together with the forward FFT to perform frequency domain convolution
(complex multiplication of the sample segment with the appropriate sample from multiple filters)
and also with the inverse FFT where we can remove the aliased samples from the segment. While
the latter eliminates problematic global memory access, the former callback has less effect.
The callback used together with the inverse FFT means we do not need to store intermediate
segments with aliased samples into the device memory. This is a significant bandwidth saving
since the intermediate result is of size Nseg SF and it would have to be written to device memory (the output from cuFFT), then read so that the aliased samples can be removed, and then stored as the final (corrected) output.
The forward FFT callback eliminates only a proportionally small amount of device memory traffic, the access to the segments after the forward FFT. The main device memory access, which stores the result of the frequency domain multiplication, remains intact. Therefore the impact of this callback is marginal for a large number of filters.
The cuFFT library also allows the user to use some shared memory. The amount is, however, limited to 16 kB, which can accommodate only 2048 FFT elements, while the optimal FFT length for the

Algorithm 1: Pseudo-code for the cuFFT-OLS implementation. For the input we have input data x and a set of filters f. The output is the convolved result y. The FFT routines (ForwardFFT and InverseFFT) are either C2C for complex input, or R2C and C2R respectively for real input.
Input: x, f;
Output: y;
// Forward FFT of the filters
F = ForwardFFT(f);
// Separation of the signal into segments a, b, c, ...
(a, b, c, ...) = Separate(x);
// Forward FFT of the individual segments
(A, B, C, ...) = ForwardFFT(a, b, c, ...);
Callback begin
    // Per-element complex multiplication of each segment with the Nfil filters;
    // the product for filter r is stored in Y[r] so that one filter does not
    // overwrite the input of the next
    for s = 0 to Nseg do
        for r = 0 to Nfil do
            Y[r][s] = A[s] × F[r][s];
        end
    end
end
(a, b, c, ...) = InverseFFT(Y);
Callback begin
    y = RemoveAliasedSamples(a, b, c, ...);
end

cuFFT library is 8192 elements. Furthermore, this does not allow us to use both the forward and the backward transforms in this way, and as such does not remove the problematic device memory accesses.
A disadvantage of the cuFFT-OLS implementation is that it has to load and store intermediate data in device memory between the frequency domain convolution (forward FFT step) and the inverse FFT. Another disadvantage is higher memory requirements, as the last
step (where we remove the aliased samples of the segments) cannot be performed in-place due to
the non-deterministic nature of thread-block scheduling on GPUs. The advantage of the cuFFT
implementation is that it works for any filter length and only relies on NVIDIA supported libraries.

3.4 OLS method using shared memory FFT


We present two versions of the one-dimensional overlap-and-save (OLS) method, performed in shared memory on NVIDIA GPUs using the CUDA programming language. The first
implementation of OLS is for complex-to-complex⁸ (C2C) convolutions, using a shared memory
implementation of the Cooley-Tukey [11] FFT algorithm. The second implementation of the
OLS method is for real-to-real (R2R) convolutions. This implementation uses a shared memory
implementation of the Stockham FFT algorithm [3]. Our shared memory implementation of the
OLS method follows the same steps as the cuFFT-OLS implementation, but has a significant
difference: it incorporates all the steps required by the OLS method into one GPU kernel. This
is possible because we can call forward and inverse FFT device functions directly from the GPU
kernel, which eliminates the computationally costly device memory transactions, working instead
on data held in shared memory and GPU registers. The pseudo-code for our shared memory OLS
method is presented in algorithm 2.
In our implementation of convolution through the OLS method in shared memory, each thread-
block⁹ is assigned to one segment of the input data. Each thread-block applies a shared memory
⁸ Depending on the post-processing step this might be a complex-to-real convolution as well.
⁹ A thread-block is a set of GPU threads which execute the same code and can cooperate using shared memory.

forward FFT and stores segment samples, which are now in the frequency domain, into registers.
Each thread from the thread-block works with four samples. These segment samples are reused
throughout the execution of the thread-block. Stored segment samples are then complex multiplied
with appropriate samples from one or more filters. These filters are already in the frequency
domain since they were Fourier transformed before thread-block execution. When the complex
multiplication step is finished, the resulting samples are brought back to the time domain by
applying an inverse FFT in shared memory and aliased samples are removed before storing them
to the device memory. This ensures high data reuse of both segment and filter samples.

Algorithm 2: Pseudo-code for the shared memory OLS implementation. For the input we have input data x and a set of filters f. The output is the convolved result y. The shared memory FFT functions (ForwardFFT and InverseFFT) are either the Cooley-Tukey C2C FFT for complex input, or the Stockham R2C and C2R FFTs respectively for real input.
Input: x, f;
Output: y;
t = threadId;
b = blockId;
// Forward FFT of the filters
F = ForwardFFT(f);
// Each thread-block processes one segment
GPU kernel begin
    // Read the signal segment
    a[t] = x[b·NSeg + t];
    // Forward FFT of the segment
    A = ForwardFFT(a);
    // Per-element complex multiplication of the segment with the Nfil filters;
    // the product is placed in B so that A can be reused for every filter
    for r = 0 to Nfil do
        B[t] = A[t] × F[r][t];
        c = InverseFFT(B);
        y[r] = RemoveAliasedSamples(c);
    end
end

We have chosen different FFT algorithms for C2C and R2R OLS implementations. The Cooley-
Tukey FFT algorithm is more suited to complex-to-complex convolutions because we can use the
fact that, for a point-wise frequency domain convolution, the order of the data elements in the
convolved arrays does not matter as long as the order of the elements is the same for both the
input signal segment and the filter, provided that the inverse FFT can work with the same order
of elements. In normal circumstances, the Cooley-Tukey FFT algorithm requires a reordering to
take place on the input or output data, but when used in convolution we can forgo this step and
save some execution time.
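The order-invariance argument can be demonstrated in NumPy (a sketch, not the paper's kernel code). Here an explicit bit-reversal permutation stands in for a Cooley-Tukey FFT whose reordering pass is skipped: as long as the signal and filter spectra are scrambled identically, the point-wise product, once put back into the order the inverse FFT expects, yields the correct convolution.

```python
import numpy as np

def bit_reverse_perm(n):
    """Index permutation produced by an in-place radix-2 Cooley-Tukey FFT
    when the reordering pass is skipped (n must be a power of two)."""
    bits = n.bit_length() - 1
    return np.array([int(format(i, f'0{bits}b')[::-1], 2) for i in range(n)])

N = 16
rng = np.random.default_rng(3)
s = rng.standard_normal(N)
h = rng.standard_normal(N)

p = bit_reverse_perm(N)
S_scrambled = np.fft.fft(s)[p]   # stand-in for a no-reorder forward FFT
H_scrambled = np.fft.fft(h)[p]   # filter scrambled in the SAME order

# Point-wise product in scrambled order, un-scrambled before the inverse FFT:
prod = np.empty(N, dtype=complex)
prod[p] = S_scrambled * H_scrambled
y = np.fft.ifft(prod).real

print(np.allclose(y, np.fft.ifft(np.fft.fft(s) * np.fft.fft(h)).real))  # True
```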
The Stockham FFT algorithm is used in order to facilitate real-to-complex and complex-to-real Fourier transformations [19]; these require that the elements of the input and output of the FFT algorithm are in the correct order. The Stockham FFT algorithm is an auto-sort algorithm which
satisfies this condition. Our shared memory implementation of the Stockham FFT algorithm
is 30% slower on average than our shared memory implementation of the Cooley-Tukey FFT
algorithm without the reordering step. This performance penalty is redeemed by the fact that for
real-to-complex and complex-to-real Fourier transformations we can use an FFT length of half the
size (compared to a C2C FFT) as described in [19].
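The halving of the transform size for real data can be seen with NumPy's R2C/C2R routines, which stand in here for the paper's Stockham shared memory FFT (an illustrative sketch): a real length-N input needs only N/2 + 1 complex bins, and the R2C/C2R path reproduces the full C2C circular convolution.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 64
s = rng.standard_normal(N)
h = rng.standard_normal(N)

# R2C transform: only N/2 + 1 complex bins carry information for real input.
S, H = np.fft.rfft(s), np.fft.rfft(h)
print(len(S))                                      # 33 == N // 2 + 1

y_real = np.fft.irfft(S * H, N)                    # R2C/C2R path
y_c2c = np.fft.ifft(np.fft.fft(s) * np.fft.fft(h)).real   # full C2C path
print(np.allclose(y_real, y_c2c))                  # True
```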
The benefit of having one GPU kernel is not only the elimination of device memory accesses; it also lowers memory requirements, because we do not need to store intermediate results as with the cuFFT-OLS implementation. The disadvantage of this approach is that it works well only for small filter sizes, M ≲ 3300 (for the Titan V GPU). This limitation is imposed by the size of the GPU shared memory.
Analysis of the SM-OLS GPU kernel reveals that it is limited by the shared memory bandwidth. For the R2R version the kernel utilizes around 75% of the shared memory bandwidth; the utilization is lower (50%) for a segment size of 4096 elements. For the C2C version, the utilization of the shared memory bandwidth is 50%. This is in part because, for the first few iterations of the FFT routine, we use shuffle instructions, which are not reflected in the shared memory bandwidth utilization. The use of the shuffle instructions, however, increases the utilization of the load-store instructions, which is also high. The floating point (FP32) compute utilization is also high. The occupancy, the ratio of active threads per streaming multiprocessor (SM) to the maximum number of active threads per SM, is only 50%. This is a consequence of the high register count used by the convolution kernel. The GPU registers are used to store the signal segment elements after the forward Fourier transform, which are reused, and also to store the currently processed signal segment which undergoes the inverse Fourier transformation. The device memory bandwidth utilization ranges from 60% down to 25% for longer signal segments. The situation is similar for GPU kernels with non-local post-processing.

4 Results
For our investigation we have used three NVIDIA GPU cards: the P100, the P4 and the Titan V (hardware specifications can be found in table 1).

Table 1: GPU card specifications. The shared memory bandwidth is calculated as BW(bytes/s) = (bank bandwidth (bytes)) × (clock frequency (Hz)) × (32 banks) × (# multiprocessors). We have used CUDA version 10.0.130 and cuDNN version 7.5.0.

                                      P100           P4             TITAN V
  CUDA Cores                          3584           2560           5120
  SMs                                 56             20             80
  Base/Max Core Clock                 1126/1303 MHz  810/1063 MHz   1220/1455 MHz
  Memory Clock                        1406 MHz       6000 MHz       850 MHz
  Global memory bandwidth             720 GB/s       192 GB/s       652 GB/s
  Shared memory bandwidth             9121 GB/s      2657 GB/s      14550 GB/s
  Memory size                         16 GB          8 GB           12 GB
  TDP                                 250 W          75 W           250 W
  Max. shared memory per thread-block 48 kB          48 kB          48/96 kB

We have compared both shared memory implementations (C2C, R2R) of OLS convolution (SM-OLS), for several different filter and signal lengths and for a varying number of filters, with convolution implementations based on the cuDNN library and with our implementation of the OLS method which uses cuFFT (cuFFT-OLS). For our results presented here, we have chosen to
limit the input signal length to 2 million points or the number of filters to 8 (unless otherwise
stated), in order to include the P4 GPU in our comparisons. The reason for this is that the P4
GPU card has a smaller device memory capacity and as such cannot process the same problem
sizes as the P100 GPU or TitanV GPU.
The input signal length, the filter length, and the number of filters in our implementation can be arbitrary; they are not limited to the presented values. We have chosen the values of these quantities to present the scaling behaviour of the problem. The input signal length is arbitrary by the nature of the OLS method and limited only by available memory. The filter length is arbitrary, but in the case of the shared memory OLS it is limited by the maximum size of the FFT which can be processed (currently N = 4096 points). The number of filters is also arbitrary and limited only by available memory.

First, we have compared against convolution without the OLS method using cuFFT. Although OLS is a well-established method, this comparison shows how ineffective standard convolution through the frequency domain can be for the case of convolution with multiple small filters. The execution time for convolution without OLS is presented in figure 3.
[Figure 3 plot: execution time [ms] vs signal length [samples] for no OLS (cuFFT), SM-OLS and cuFFT-OLS.]

Figure 3: Comparison of the execution time of convolution without OLS method using cuFFT,
convolution via OLS method using cuFFT and convolution via custom FFT in shared memory.
Results are for 8 filters of length 64 on TITAN V.

4.1 Comparison with cuDNN library convolution


We begin by comparing the one-dimensional real-to-real SM-OLS convolution with one-dimensional
real-to-real convolution via the cuDNN library. The execution time for different input signal lengths and different numbers of filters is shown in Figure 4. The speed-up factor for the same configurations is shown in Figure 5.

4.2 Comparison with cuFFT OLS convolution


Next we present results for the comparison of complex-to-complex (C2C) Fourier-domain convolution
implementations. The execution time and the number of elements processed per second, vs the
number of filters and vs the input signal length, are presented in Figure 6.
The speed-up factors for different filter lengths, vs the number of filters and vs the input signal
length, are presented in Figure 7. Furthermore, speedups for signal lengths other than 2M samples
are shown in Figure 8.
The cuFFT-OLS convolution performs best with segment size N = 8192 for most of the filter
sizes we have investigated. The best performing segment size for the SM-OLS convolution varies,
because our FFT implementation performs better at smaller FFT lengths. Figure 9 shows how
the performance of the SM-OLS convolution depends on the chosen FFT length (on the TitanV
GPU). Smaller FFT sizes become less effective as the filter length grows, because the aliased part
of the segment becomes a larger fraction of the overall FFT size and more segments are necessary
to calculate the OLS convolution.
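This trade-off can be quantified directly: each segment of size N yields only N − M + 1 uncontaminated output samples, so the segment count (and hence the total FFT work) grows as the aliased fraction (M − 1)/N grows. A small illustrative helper (the names are ours, not the paper's):

```python
import math

def ols_cost(signal_len, filter_len, fft_len):
    """Segments needed and aliased fraction for a given OLS segment size."""
    valid = fft_len - (filter_len - 1)   # uncontaminated output samples per segment
    if valid <= 0:
        return None                      # segment too small for this filter
    segments = math.ceil(signal_len / valid)
    aliased_fraction = (filter_len - 1) / fft_len
    return segments, aliased_fraction

# e.g. for a 2M-sample signal and a 1025-sample filter:
# with N = 2048 only half of each segment is usable,
# while with N = 4096 three quarters of each segment is usable.
```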
The comparison of real-to-real SM-OLS with cuFFT-OLS is similar. The execution time and the
number of elements processed per second, vs the number of filters and vs the input signal length,
are shown in Figure 10.
The speed-up factors for different filter lengths, vs the number of filters and vs the input signal
length for a 2M-sample signal, are presented in Figure 11. Speedups for signal lengths other than
2M samples are shown in Figure 12.


Figure 4: The execution time of the R2R convolution (left) and the number of elements processed
per second (right) via cuDNN (gray) and shared memory OLS (black) for different input signal
lengths (top) and different numbers of filters (bottom).


Figure 5: The speedup factors of the R2R SM-OLS convolution with respect to the cuDNN
convolution for different input signal lengths (left) and different numbers of filters (right).


Figure 6: The execution time of the C2C convolution (left) and the number of elements processed
per second (right) of the SM-OLS convolution (black) and the cuFFT-OLS convolution (gray) for
different numbers of filters (top) and increasing input signal length (bottom).


Figure 7: The speed-up of the C2C SM-OLS convolution with respect to the C2C cuFFT-OLS
convolution implementation for different filter lengths vs the number of filters (left), and vs the
signal length (right).


Figure 8: The speed-up of the C2C SM-OLS convolution with respect to the C2C cuFFT-OLS
convolution implementation for different filter lengths vs the number of filters, for signal lengths
of 250k and 8M samples. The number of filters is limited by the amount of device memory the
GPU has, which is why there are missing points for the P4 GPU (8 GB) and the TitanV GPU (12 GB).


Figure 9: The execution time of the SM-OLS convolution vs filter length for different segment
(FFT) sizes. The execution time of the cuFFT-OLS convolution is added for comparison.


Figure 10: The execution time of the R2R convolution (left) and the number of elements processed
per second (right) of the SM-OLS convolution (black) and the cuFFT-OLS convolution (gray) for
different numbers of filters (top) and increasing input signal length (bottom).


Figure 11: The speed-up of the R2R SM-OLS convolution with respect to the R2R cuFFT-OLS
convolution implementation for different filter lengths vs the number of filters (left), and vs the
signal length (right).


Figure 12: The speed-up of the R2R SM-OLS convolution with respect to the R2R cuFFT-OLS
convolution implementation for different filter lengths vs the number of filters, for signal lengths
of 250k and 8M samples. The number of filters is limited by the amount of device memory the
GPU has, which is why there are missing points for the P4 GPU (8 GB) and the TitanV GPU (12 GB).

4.3 Non-local Post-processing


The advantage of the SM-OLS method is that it has access to all output elements of a given
segment. This allows us to perform, in addition to per-element post-processing (for example, the
calculation of the power spectrum), non-local post-processing as well (for example, a numerical
derivative or interpolation). Non-local post-processing of the output data requires access to the
immediate or extended neighborhood of the element being processed, whereas the cuFFT-OLS
method with callbacks offers only limited capabilities when an output element needs the values of
neighboring elements. The speedups achieved with non-local post-processing are shown in Figure
13, where we have calculated the derivative of the convolved signal. We chose the derivative because
it does not require a larger memory footprint for the output, so the amount of data which needs
to be transferred to and from device memory remains the same.
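A non-local post-processing step of the kind described above can be sketched as a central-difference derivative applied to a convolved segment. This is a CPU reference in NumPy; the boundary stencils are an illustrative choice of ours, not a detail taken from the GPU implementation:

```python
import numpy as np

def central_difference(y, dx=1.0):
    """Numerical derivative of a convolved segment via central differences.

    Non-local post-processing: each output sample needs its immediate
    neighbours, which is cheap when the whole segment is already resident
    in shared memory. One-sided differences are used at the edges so the
    output has the same length as the input.
    """
    d = np.empty_like(y, dtype=float)
    d[1:-1] = (y[2:] - y[:-2]) / (2.0 * dx)   # interior: central stencil
    d[0] = (y[1] - y[0]) / dx                 # forward difference at left edge
    d[-1] = (y[-1] - y[-2]) / dx              # backward difference at right edge
    return d
```

The central stencil is exact for quadratics, which makes it a convenient smoke test for the post-processing stage.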

4.4 PCI-e latencies


The SM-OLS convolution implementation presented here is most efficient when used as part of a
larger signal processing/data reduction pipeline. If run independently, its execution time, which
would then include PCI-e transfer times, would be dominated by the time taken to transfer the
output data to the host. Further on-device processing of the convolution output (such as peak
finding or candidate selection) would reduce the amount of output data transferred to the host to
the point where the transfer could be hidden by the computations10 .

5 Discussion
The main source of the speedup for our shared memory OLS implementation of one-dimensional
convolution is the elimination of device memory accesses during the convolution step of the OLS
method. If every other aspect of the computation in SM-OLS and cuFFT-OLS were equal, eliminating
these device memory accesses would result in a constant speedup for all filter lengths, numbers of
filters, and signal lengths, since the only difference between the two cases would be the per-sample
device memory accesses that were not performed. In real calculations, many other effects influence
the speedup of our shared memory implementation of OLS convolution. The primary effect is
governed by the segment size N , which must be set so that the number of aliased samples, given
by the filter length M , is proportionally small
10 Using, for example, CUDA Streams.


Figure 13: The speed-up of the SM-OLS over cuFFT-OLS when non-local post-processing is
included. The speed-up for C2C convolution is shown at the top and for R2R convolution at the
bottom.

compared to the number of uncontaminated output samples contained in the output segment. The
segment size in the cuFFT-OLS convolution implementation is not limited to any particular value.
This is not true for our implementation of the SM-OLS convolution, which is limited to a segment
length of 4096 samples. This limitation is imposed by the size of the shared memory and by the
number of samples we are able to process per thread.
If we fix the segment size N , then any increase in filter length decreases the number of correct
output samples per segment, so more segments are required to calculate the whole convolution.
The effect of this can be observed in Figure 9, where the black lines show the execution time of
the SM-OLS implementation at fixed segment (FFT) sizes. Figure 9 shows that each segment size
is optimal only for a limited range of filter lengths, after which it is better to switch to a larger
segment size. Since our implementation of the SM-OLS convolution is limited to segment size
N = 4096, we cannot move to a longer segment once the number of correct samples per segment
drops below a certain limit; at that point cuFFT-OLS becomes the better performing
implementation.
The caching of filters is also governed by the segment size. The filter length in the frequency
domain equals the segment size N , so by increasing the segment size we decrease the number of
filters which can be held in the GPU's fixed-size cache at any instant.

5.1 Comparison with cuDNN convolution


Figure 4 compares the execution time of our SM-OLS convolution implementation with that of
convolution via the cuDNN library. The execution time scales linearly with the input signal length,
as shown on the left of Figure 4. Different scaling can be seen as the number of filters increases:
our SM-OLS implementation scales linearly, whereas cuDNN has a constant execution time until
the number of filters reaches 32, at which point it too scales linearly. This is due to under-utilisation
of GPU resources, most probably caused by a work distribution which favours larger numbers of
filters.

Figure 5 shows the speedup factors of the SM-OLS convolution implementation over the cuDNN
convolution. The speedup factors for different filter lengths (different line types) vs the signal
length (left of Figure 5) show that both implementations scale at the same rate as the signal
length increases. Figure 5 indicates that the cuDNN library is optimised for small filter lengths,
since convolutions with smaller filters yield lower speedups; the reverse is true when SM-OLS
convolution is compared to cuFFT-OLS convolution. The high speedups shown on the right of
Figure 5 are due to poor scaling of the cuDNN library for fewer than 32 filters.

5.2 Comparison with cuFFT convolution


The execution time, shown for C2C convolutions in Figure 6 and for R2R convolutions in Figure
10, scales linearly with an increasing number of filters and increasing input signal length. Both
implementations achieve roughly constant performance, in elements processed per second, beyond
16 filters or a signal length of two million samples.

The speedup factors of SM-OLS convolution over cuFFT-OLS convolution are shown for C2C
convolutions in Figures 7 and 8 and for R2R convolutions in Figures 11 and 12. The speedup
factors are, in the majority of cases, constant and do not change with the number of filters or the
length of the input signal. This is because the segment size is not affected by these parameters;
the only difference between the two implementations is the number of device memory accesses
performed, or rather not performed, per sample by the SM-OLS implementation. The total number
of processed samples, which also includes the aliased samples, might differ between the two
implementations due to the different segment sizes used, but the ratio of device memory transfers
between these two implementations of the OLS method remains constant, and so the speedup
remains constant as well.

There are exceptions to this rule. In the case of complex-to-complex convolutions, we observe
(in Figure 7 and on the left of Figure 8) higher speedups for small numbers of filters or short signal
lengths.

The higher speedup for short signal lengths is due to the slower performance of the cuFFT-OLS
convolution, which under-utilises GPU resources in this regime. The cuFFT-OLS performs best
with longer segment sizes (8192) which, for shorter signal lengths, do not provide enough parallelism
for the GPU to exploit. The Titan V GPU, which has the most SMs11 , shows the highest speedups,
while the P4 GPU, which has the fewest SMs, is barely affected.

The high speedups for small numbers of filters are caused by the overhead of creating segments
in the cuFFT-OLS implementation. In the case of SM-OLS, this step is included in the GPU
kernel and does not incur additional device memory accesses.
The situation is different for R2R convolutions. Speedup factors of SM-OLS convolution over
cuFFT-OLS convolution are shown in Figures 11 and 12. We see that for short signal lengths the
SM-OLS achieves speedups that are low or below one. This is caused by the under-utilisation of
GPU resources in our SM-OLS implementation. In the case of R2R convolutions, we are able to
convolve a segment of size N using an FFT of size N/2 [19], meaning that we are able to fit
(depending on the FFT size) up to four thread-blocks per SM, which leads to under-utilisation
even for signal sizes of 500k samples. This is best observed on the left of Figure 12, where we show
speedups for short signal lengths (250k samples). The GPU cards most affected (TitanV, P100)
also have the most SMs, while the P4 GPU, with fewer SMs, shows speedups comparable to those
in Figure 11.
Lastly, our SM-OLS has lower performance for shorter filters. This is due to shared memory
bank conflicts in our shared memory implementation of the Stockham FFT algorithm. These
bank conflicts occur in the first few iterations of the algorithm; the execution time of these
iterations dominates the execution time of shorter FFTs and thus decreases the performance of
the whole convolution.

5.3 Non-local post-processing


Figure 13 shows the speedup of SM-OLS over cuFFT-OLS when a non-local post-processing step
is performed. Examples where this might be required include interpolation of the output or
numerical differentiation (which we have used for this demonstration). The change in performance
depends on the filter size used; the speedup can also decrease relative to convolution without
non-local post-processing. This can be seen for the P100 and P4 GPUs when performing real-to-real
convolutions with filters longer than 1025 samples, whereas for shorter filter lengths the speedup
can increase by as much as 30%, as in the case of the Titan V GPU for filter lengths 257 and 513.

6 Conclusions
We have presented an implementation of the shared memory overlap-and-save method for the
one-dimensional convolution of a large data set with a set of short filters. We have demonstrated
a significant speed-up for our shared memory implementation of overlap-and-save over an
implementation of the overlap-and-save method which uses a vendor-supplied FFT library (cuFFT).
We have also demonstrated a speedup over a vendor-supplied library of deep neural network
primitives for NVIDIA GPUs (cuDNN). This work has been used to enable real-time data processing
in the AstroAccelerate software package [24], which performs the Fourier Domain Acceleration
Search for the Square Kilometre Array [2, 4, 5]. Considering the significance of convolution in
signal processing, this implementation could have a noticeable impact in fields such as natural
language processing, monitoring and listening services, speech recognition, and pattern matching.

Future work includes incorporating the shared memory FFT presented in this paper into our
implementation of a polyphase filter [1] to increase its data throughput.
11 An SM, or streaming multiprocessor, is a set of computing cores (the exact number of cores depends on the
architecture) which executes thread instructions in parallel.

Acknowledgements
This work has received support from an STFC Grant (ST/R000557/1). The authors would also
like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) [20]
facility in carrying out this work. This work is supported by a Leverhulme Trust Project Grant
(ARTEMIS: Real-time discovery in Radio Astronomy).

References
[1] K. Adámek, J. Novotný, and W. Armour. A polyphase filter for many-core architectures.
Astronomy and Computing, 16:1–16, July 2016. doi: 10.1016/j.ascom.2016.03.003.
[2] K. Adámek, S. Dimoudi, M. Giles, and W. Armour. Improved Acceleration of the GPU
Fourier Domain Acceleration Search Algorithm. Proceedings of ADASS XXVII, November
2017.
[3] W. T. Cochran, J. W. Cooley, D. L. Favin, H. D. Helms, R. A. Kaenel, W. W. Lang, G. C.
Maling, D. E. Nelson, C. M. Rader, and P. D. Welch. What is the fast fourier transform?
Proceedings of the IEEE, 55(10):1664–1674, Oct 1967. ISSN 0018-9219. doi: 10.1109/PROC.
1967.5957.
[4] S. Dimoudi and W. Armour. Pulsar Acceleration Searches on the GPU for the Square Kilo-
metre Array. Proceedings of ADASS XXV, November 2015.
[5] S. Dimoudi, K. Adamek, P. Thiagaraj, S. M. Ransom, A. Karastergiou, and W. Armour.
A GPU implementation of the Correlation Technique for Real-time Fourier Domain Pulsar
Acceleration Searches. ArXiv e-prints, April 2018.
[6] T. Dobashi and H. Kiya. A parallel implementation method of fft-based full-search block
matching algorithms. In 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing, pages 2644–2648, May 2013. doi: 10.1109/ICASSP.2013.6638135.
[7] J. A. Fernandez and B. V. K. V. Kumar. Multidimensional overlap-add and overlap-save for
correlation and convolution. In 2013 IEEE International Conference on Image Processing,
pages 509–513, Sept 2013. doi: 10.1109/ICIP.2013.6738105.
[8] Jeremy Fowers, Greg Brown, John Wernsing, and Greg Stitt. A performance and energy
comparison of convolution on gpus, fpgas, and multicore processors. ACM Trans. Archit.
Code Optim., 9(4), January 2013. ISSN 1544-3566. doi: 10.1145/2400682.2400684. URL
https://doi.org/10.1145/2400682.2400684.
[9] Naga Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manfer-
delli. High performance discrete fourier transforms on graphics processors. In Proceed-
ings of the 2008 ACM/IEEE conference on Supercomputing. IEEE, January 2008. URL
https://www.microsoft.com/en-us/research/publication/high-performance-discrete-fourier-transform
[10] Eladio Gutierrez, Sergio Romero, Maria A. Trenas, and Emilio L. Zapata. Mem-
ory Locality Exploitation Strategies for FFT on the CUDA Architecture, page
430443. Springer-Verlag, Berlin, Heidelberg, 2008. ISBN 9783540928584. URL
https://doi.org/10.1007/978-3-540-92859-1_39.
[11] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex fourier
series. Mathematics of Computation, 19(90):297–301, 1965. ISSN 00255718, 10886842. URL
http://www.jstor.org/stable/2003354.
[12] M. Jordà, P. Valero-Lara, and A. J. Peña. Performance evaluation of cudnn convolution
algorithms on nvidia volta gpus. IEEE Access, 7:70461–70473, 2019. ISSN 2169-3536. doi:
10.1109/ACCESS.2019.2918851.

[13] A. Lavin and S. Gray. Fast Algorithms for Convolutional Neural Networks. ArXiv e-prints,
September 2015.
[14] R. G. Lyons. Understanding digital signal processing 3rd ed. Prentice Hall, 2011. ISBN
0137027419.
[15] Kenneth Moreland and Edward Angel. The fft on a gpu. In Proceedings of the ACM SIG-
GRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS ’03, pages 112–119,
Aire-la-Ville, Switzerland, Switzerland, 2003. Eurographics Association. ISBN 1-58113-739-7.
URL http://dl.acm.org/citation.cfm?id=844174.844191.
[16] M. J. Narasimha. Modified overlap-add and overlap-save convolution algorithms for real
signals. IEEE Signal Processing Letters, 13(11):669–671, Nov 2006. ISSN 1070-9908. doi:
10.1109/LSP.2006.879475.
[17] NVIDIA. Nvidia cuda deep neural network library (cudnn), 2019. URL
https://developer.nvidia.com/cudnn.
[18] NVIDIA. Nvidia cuda fast fourier transform library (cufft), 2019. URL
https://developer.nvidia.com/cufft.
[19] W.H. Press. Numerical Recipes in C: The Art of Scientific Computing. Number bk. 4 in
Numerical recipes in C : the art of scientific computing / William H. Press. Cambridge
University Press, 1992. ISBN 9780521437202.
[20] Andrew Richards. University of Oxford Advanced Research Computing, August 2015. URL
https://doi.org/10.5281/zenodo.22558.
[21] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. Society
for Industrial and Applied Mathematics, 1992. doi: 10.1137/1.9781611970999. URL
http://epubs.siam.org/doi/abs/10.1137/1.9781611970999.
[22] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast Convo-
lutional Nets With fbfft: A GPU Performance Evaluation. ArXiv e-prints, December 2014.
[23] Vasily Volkov and Brian Kazian. Fitting fft onto the g80 architecture. University of California,
Berkeley, 2008.
[24] Armour W., Adámek K., Novotný J., Dimoudi S., Carels C., and Ouannoughi N. AstroAccel-
erate, February 2019. URL https://doi.org/10.5281/zenodo.2556573.
[25] Frank Wefers and Michael Vorländer. Using fast convolution for FIR filtering: overview
and guidelines for real-time audio rendering. In Proceedings of the International Conference
on Acoustics, AIA-DAGA 2013, 2013. URL
http://pub.dega-akustik.de/AIA_DAGA_2013/data/articles/000683.pdf.
[26] Yi Yang and Huiyang Zhou. A Highly Efficient FFT Using Shared-Memory Multiplexing,
pages 363–377. Springer International Publishing, Cham, 2014. ISBN 978-3-319-06548-9. doi:
10.1007/978-3-319-06548-9_17. URL https://doi.org/10.1007/978-3-319-06548-9_17.
