Chapter 8
Algorithms For Efficient Computation of Convolution
http://dx.doi.org/10.5772/51942
1. Introduction
Convolution is an important mathematical tool in both fields of signal and image processing.
It is employed in filtering [1, 2], denoising [3], edge detection [4, 5], correlation [6],
compression [7, 8], deconvolution [9, 10], simulation [11, 12], and in many other applications.
Although the concept of convolution is not new, its efficient computation is still an open topic. As the amount of processed data constantly increases, there is considerable demand for fast manipulation of huge data sets. Moreover, there is demand for fast algorithms which can exploit the computational power of modern parallel architectures.
The basic convolution algorithm evaluates the inner product of a flipped kernel and a neighbourhood of each individual sample of an input signal. The time complexity of algorithms based on this approach is quadratic, i.e. O(N^2) [13, 14], so their practical implementations are slow. This is true especially for higher-dimensional tasks, where each new dimension worsens the complexity by increasing the degree of the polynomial, i.e. O(N^{2k}). Thanks to its simplicity, the naïve algorithm is often implemented on parallel architectures [15–17], yet the use of such implementations is generally limited to small kernel sizes. Under some circumstances, however, the convolution can be computed faster than this.
If the higher-dimensional convolution kernel is separable [18, 19], it can be decomposed into several lower-dimensional kernels. In this sense, a 2-D separable kernel can be split into two 1-D kernels, for example. Due to the associativity of convolution, the input signal can be convolved step by step, first with one 1-D kernel, then with the second 1-D kernel. The result equals the convolution of the input signal with the original 2-D kernel. Gaussian, Difference of Gaussian, and Sobel kernels are representatives of separable kernels commonly used in signal and image processing. Regarding the time complexity, this approach keeps the higher-dimensional convolution a polynomial of
© 2012 Karas and Svoboda, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
lower degree, i.e. O(kN^{k+1}). On the other hand, there is a nontrivial group of algorithms that use general kernels. For example, deconvolution or template matching algorithms based on correlation methods typically use kernels which cannot be characterized by special properties like separability. In this case, other convolution methods have to be used.
There also exist algorithms that can perform convolution in time O(N). In this concept, the repetitive application of the convolution kernel is reduced thanks to the fact that neighbouring positions overlap. Hence, the convolution in each individual sample is obtained as a weighted sum of both input samples and previously computed output samples. The design of so-called recursive filters [18] allows them to be implemented efficiently on streaming architectures such as FPGA. Mostly, the recursive filters are not designed from scratch. Rather, well-known 1-D filters (Gaussian, Difference of Gaussian, . . . ) are converted into their recursive form. The extension to higher dimensions is straightforward due to their separability. This method also has its drawbacks. The conversion of a general convolution kernel into its recursive version is a nontrivial task. Moreover, recursive filtering often suffers from inaccuracy and instability [2].
While the convolution in the time domain performs an inner product in each sample, in the Fourier domain [20] it can be computed as a simple point-wise multiplication. Due to this convolution property and the fast Fourier transform, the convolution can be performed in time O(N log N). This approach is known as fast convolution [1]. The main advantage of this method stems from the fact that no restrictions are imposed on the kernel. On the other hand, the excessive memory requirements make this approach not very popular. Fortunately, there exists a workaround: if a direct computation of the fast convolution of larger signals or images is not realizable on common computers, one can reduce the whole problem to several subtasks. In practice, this leads to splitting the signal and kernel into smaller pieces. The signal and kernel decomposition can be performed in two ways, as described later in this chapter.
The aim of this chapter is to review the algorithms and approaches for computation
of convolution with regards to various properties such as signal and kernel size or
kernel separability (when processing k-dimensional signals). Target architectures include
superscalar and parallel processing units (namely CPU, DSP, and GPU), programmable
architectures (e.g. FPGA), and distributed systems (such as grids). The structure of the
chapter is designed to cover various applications with respect to the signal size, from small
to large scales.
In the first part, the state-of-the-art algorithms will be reviewed, namely (i) the naïve approach, (ii) convolution with a separable kernel, (iii) recursive filtering, and (iv) convolution in the frequency domain. In the second part, the decomposition of convolution in both the spatial and the frequency domain will be described, together with its implementation on parallel architectures.
1.1. Shortcuts and symbols
In the following list you will find the most commonly used symbols in this chapter. We recommend going through it first to avoid misunderstandings while reading the text.
• F[·], F^{-1}[·] . . . Fourier transform and inverse Fourier transform of a signal, respectively
• W_k^i, W_k^{-i} . . . k-th sample of the i-th Fourier transform base function and inverse Fourier transform base function, respectively
• z* . . . complex conjugate of the complex number z
• ∗ . . . symbol for convolution
• e . . . Euler number (e ≈ 2.718)
• j . . . complex unit (j² = −1)
• f, g . . . input signal and convolution kernel, respectively
• h . . . convolved signal
• F, G . . . Fourier transforms of the input signal f and the convolution kernel g, respectively
• N^f, N^g . . . lengths of the input signal and the convolution kernel, respectively (number of samples)
• n, k . . . index of a signal in the spatial and the frequency domain, respectively
• n′, k′ . . . index of a signal of half length in the spatial and the frequency domain, respectively
• P . . . number of processing units in use
• Φ . . . computational complexity function
• ||s|| . . . number of samples of a discrete signal (sequence) s
2. Naïve approach
First of all, let us recall the basic definition of convolution:
h(t) = (f \ast g)(t) = \int_{-\infty}^{\infty} f(t - \tau)\, g(\tau)\, d\tau.   (1)
Since Eq. (1) is used mainly in fields of research other than image and signal processing, we will focus on the alternative definition that the reader is likely to be more familiar with, i.e. the one for discrete signals:
h(n) = (f \ast g)(n) = \sum_{i=-\infty}^{\infty} f(n - i)\, g(i).   (2)
The basic (or naïve) approach visits the individual time samples n in the input signal f. In each position, it computes the inner product of the current sample neighbourhood and the flipped kernel g, where the size of the neighbourhood is practically equal to the size of the convolution kernel. The result of this inner product is a number which is simply stored at the position n in the output signal h. It is noteworthy that, according to the definition (2), the size of the output signal h is always equal to or greater than the size of the input signal f. This fact is related to the boundary conditions. Let f(n) = 0 for all n < 0 or n > N^f, and also g(n) = 0 for all n < 0 or n > N^g. Then computing the expression (2) at the position n = −1 likely gives a non-zero value, i.e. the output signal becomes larger. It can be derived that the size of the output signal h is equal to N^f + N^g − 1.
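To make the indexing concrete, the following is a minimal NumPy sketch of the naïve algorithm (the function name and signal sizes are illustrative, not taken from the chapter); it evaluates Eq. (2) directly with zero boundary conditions and produces the N^f + N^g − 1 output samples derived above.

```python
import numpy as np

def conv_naive(f, g):
    """Direct evaluation of Eq. (2) with zero boundary conditions;
    the output has N_f + N_g - 1 samples."""
    Nf, Ng = len(f), len(g)
    h = np.zeros(Nf + Ng - 1)
    for n in range(Nf + Ng - 1):
        # inner product of the flipped kernel and the signal neighbourhood
        for i in range(Ng):
            if 0 <= n - i < Nf:
                h[n] += f[n - i] * g[i]
    return h

f, g = np.random.rand(256), np.random.rand(9)
assert np.allclose(conv_naive(f, g), np.convolve(f, g))
```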
2.0.0.1. Analysis of time complexity.
h_{3d}(n_x, n_y, n_z) = (f_{3d} \ast g_{3d})(n_x, n_y, n_z) = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} f_{3d}(n_x - i, n_y - j, n_z - k)\, g_{3d}(i, j, k)   (3)

Here, f_{3d}, g_{3d} and h_{3d} have a similar meaning as in (2). If we assume ||f_{3d}|| = N_x^f × N_y^f × N_z^f and ||g_{3d}|| = N_x^g × N_y^g × N_z^g, the complexity of the filtering rises from N^f N^g in the 1-D case to N_x^f N_y^f N_z^f N_x^g N_y^g N_z^g, which is unusable for larger signals or kernels. Hence, for higher-dimensional tasks the use of this approach becomes impractical, as each dimension increases the degree of this polynomial. Although the time complexity of this algorithm is polynomial, this solution is advantageous only if we handle kernels with a small support. Examples of such kernels are well-known filters from signal/image processing:
Sobel:                 Gaussian:
|  1   2   1 |         | 1  2  1 |
|  0   0   0 |         | 2  4  2 |
| -1  -2  -1 |         | 1  2  1 |
For better insight, let us consider the convolution of two relatively small 3-D signals of 1024 × 1024 × 100 voxels and 128 × 128 × 100 voxels; the example is shown in Fig. 1. When this convolution was performed in double precision on an Intel Xeon QuadCore 2.83 GHz computer, it took approximately 7 days with the basic approach.
2.0.0.2. Parallelization.
Due to its simplicity and lack of specific restrictions, the naïve convolution is still the most popular approach. Its computation is usually sped up by employing large computer clusters that significantly decrease the computation time per computer. This approach [15–17] assumes the availability of a computer cluster, however.
Figure 1. Example of a 3-D convolution: (a) phantom image (1024 × 1024 × 100 pixels), (b) PSF (128 × 128 × 100 pixels), (c) blurred image. The images show an artificial (phantom) image of a tissue, a PSF of an optical microscope, and the blurred image computed by the convolution of the two. Each 3-D image is represented by three 2-D views (XY, YZ, and XZ).
Two frameworks are widely used among the GPGPU community, namely CUDA [32] and OpenCL [33].
Thanks to their ability to efficiently process 2-D and 3-D images and videos, GPUs have been utilized in various image processing applications, including those based on convolution. Several convolution algorithms, including the naïve one, are included in the CUDA Computing SDK [34]. The naïve convolution on graphics hardware has also been described in [35] and is included in the Nvidia Performance Primitives library [36]. Specific applications, namely Canny edge detection [37, 38] or real-time object detection [39], have been studied in the literature. It can be noted that the problem of computing a rank filter such as the median filter has a naïve solution similar to that of the convolution. Examples can be found in the aforementioned CUDA SDK or in [40, 41].
Basically, the convolution is a memory-bound problem [42], i.e. the ratio between the arithmetic
operations and memory accesses is low. The adjacent threads process the adjacent signal
samples including the common neighbourhood. Hence, they should share the data via a
faster memory space, e.g. shared memory [35]. To store input data, programmers can also use
texture memory which is read-only but cached. Furthermore, the texture cache exhibits the
2-D locality which makes it naturally suitable especially for 2-D convolutions.
3. Separable convolution
3.1. Separable convolution
The naïve algorithm is of polynomial complexity. Furthermore, with each added dimension the polynomial degree rises linearly, which leads to a very expensive computation of convolution in higher dimensions. Fortunately, some kernels are so-called separable [18, 19]. The convolution with these kernels can simply be decomposed into several lower-dimensional (let us say "cheaper") convolutions. Gaussian and Sobel [4] kernels are representatives of this group.
A separable convolution kernel must fulfil the condition that its matrix has rank equal to one. In other words, all the rows must be linearly dependent. Why? Let us construct such a kernel. Given one row vector
\vec{u} = (u_1, u_2, u_3, \ldots, u_m)

and one column vector

\vec{v}^T = (v_1, v_2, v_3, \ldots, v_n),

let us convolve them together:

\vec{u} \ast \vec{v} = (u_1, u_2, u_3, \ldots, u_m) \ast \begin{pmatrix} v_1 \\ v_2 \\ v_3 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} u_1 v_1 & u_2 v_1 & u_3 v_1 & \cdots & u_m v_1 \\ u_1 v_2 & u_2 v_2 & u_3 v_2 & \cdots & u_m v_2 \\ u_1 v_3 & u_2 v_3 & u_3 v_3 & \cdots & u_m v_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ u_1 v_n & u_2 v_n & u_3 v_n & \cdots & u_m v_n \end{pmatrix} = A   (4)
It is clear that rank(A) = 1. Here, A is a matrix representing some separable convolution kernel, while \vec{u} and \vec{v} are the previously mentioned lower-dimensional (cheaper) convolution kernels.
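As a small illustration of this property (a sketch, not taken from the chapter), the rank-one condition can be checked numerically and the 1-D factors recovered with an SVD; the kernel values below are the Gaussian-like kernel shown earlier.

```python
import numpy as np

A = np.array([[1, 2, 1],
              [2, 4, 2],
              [1, 2, 1]], dtype=float)       # rank-one (separable) kernel

assert np.linalg.matrix_rank(A) == 1
U, S, Vt = np.linalg.svd(A)
v = U[:, 0] * np.sqrt(S[0])                  # 1-D column kernel
u = Vt[0, :] * np.sqrt(S[0])                 # 1-D row kernel
assert np.allclose(np.outer(v, u), A)        # A is the outer product of Eq. (4)
```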
3.1.0.3. Analysis of Time Complexity.
In the previous section, we derived the complexity of the naïve approach. We also explained how the complexity worsens when we increase the dimensionality of the processed data. In case the convolution kernel is separable, we can split the hard problem into a sequence of several simpler problems. Let us recall the 3-D naïve convolution from (3). Assume that g_{3d} is separable, i.e. g_{3d} = g_x \ast g_y \ast g_z. Then the expression is simplified in the following way:
h_{3d}(n_x, n_y, n_z) = (f_{3d} \ast g_{3d})(n_x, n_y, n_z)   (5)
 = (f_{3d} \ast (g_x \ast g_y \ast g_z))(n_x, n_y, n_z)   /associativity/   (6)
 = (((f_{3d} \ast g_x) \ast g_y) \ast g_z)(n_x, n_y, n_z)   (7)
 = \sum_{i=-\infty}^{\infty} \Big( \sum_{j=-\infty}^{\infty} \Big( \sum_{k=-\infty}^{\infty} f_{3d}(n_x - i, n_y - j, n_z - k)\, g_z(k) \Big)\, g_y(j) \Big)\, g_x(i)   (8)
The complexity of such an algorithm is then reduced from N_x^f N_y^f N_z^f N_x^g N_y^g N_z^g to N_x^f N_y^f N_z^f (N_x^g + N_y^g + N_z^g).
One should keep in mind that the kernel decomposition is usually the only decomposition that can be performed in this task. It relies on the fact that many well-known kernels (Gaussian, Sobel) have some special properties. The input signal, however, is typically unpredictable, and in higher-dimensional cases it is unlikely that one could separate it into individual lower-dimensional signals.
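A minimal NumPy sketch of this idea (function and variable names are illustrative): the 3-D convolution is replaced by three 1-D passes, one per axis, which reduces the cost per voxel from N_x^g N_y^g N_z^g to N_x^g + N_y^g + N_z^g multiplications.

```python
import numpy as np

def convolve_separable_3d(f, gx, gy, gz):
    """Three successive 1-D convolutions along the x, y and z axes,
    replacing one full 3-D convolution (Eqs. (5)-(8)).  The output grows
    along each axis by the kernel length minus one, as in the 1-D case."""
    h = np.apply_along_axis(lambda s: np.convolve(s, gx), 0, f)
    h = np.apply_along_axis(lambda s: np.convolve(s, gy), 1, h)
    h = np.apply_along_axis(lambda s: np.convolve(s, gz), 2, h)
    return h

f = np.random.rand(32, 32, 16)
gx = gy = gz = np.array([1.0, 2.0, 1.0]) / 4.0
print(convolve_separable_3d(f, gx, gy, gz).shape)    # (34, 34, 18)
```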
4. Recursive filtering
The convolution is a process where an inner product, whose size corresponds to the kernel size, is computed again and again at each individual sample. One of the vectors entering this operation (the kernel) is always the same. It is clear that we could compute the whole inner product in one position only, while the neighbouring position can be computed as a slightly modified difference with respect to the first one. Analogously, the same is valid for all the following positions. The computation of the convolution using this difference-based approach is called recursive filtering [2, 18].
4.0.0.4. Example.
The well-known pure averaging filter in 1-D is defined as follows:

h(n) = \sum_{i=0}^{N-1} f(n - i)   (9)

where N is the size of the filter support. The performance of this filter worsens with the width of its support. Fortunately, there exists a recursive version of this filter with constant complexity regardless of the size of its support. Such a filter is no longer defined via the standard convolution but via the recursive formula:

h(n) = h(n - 1) + f(n) - f(n - N)   (10)
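A minimal sketch of Eqs. (9) and (10) in NumPy (names are illustrative): the recursive variant needs a constant number of operations per sample, independent of the support N.

```python
import numpy as np

def moving_average_naive(f, N):
    """Direct evaluation of Eq. (9): O(N) operations per output sample."""
    return np.array([f[max(n - N + 1, 0):n + 1].sum() for n in range(len(f))])

def moving_average_recursive(f, N):
    """Recursive evaluation of Eq. (10): O(1) operations per output sample,
    independent of the support N."""
    h = np.zeros(len(f))
    for n in range(len(f)):
        previous = h[n - 1] if n > 0 else 0.0
        leaving = f[n - N] if n - N >= 0 else 0.0   # sample dropping out of the window
        h[n] = previous + f[n] - leaving
    return h

f = np.random.rand(50)
assert np.allclose(moving_average_naive(f, 7), moving_average_recursive(f, 7))
```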
The transformation of a standard convolution into recursive filtering is not a simple task. There are three main issues that have to be solved:

1. replication – given a slow (but correctly working) non-recursive filter, find its recursive version
2. stability – the recursive formula may cause the computation to diverge
3. accuracy – the recursion may cause the accumulation of small errors

The transformation is a quite complex task, and the so-called Z-transform [22] is typically employed in this process. Each recursive filter may be designed from scratch, as any other filter. In practice, however, standard well-known filters are used as the basis and their recursive counterparts are subsequently derived. There are two principal approaches to do so:

• analytically – the filter is constructed step by step via mathematical formulas [46]
• numerically – the filter is derived using numerical methods [47, 48]
5. Fast convolution
In the previous sections, we introduced the common approaches to computing the convolution in the time (spatial) domain. We mentioned that in some applications one has to cope with signals of millions of samples, where the computation of the convolution requires too much time. Hence, for long or multi-dimensional input signals, the popular approach is to compute the convolution in the frequency domain, which is sometimes referred to as the fast convolution. As shown in [45], the fast convolution can be even more efficient than the separable version if the number of kernel samples is large enough. Although the concepts of the fast Fourier transform [54] and the frequency-based convolution [55] are several decades old, new problems arise as new architectures appear. For example, the efficient access to memory was an important issue in the 1970s [56] just as it is today [21, 23]. Another problem to be considered is the numerical precision [57].
In the following text, we will first recall the Fourier transform along with some of its
important properties and the convolution theorem which provides us with a powerful
tool for the convolution computation. Subsequently, we will describe the algorithm of the
so-called fast Fourier transform, often simply denoted as FFT, and mention some notable
implementations of the FFT. Finally, we will summarize the benefits and drawbacks of the
fast convolution.
F(\omega) \equiv \int_{-\infty}^{+\infty} f(t)\, e^{-j\omega t}\, dt, \qquad f(t) \equiv \frac{1}{2\pi} \int_{-\infty}^{+\infty} F(\omega)\, e^{j\omega t}\, d\omega.   (11)
The discrete finite equivalents of the aforementioned transforms are defined as follows:
F(k) \equiv \sum_{n=0}^{N-1} f(n)\, e^{-j(2\pi/N)nk}, \qquad f(n) \equiv \frac{1}{N} \sum_{k=0}^{N-1} F(k)\, e^{j(2\pi/N)kn}   (12)
where k, n = 0, 1, . . . , N − 1. The so-called normalization factors 1/(2π) and 1/N, respectively, guarantee that the identity f = F^{-1}[F[f]] is maintained. The exponential function e^{-j(2\pi/N)} is called the base function. For the sake of simplicity, we will refer to it as W_N.
Figure 2. Example of the so-called windowing effect produced by signal f (a) and kernel g (b). The circular convolution causes border effects as seen in the fast (circular) convolution (c); the properly computed basic convolution is shown in (d).
If the sequence f(n), n = 0, 1, . . . , N − 1, is real, the discrete Fourier transform F(k) exhibits specific properties, in particular:

F(k) = F(N - k)^{*}.   (13)

This means that in the output signal F, only half of the samples are useful; the rest is redundant. As real signals are typical for many practical applications, most popular FT and FFT implementations provide users with special functions to handle real signals in order to save time and memory.
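A short numerical check of the redundancy in Eq. (13), assuming NumPy's FFT routines: for a real input, the second half of the spectrum is the complex conjugate of the first, which is exactly what the real-input transform (rfft) exploits.

```python
import numpy as np

f = np.random.rand(16)
F = np.fft.fft(f)
# Hermitian symmetry of Eq. (13): F(k) = F(N - k)* for a real input
assert np.allclose(F[1:], np.conj(F[:0:-1]))
# rfft stores only the non-redundant half (N/2 + 1 samples)
assert np.allclose(np.fft.rfft(f), F[:len(f) // 2 + 1])
```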
The convolution theorem states that

F[f \ast g] = F[f] \cdot F[g].   (14)

In the following text, we will sometimes refer to the convolution computed by applying Eq. (14) as the "classical" fast convolution algorithm.
In the discrete case, the same holds for periodic signals (sequences) and is sometimes referred to as the circular or cyclic convolution [22]. However, in practical applications, one usually deals with non-periodic finite signals. This results in the so-called windowing problem [59], causing undesirable artefacts in the output signals (see Fig. 2). In practice, the problem is usually solved by either imposing the periodicity on the kernel, adding a so-called windowing function, or padding the kernel with zero values. One also has to consider the sizes of both the input signal and the convolution kernel, which have to be equal. Generally, this is also solved by padding both the signal and the kernel with zero values. The size of both padded signals which enter the convolution is hence N = N^f + N^g − 1, where N^f and N^g are the numbers of signal and kernel samples, respectively. The equivalent property holds for the multi-dimensional case. The most time-demanding operation of the fast convolution approach is the Fourier transform, which can be computed by the fast Fourier transform
algorithm. The time complexity of the fast convolution is hence equal to the complexity of the FFT, that is O(N log N). The detailed discussion on the complexity is provided in Section 6.

Figure 3. The two basic radix-2 FFT algorithms: decimation-in-time and decimation-in-frequency, demonstrated on an input signal of 8 samples.
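A minimal NumPy sketch of the fast convolution (illustrative only): both inputs are zero-padded to N^f + N^g − 1 samples so that the circular convolution produced by the point-wise spectral multiplication coincides with the linear one.

```python
import numpy as np

def conv_fft(f, g):
    """FFT-based linear convolution: both inputs are zero-padded to
    N = N_f + N_g - 1, so the circular convolution equals the linear one."""
    N = len(f) + len(g) - 1
    F = np.fft.rfft(f, N)              # rfft exploits the real-input redundancy
    G = np.fft.rfft(g, N)
    return np.fft.irfft(F * G, N)      # point-wise multiplication, Eq. (14)

f, g = np.random.rand(10_000), np.random.rand(500)
assert np.allclose(conv_fft(f, g), np.convolve(f, g))
```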
where k = 0, 1, . . . , N − 1. The signals F_e and F_o are of half length; however, they are periodic, hence the relation holds for all k. The individual variants of the algorithm for a particular N are called radix-N algorithms.
f_r(n') = f(n') + f(n' + N/2), \qquad f_s(n') = \left[ f(n') - f(n' + N/2) \right] W_N^{-n'}.   (17)

Then, the Fourier transforms F_r and F_s fulfil the following property: F_r(k') = F(2k') and F_s(k') = F(2k' + 1) for any k' = 0, 1, . . . , N/2 − 1. Hence, the sequences f_r and f_s are then processed recursively, as shown in Fig. 3(b). It is easy to deduce the inverse equation from Eq. (17):

f(n') = \frac{1}{2}\left[ f_r(n') + f_s(n')\, W_N^{n'} \right], \qquad f(n' + N/2) = \frac{1}{2}\left[ f_r(n') - f_s(n')\, W_N^{n'} \right].   (18)
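The DIF relations can be verified numerically; the following sketch uses NumPy's DFT sign convention, so the twiddle factor is written explicitly and may differ in sign from the W_N notation above.

```python
import numpy as np

f = np.random.rand(16) + 1j * np.random.rand(16)
N = len(f)
n = np.arange(N // 2)
f_r = f[:N // 2] + f[N // 2:]
f_s = (f[:N // 2] - f[N // 2:]) * np.exp(-2j * np.pi * n / N)   # DIF twiddle
F = np.fft.fft(f)
assert np.allclose(np.fft.fft(f_r), F[0::2])    # even-indexed spectrum
assert np.allclose(np.fft.fft(f_s), F[1::2])    # odd-indexed spectrum
```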
steps [8]. Here, the term (N^f + N^g) reflects the fact that the processed signal f was zero-padded to prevent the overlap effect caused by circular convolution. The kernel was modified in the same way. Another advantage of using the Fourier transform stems from its separability.
Convolving two 3-D signals f_{3d} and g_{3d}, where ||f_{3d}|| = N_x^f × N_y^f × N_z^f and ||g_{3d}|| = N_x^g × N_y^g × N_z^g, we need only

(N_x^f + N_x^g)(N_y^f + N_y^g)(N_z^f + N_z^g) \left[ \frac{9}{2} \log_2 \left( (N_x^f + N_x^g)(N_y^f + N_y^g)(N_z^f + N_z^g) \right) + 1 \right]   (20)

steps in total.
Up to now, this method seems to be optimal. Before we proceed, let us look into the space
complexity of this approach. If we do not take into account buffers for the input/output
signals and serialize both Fourier transforms, we need space for two equally aligned Fourier
signals and some negligible Fourier transform workspace. In total, it is
(N^f + N^g) \cdot C   (21)
Figure 4. Using the overlap-save and overlap-add methods, the input data can be segmented into smaller blocks and convolved
separately. Finally, the sub-parts are concatenated (a) or summed (b) together.
\left( \frac{N^f}{m} + N^g \right) \cdot C   (22)
bytes. Concerning the time complexity, after splitting the signal f into m tiles, we need to
perform
" ! #
f g 9 Nf g
( N + mN ) log2 +N + 1 (23)
2 m
the idea of kernel decomposition and the minus sign in Eq. (2), which causes the whole kernel to be flipped. As soon as the convolution f_i' \ast g_j' is performed, its result is cropped to the size ||f_i|| and added to the output signal h at the position defined by the overlap-save method. Finally, all the convolutions f_i' \ast g_j', j = 1, 2, . . . , n are performed to get the complete result for one given tile f_i. A general form of the method is shown in Fig. 4(b).
The complete computation of the convolution across all signal and kernel tiles is sketched in Algorithm 1.
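Algorithm 1 is not reproduced here; the following is a hedged NumPy sketch of the same nested-loop idea in an overlap-add form (rather than the overlap-save cropping described above): both the signal and the kernel are split, each pair of tiles is convolved via the FFT, and the partial results are added into the output at the offsets given by the tile positions.

```python
import numpy as np

def conv_tiled(f, g, m, n):
    """Overlap-add convolution with both the signal (m tiles) and the
    kernel (n tiles) decomposed; each partial result f_i * g_j is added
    into the output at the offset given by the two tile positions."""
    h = np.zeros(len(f) + len(g) - 1)
    f_off = 0
    for fi in np.array_split(f, m):
        g_off = 0
        for gj in np.array_split(g, n):
            size = len(fi) + len(gj) - 1
            part = np.fft.irfft(np.fft.rfft(fi, size) * np.fft.rfft(gj, size), size)
            h[f_off + g_off : f_off + g_off + size] += part
            g_off += len(gj)
        f_off += len(fi)
    return h

f, g = np.random.rand(10_000), np.random.rand(2_000)
assert np.allclose(conv_tiled(f, g, m=4, n=2), np.convolve(f, g))
```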
!" ! #
N f Ng 9 N f Ng
+ log2 + +1 . (24)
m n 2 m n
We have m signal tiles and n kernel tiles. In order to perform the complete convolution f ∗ g
we have to perform m×n convolutions (see the nested loops in Algorithm 1) of the individual
signal and kernel tiles. In total, we have to complete
" ! #
f
9
g N f Ng
nN + mN log2 + +1 (25)
2 m n
steps. One can clearly see that without any division (m = n = 1) we get the complexity of the fast convolution, i.e. the class O((N^f + N^g) log(N^f + N^g)). For total division (m = N^f and n = N^g), the expression degrades to O(N^f N^g), i.e. the complexity of the naïve approach.
Figure 5. A graph of the function Φ(x, y) that represents the time complexity of the tiled convolution. The x-axis and y-axis correspond to the number of samples in the signal tile and the kernel tile, respectively. The minimum of Φ(x, y) occurs where both variables (the tile sizes) are maximized and equal at the same time.
Concerning the space complexity, splitting the signal and the kernel reduces the memory requirements to

\left( \frac{N^f}{m} + \frac{N^g}{n} \right) \cdot C   (26)
bytes, where C is again the precision dependent constant and m, n are the levels of division
of signal f and kernel g, respectively.
6.2.0.9. Algorithm optimality.
We have just designed an algorithm that splits the signal f into m tiles and the kernel g into n tiles. Now we will answer the question regarding the optimal way of splitting the input signal and the kernel. As the relationship between m and n is hard to express and N^f and N^g are constants, let us define the following substitution: x = N^f/m and y = N^g/n. Here x and y stand for the sizes of the signal and the kernel tiles, respectively. Applying this substitution to Eq. (25) and simplifying, we get the function
\Phi(x, y) = N^f N^g \left( \frac{1}{x} + \frac{1}{y} \right) \left[ \frac{9}{2} \log_2 (x + y) + 1 \right]   (27)
The plot of this function is depicted in Figure 5. The minimum of this function is reached if and only if x = y and both variables x and y are maximized, i.e. the input signal and the kernel tiles should be of the same size (equal number of samples) and they should be as large as possible. In order to reach the optimal solution, the size of the tile should also be a power of small primes [70]. In this sense, it is recommended to fulfil both criteria put on the tile size: the maximality (as stated above) and the capability of simple decomposition into small primes.
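A small numerical illustration of this claim (an assumption-laden sketch, not part of the original analysis): if the available memory fixes the total padded tile size x + y, then Eq. (27) is minimized for x = y.

```python
import numpy as np

def phi(x, y, Nf=1_000_000, Ng=1_000_000):
    """Cost model of Eq. (27) for signal-tile size x and kernel-tile size y."""
    return Nf * Ng * (1.0 / x + 1.0 / y) * (4.5 * np.log2(x + y) + 1.0)

budget = 2 ** 20                      # assume x + y is fixed by the available memory
x = np.arange(1, budget, dtype=float)
best = x[np.argmin(phi(x, budget - x))]
print(best, budget // 2)              # the minimum lies at x = y = budget / 2
```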
" ! #
f g
f g f g
Ny Ny
f g f g f g Nx Nx Nz Nz
nNx+mNx nNy +mNy nNz +mNz 9
2 log2 m + n m + n m + n +1 (28)
This statement can be further generalized to higher dimensions or to an irregular tiling process. The proof can be simply derived from the separability of the multidimensional Fourier transform, which guarantees that the time complexity of the higher-dimensional Fourier transform depends only on the number of processed samples. There is no difference in the time complexity whether the higher-dimensional signal is elongated or cube-shaped.
6.4. Parallelization
6.4.0.10. On multicore CPU.
As the majority of recent computers are equipped with multi-core CPUs, the following text will be devoted to the parallelization of our approach on this architecture. Each such computer is equipped with two or more cores; however, the cores share one memory. This means that executing two or more huge convolutions concurrently may simply fail due to a lack of available memory. A possible workaround is to perform one more division, i.e. the signal and kernel tiles are further split into even smaller pieces. Let p be the number of sub-pieces the signal and the kernel tiles should be split into. Let P be the number of available processors. If we execute the individual convolutions in parallel, we get the overall number of multiplications

\frac{npN^f + mpN^g}{P} \left[ \frac{9}{2} \log_2 \left( \frac{N^f}{mp} + \frac{N^g}{np} \right) + 1 \right]   (29)
• p < P . . . The space complexity becomes worse than in the original non-parallelized version (26). Hence, there is no advantage in using this approach.
• p > P . . . There are no additional memory requirements. However, the signal and the kernel are split into too many small pieces. We have to handle a large number of tile overlaps, which causes the time complexity (29) to become worse than in the non-parallelized case (25).
• p = P . . . The space complexity is the same as in the original approach. The time complexity is slightly better, but practically it brings no advantage due to the large number of memory accesses. The efficiency of this approach would become evident only if P ≫ 1. As standard multi-core processors are typically equipped with only 2, 4 or 8 cores, this approach was not found to be very useful either.
Figure 6. A scheme description of the convolution algorithm with the decomposition in the frequency domain [71]: (a) DIF decomposition, (b) DIF decomposition with the optimization for real data. An input signal is decomposed into 2 parts by the decimation-in-frequency (DIF) algorithm. The parts are subsequently processed independently using the discrete Fourier transform (DFT).
The memory requirements are inversely proportional to the number of parts d the signals are divided into. The algorithm is hence suitable for architectures with a star topology where the central node is relatively slow but has large memory, and the end nodes are fast but have small memory. A powerful desktop PC with one or several GPU cards is a typical example of such an architecture.
It can be noted that the decimation-in-time (DIT) algorithm can also be used for the purpose of decomposing the convolution problem. However, its properties make it less efficient for practical use. Firstly, its time complexity is comparable to that of DIF. Secondly, and most importantly, it requires significantly more data transfers between the central and end nodes. In Section 7.5, the complexity of the individual algorithms is analysed in detail.
F(k') = \frac{1}{2} \left[ \alpha_+(k') - j W_N^{k'} \alpha_-(k') \right], \qquad F(k' + N/2) = \frac{1}{2} \left[ \alpha_+(k') + j W_N^{k'} \alpha_-(k') \right],   (31)

where the auxiliary sequences α_+(k') and α_-(k') are defined in Eq. (32).
As the third approach yields the best performance, it is used in the final version of the
algorithm. The computation of Eq. (31), (32) will be further referred to as the recombination
phase. The scheme description of the algorithm is shown in Fig. 6(b).
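The following NumPy sketch illustrates the DIF-based decomposition of Fig. 6(a) for 1-D data (function names are illustrative, and the twiddle signs follow NumPy's DFT convention rather than the W_N notation above): the padded signal and kernel are split by Eq. (17), the two half-length sub-convolutions are computed independently (in the real implementation, on different end nodes), and the result is assembled in the recombination phase.

```python
import numpy as np

def dif_split(x):
    """Eq. (17): split a length-N sequence into two half-length sequences
    whose DFTs are the even- and odd-indexed samples of the full spectrum."""
    N = len(x)
    n = np.arange(N // 2)
    twiddle = np.exp(-2j * np.pi * n / N)      # NumPy DFT sign convention
    return x[:N // 2] + x[N // 2:], (x[:N // 2] - x[N // 2:]) * twiddle

def dif_convolve(f, g):
    """Circular convolution of two length-N sequences computed as two
    independent half-length sub-convolutions plus a recombination phase."""
    N = len(f)
    f_r, f_s = dif_split(f)
    g_r, g_s = dif_split(g)
    # the two sub-problems can be assigned to different end nodes (GPUs)
    a_r = np.fft.ifft(np.fft.fft(f_r) * np.fft.fft(g_r))    # even-indexed part
    a_s = np.fft.ifft(np.fft.fft(f_s) * np.fft.fft(g_s))    # odd-indexed part
    # recombination, the inverse of the DIF split (cf. Eq. (18))
    n = np.arange(N // 2)
    w_inv = np.exp(2j * np.pi * n / N)
    return 0.5 * np.concatenate((a_r + a_s * w_inv, a_r - a_s * w_inv))

f, g = np.random.rand(1_000), np.random.rand(200)
N = 1 << int(np.ceil(np.log2(len(f) + len(g) - 1)))          # even padded length
h = dif_convolve(np.pad(f, (0, N - len(f))), np.pad(g, (0, N - len(g))))
assert np.allclose(h.real[:len(f) + len(g) - 1], np.convolve(f, g))
```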
Figure 7. A scheme description of the proposed algorithm for the convolution with the decomposition in the frequency
domain, implemented on GPU [21]. The example shows the decomposition into 4 parts.
Figure 8. A model timeline of the algorithm workflow [21]. The dark boxes denote data transfers between CPU and GPU while the light boxes represent convolution computations. The first row shows the single-GPU implementation. The second row depicts the parallel usage of two GPUs. The data transfers are performed concurrently but through a common bus, therefore they take twice as long. In the third row, the data transfers are synchronized so that only one transfer is made at a time. In the last row, the data transfers are overlapped with the convolution execution.
It is therefore recommendable to overlap the data transfers with some computation phases in order to keep the implementation as efficient as possible.
To prove the importance of the overlapping, we provide a detailed analysis of the algorithm
workflow. The overall computation time T required by the algorithm can be expressed as
follows:
T = \max(t_p + t_d, t_a) + t_{h \to d} + \frac{t_{conv}}{P} + t_{d \to h} + t_c,   (33)
where t_p is the time required for the initial signal padding, t_d for decomposition, t_a for allocating memory and setting up FFT plans on GPU, t_{h→d} for data transfers from CPU to GPU (host to device), t_conv for the convolution including the FFT, the recombination phase, the point-wise multiplication, and the inverse FFT, t_{d→h} for data transfers from GPU to CPU (device to host), and finally t_c for composition. The number of end nodes (GPU cards) is denoted by P. It is evident that, in accordance with the famous Amdahl's law [72], the speed-up achieved by multiple end nodes is limited by the only parallel phase of the algorithm, which is the convolution itself. Now, if the data are decomposed into d parts and sent to P end units and d > P > 1, the data transfers can be overlapped with the convolution phase. This means that the real computation time is shorter than T in Eq. (33); Eq. (33) can hence be viewed as an upper limit. A model example is shown in Fig. 8.
Method  | # of operations                                         | # of data transfers  | Memory required per node
DIF     | (N^f + N^g) [ (9/2) log_2(N^f + N^g) + 1 ]              | 3(N^f + N^g)         | 4(N^f + N^g)/d
DIT     | (N^f + N^g) [ (9/2) log_2(N^f + N^g) + 2 ]              | (d + 1)(N^f + N^g)   | 4(N^f + N^g)/d
Tiling  | d(N^f + N^g) [ (9/2) log_2(N^f + N^g/d) + 1 ]           | (d + 1)(N^f + N^g)   | (N^f + N^g)/d

Table 1. Methods for decomposition of the fast convolution and their requirements
To conclude the results, it can be noted that the tiling method is the best one in terms of memory demands. It requires 4× less memory per end node than the DIF-based and the DIT-based algorithms. On the other hand, both the number of operations and the number of data transfers depend on the parameter d, which is not the case for the DIF-based method. By dividing the data into more sub-parts, the memory requirements of the DIF-based algorithm decrease while the number of operations and memory transactions remains constant. Hence, the DIF-based algorithm can generally be more efficient than the tiling.
This is typical for systems equipped with one or more GPU units, where the CPU is usually provided with more memory; hence, it is used as a central node for the (de)composition of the data.
8. Conclusions
In this text, we introduce the convolution as an important tool in both signal and image processing. In the first part, we mention some of the most popular applications it is employed in and recall its mathematical definition. Subsequently, we present a number of common algorithms for an efficient computation of the convolution on various architectures. The simplest approach, the so-called naïve convolution, is to perform the convolution directly according to the definition. Although it is less efficient than other algorithms, it is the most general one and is popular in some specific applications where small convolution kernels are used, such as edge or object detection. If the convolution kernel is multi-dimensional and can be expressed as a convolution of several 1-D kernels, then the naïve convolution is usually replaced by its alternative, the so-called separable convolution. The lowest time complexity can be achieved by using recursive filtering. Here, the result of the convolution at each position can be obtained by applying a few arithmetical operations to the previous result. Besides the efficiency, the advantage is that these filters are suitable for streaming architectures such as FPGA. On the other hand, this method is generally not suitable for all convolution kernels, as the recursive filters are often numerically unstable and inaccurate.
The last algorithm presented in the chapter is the fast convolution. According to the so-called convolution theorem, the convolution can be computed in the frequency domain by a simple point-wise multiplication of the Fourier transforms of the input signals. This approach is the most suitable for long signals and kernels, as it yields generally the best time complexity. However, it has non-trivial memory demands caused by the fact that the input data need to be padded.
Therefore, in the second part of the chapter, we describe two approaches to reduce the memory requirements of the fast convolution. The first one, so-called tiling, is performed in the spatial (time) domain. It is the most efficient with respect to the memory requirements. However, with a higher number of sub-parts the input data are divided into, both the number of arithmetical operations and the number of potential data transfers increase. Hence, in some applications or on some architectures (such as a desktop PC with one or multiple graphics cards) where the overhead of data transfers is critical, one can use a different approach, based on the decimation-in-frequency (DIF) algorithm, which is widely known from the concept of the fast Fourier transform. We also mention a third method, based on the decimation-in-time (DIT) algorithm. However, the DIT-based algorithm is inferior from every point of view, so there is no reason for it to be used instead of the DIF-based one. At the end of the chapter, we also provide a detailed analysis of (i) the number of arithmetical operations, (ii) the number of data transfers, and (iii) the memory requirements for each of the three methods.
As the convolution is one of the most extensively studied operations in signal processing, the list of algorithms and implementations mentioned in this chapter is not, and cannot be, complete. Nevertheless, we tried to include those that we consider the most popular and widely used. We also believe that the decomposition tricks which are described in the second part of the chapter, and which are the subject of the authors' original research, can help readers improve their own applications, regardless of the target architecture.
Acknowledgments
This work has been supported by the Grant Agency of the Czech Republic (Grant No.
P302/12/G157).
Author details
Pavel Karas⋆ and David Svoboda
⋆ Address all correspondence to: [email protected]
Centre for Biomedical Image Analysis, Faculty of Informatics, Masaryk University, Brno,
Czech Republic
References
[1] J. Jan. Digital Signal Filtering, Analysis and Restoration (Telecommunications Series).
INSPEC, Inc., 2000.
[3] A. Foi. Noise estimation and removal in MR imaging: The variance stabilization approach. In IEEE International Symposium on Biomedical Imaging: from Nano to Macro, pages 1809–1814, 2011.
[4] J. R. Parker. Algorithms for Image Processing and Computer Vision. Wiley Publishing, 2nd
edition, 2010.
[5] J. Canny. A computational approach to edge detection. IEEE T-PAMI, 8(6):679–698, 1986.
[6] D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern
Recognition, 13(2):111–122, 1981.
[8] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Prentice Hall, 2002. ISBN:
0-201-18075-8.
[10] P. J. Verveer. Computational and optical methods for improving resolution and signal quality in fluorescence microscopy. PhD thesis, 1998.
[13] W. K. Pratt. Digital Image Processing. Wiley, 3rd edition, 2001.
[15] H.-M. Yip, I. Ahmad, and T.-C. Pong. An Efficient Parallel Algorithm for Computing the
Gaussian Convolution of Multi-dimensional Image Data. The Journal of Supercomputing,
14(3):233–255, 1999. ISSN: 0920-8542.
[18] B. Jähne. Digital Image Processing. Springer, 5th edition, 2002.
[19] Robert Hummel and David Loew. Computing Large-Kernel Convolutions of Images.
Technical report, New York University, Courant Institute of Mathematical Sciences, 1986.
[21] P. Karas and D. Svoboda. Convolution of large 3D images on GPU and its
decomposition. EURASIP Journal on Advances in Signal Processing, 2011(1):120, 2011.
[22] A.V. Oppenheim, R.W. Schafer, J.R. Buck, et al. Discrete-time signal processing, volume 2. Prentice Hall, Upper Saddle River, NJ, 1989.
[23] D. Svoboda. Efficient computation of convolution of huge images. Image Analysis and
Processing–ICIAP 2011, pages 453–462, 2011.
[27] A. Herout, P. Zemcik, M. Hradis, R. Juranek, J. Havel, R. Josth, and L. Polok. Low-Level
Image Features for Real-Time Object Detection. InTech, 2010.
[28] H. Shan and N. A. Hazanchuk. Adaptive Edge Detection for Real-Time Video Processing
using FPGAs. Application notes, Altera Corporation, 2005.
[29] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn, and T.J.
Purcell. A Survey of General-Purpose Computation on Graphics Hardware. pages
21–51, August 2005.
[31] S. Ryoo, C.I. Rodrigues, S.S. Baghsorkhi, S.S. Stone, D.B. Kirk, and Wen-mei W. Hwu.
Optimization principles and application performance evaluation of a multithreaded
GPU using CUDA. In PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium
on Principles and practice of parallel programming, pages 73–82, New York, NY, USA, 2008.
ACM.
[37] Y. Luo and R. Duraiswami. Canny edge detection on NVIDIA CUDA. In Computer Vision
and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on,
pages 1–8, Jun 2008.
[38] K. Ogawa, Y. Ito, and K. Nakano. Efficient Canny Edge Detection Using a GPU. In
Networking and Computing (ICNC), 2010 First International Conference on, pages 279–280,
Nov 2010.
[40] Ke Zhang, Jiangbo Lu, G. Lafruit, R. Lauwereins, and L. Van Gool. Real-time accurate
stereo with bitwise fast voting on CUDA. In IEEE 12th International Conference on
Computer Vision Workshops (ICCV Workshops), pages 794 –800, Oct 2009.
[41] Wei Chen, M. Beister, Y. Kyriakou, and M. Kachelries. High performance median
filtering using commodity graphics hardware. In Nuclear Science Symposium Conference
Record (NSS/MIC), 2009 IEEE, pages 4142–4147, Nov 2009.
[42] S. Che, M. Boyer, J. Meng, D. Tarjan, J.W. Sheaffer, and K. Skadron. A performance
study of general-purpose applications on graphics processors using CUDA. Journal of
parallel and distributed computing, 68(10):1370–1380, 2008.
[43] Zhaoyi Wei, Dah-Jye Lee, B. E. Nelson, J. K. Archibald, and B. B. Edwards. FPGA-Based
Embedded Motion Estimation Sensor. 2008.
[44] XinXin Wang and B.E. Shi. GPU implemention of fast Gabor filters. In Circuits and
Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 373–376, Jun
2010.
[45] O. Fialka and M. Čadík. FFT and Convolution Performance in Image Filtering on
GPU. In Information Visualization, 2006. IV 2006. Tenth International Conference on, pages
609–614, 2006.
[46] J. S. Jin and Y. Gao. Recursive implementation of LoG filtering. Real-Time Imaging,
3(1):59–65, February 1997.
[47] R. Deriche. Using Canny’s criteria to derive a recursively implemented optimal edge
detector. The International Journal of Computer Vision, 1(2):167–187, May 1987.
[48] I. T. Young and L. J. van Vliet. Recursive implementation of the Gaussian filter. Signal
Processing, 44(2):139–151, 1995.
[49] F.G. Lorca, L. Kessal, and D. Demigny. Efficient ASIC and FPGA implementations of IIR
filters for real time edge detection. In Image Processing, 1997. Proceedings., International
Conference on, volume 2, pages 406–409 vol.2, Oct 1997.
[50] R.D. Turney, A.M. Reza, and J.G.R. Delva. FPGA implementation of adaptive temporal
Kalman filter for real time video filtering. In Acoustics, Speech, and Signal Processing, 1999.
Proceedings., 1999 IEEE International Conference on, volume 4, pages 2231–2234 vol.4, Mar
1999.
[51] J. Diaz, E. Ros, F. Pelayo, E.M. Ortigosa, and S. Mota. FPGA-based real-time optical-flow
system. Circuits and Systems for Video Technology, IEEE Transactions on, 16(2):274–279, Feb
2006.
[54] E.O. Brigham and R.E. Morrow. The fast Fourier transform. Spectrum, IEEE, 4(12):63–70,
1967.
[55] H.J. Nussbaumer. Fast Fourier Transform and Convolution Algorithms. Springer Series in Information Sciences, vol. 2. Springer-Verlag, Berlin and New York, 1982.
[58] R. N. Bracewell. The Fourier Transform and Its Applications. McGraw-Hill, 3rd edition,
2000.
[59] F.J. Harris. On the use of windows for harmonic analysis with the discrete Fourier
transform. Proceedings of the IEEE, 66(1):51–83, 1978.
[60] J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex
Fourier series. Math. Comput, 19(90):297–301, 1965.
[61] M. Frigo and S.G. Johnson. The Fastest Fourier Transform in the West. 1997.
[65] A. Nukada, Y. Ogata, T. Endo, and S. Matsuoka. Bandwidth intensive 3-D FFT kernel
for GPUs using CUDA. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on
Supercomputing, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.
[68] Z. Li, H. Sorensen, and C. Burrus. FFT and convolution algorithms on DSP
microprocessors. In Acoustics, Speech, and Signal Processing, IEEE International Conference
on ICASSP’86., volume 11, pages 289–292. IEEE, 1986.
[69] I.S. Uzun, A. Amira, and A. Bouridane. FPGA implementations of fast Fourier
transforms for real-time signal and image processing. In Vision, Image and Signal
Processing, IEE Proceedings-, volume 152, pages 283–296. IET, 2005.
[70] M. Heideman, D. Johnson, and C. Burrus. Gauss and the history of the fast Fourier
transform. ASSP Magazine, IEEE, 1(4):14–21, Oct 1984. ISSN: 0740-7467.
[71] P. Karas, D. Svoboda, and P. Zemčík. GPU Optimization of Convolution for Large 3-D
Real Images. In Advanced Concepts for Intelligent Vision Systems (ACIVS), 2012. Springer,
2012. Accepted.
[72] G.M. Amdahl. Validity of the single processor approach to achieving large scale
computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer
conference, pages 483–485. ACM, 1967.