


This is the author's version of the work. The definitive work was published in Proceedings of the 26th IEEE International Parallel & Distributed
Processing Symposium (IPDPS), Shanghai, China, May 21–25, 2012. DOI: 10.1109/IPDPS.2012.59

Generating Device-specific GPU code for Local Operators in Medical Imaging

Richard Membarth, Frank Hannig, and Jürgen Teich
Department of Computer Science, University of Erlangen-Nuremberg, Germany.
{richard.membarth,hannig,teich}@cs.fau.de

Mario Körner and Wieland Eckert
Siemens Healthcare Sector, H IM AX, Forchheim, Germany.
{mario.koerner,wieland.eckert}@siemens.com

Abstract—To cope with the complexity of programming GPU accelerators for medical imaging computations, we developed a framework to describe image processing kernels in a domain-specific language, which is embedded into C++. The description uses decoupled access/execute metadata, which allow the programmer to specify both execution constraints and memory access patterns of kernels. A source-to-source compiler translates this high-level description into low-level CUDA and OpenCL code with automatic support for boundary handling and filter masks. Taking the annotated metadata and the characteristics of the parallel GPU execution model into account, two-layered parallel implementations—utilizing SPMD and MPMD parallelism—are generated. An abstract hardware model of graphics card architectures allows us to model GPUs of multiple vendors like AMD and NVIDIA, and to generate device-specific code for multiple targets. It is shown that the generated code is faster than manual implementations and those relying on hardware support for boundary handling. Implementations from RapidMind, a commercial framework for GPU programming, are outperformed, and similar results are achieved compared to the GPU backend of the widely used image processing library OpenCV.

Keywords—GPU; CUDA; OpenCL; domain-specific language; code generation; medical imaging; local operators

I. INTRODUCTION

Computer systems are increasingly heterogeneous, as many important computational tasks, such as multimedia processing, can be accelerated by special-purpose processors that outperform general-purpose processors by one or two orders of magnitude—importantly, in terms of energy efficiency as well as in terms of execution speed.

Until recently, every accelerator vendor provided their own application programming interface (API), typically based on the C language. For example, NVIDIA's API called CUDA targets systems accelerated with Graphics Processing Units (GPUs). In CUDA, the programmer dispatches compute-intensive data-parallel functions (kernels) to the GPU, and manages the interaction between the CPU and the GPU via API calls. Ryoo et al. [1] highlight the complexity of CUDA programming, in particular, the need for exploring thoroughly the space of possible implementations and configuration options. OpenCL, a new industry-backed standard API that inherits many traits from CUDA, aims to provide software portability across heterogeneous systems: correct OpenCL programs will run on any standard-compliant implementation. OpenCL per se, however, does not address the problem of performance portability; that is, OpenCL code optimized for one accelerator device may perform dismally on another, since performance may significantly depend on low-level details, such as data layout and iteration space mapping [2], [3]. CUDA and OpenCL are low-level programming languages, since the user has to take care of parallelizing the algorithm as well as of the communication between the host and the GPU accelerator. In general, low-level programming increases the cost of software development and maintenance: whilst low-level languages can be robustly compiled into efficient machine code, they effectively lack support for creating portable and composable software.

High-level languages with domain-specific features are more attractive to domain experts, who do not necessarily wish to become target system experts. To compete with low-level languages for programming accelerated systems, however, domain-specific languages should have an acceptable performance penalty.

We present the Heterogeneous Image Processing Acceleration (HIPAcc) framework for medical image processing in the domain of angiography that allows programmers to concentrate on developing algorithms and applications, rather than on mapping them to the target hardware. Specifically, as design entry to our framework, we defined a domain-specific language to describe image processing kernels on an abstract level. We identified three groups of operators that have to be supported by the domain-specific language for medical imaging: a) point operators, b) local operators, and c) global operators. Point operators are applied to the pixels of the image, and solely the pixel the point operator is applied to contributes to the operation (e. g., adding a constant to each pixel in the image). Local operators are similar to point operators, but also neighboring pixels contribute to the operation (e. g., replacing the current pixel value with the average value of its neighbors). In contrast, global operators (e. g., reduction operators) produce one output for the operator applied to all pixels of the image (e. g., compute the sum of all pixels). While the basic framework supporting point operators was presented earlier (see [4]), we focus on the support of local operators and the efficient mapping to target hardware in this paper.

Our framework is implemented as a library of C++ classes and will be summarized in Section II. The DSL of our framework is extended to describe local operators (Section III). Our Clang-based compiler translates this description into efficient low-level CUDA and OpenCL code as well as corresponding code to talk to the GPU accelerator (Section IV). The transformations applied to the target code and the run-time configuration are device-specific and take an abstract architecture model of the target graphics card hardware into account (Section V). This results in efficient code that is evaluated against highly optimized GPU frameworks from RapidMind and OpenCV, as well as hand-written implementations (Section VI). In Section VII, we compare our work against other approaches. We outline future extensions to our framework for code generation optimizations in Section VIII. Finally, we conclude our paper in Section IX.

II. IMAGE PROCESSING FRAMEWORK

In this paper, we use and extend our previous framework for describing image processing kernels [4]. The proposed framework uses a source-to-source compiler based on Clang [5] in order to generate low-level, optimized CUDA and OpenCL code for execution on GPU accelerators. The framework consists of built-in C++ classes that describe the following four basic components required to express image processing on an abstract level:

• Image: Describes data storage for the image pixels. Each pixel can be stored as an integer number, a floating point number, or in another format such as RGB, depending on the instantiation of this templated class. The data layout is handled internally using multi-dimensional arrays.
• Iteration Space: Describes a rectangular region of interest in the output image, for example the complete image. Each pixel in this region is a point in the iteration space.
• Kernel: Describes an algorithm to be applied to each pixel in the Iteration Space.
• Accessor: Describes which pixels of an Image are seen within the Kernel. Similar to an Iteration Space, the Accessor defines an Iteration Space on an input image.

These components are an instance of decoupled access/execute metadata [3]: the Iteration Space specification provides ordering and partitioning constraints (execute metadata); the Kernel specification provides a pattern of accesses to uniform memory (access metadata). Currently, the access/execute metadata is mostly implicit: we assume that the iteration space is independent in all dimensions and has a 1:1 mapping to work-items (threads), and that the memory access pattern is obvious from the kernel code.

A. Example: Bilateral Filter

In the following, we illustrate our image processing framework using a bilateral filter that smoothes an image while preserving edges within the image [6]. The bilateral filter applies a local operator to each pixel that consists of two components to determine the weighting of pixels: a) the closeness function that considers the distance to the center pixel, and b) the similarity function that takes the difference between pixel values into account. Gaussian functions of the Euclidean distance are used for the closeness and similarity weights, as seen in Equations (1) and (2), respectively.

closeness((x, y), (x′, y′)) = e^(−(1/2) · (‖(x, y) − (x′, y′)‖ / σd)²)    (1)

similarity(p, p′) = e^(−(1/2) · (‖p − p′‖ / σr)²)    (2)

The bilateral filter replaces each pixel by a weighted average of geometrically nearby and photometrically similar pixel values. Only pixels within the neighborhood of the relevant pixel are used. The neighborhood, and consequently also the convolution size, is determined by the geometric spread σd. The parameter σr (photometric spread) in the similarity function determines the amount of combination. Thereby, differences of pixel values below σr are considered more similar than differences above σr.

For each pixel in the output image, the filter kernel is applied to a neighborhood of pixels in the input image. When executed on a GPU, the required operations to produce one pixel in the output image are specified in a program. Such a program is also called a kernel; it is executed on the GPU in parallel and applied to each pixel of the image independently. The pseudo code for the bilateral filter executed on GPUs is shown in Algorithm 1. Here, the kernel itself is described in lines 3–13, defining the bilateral filter for one pixel, taking the pixels in its neighborhood into account. This code is executed sequentially, as can be seen in the doubly nested innermost loops. This is the code that is typically written by a programmer in either CUDA or OpenCL. The parallel execution on the graphics hardware is expressed by the parallel foreach loops in lines 1 and 2. While the first parallel foreach loop describes the data-parallel execution within groups of processors on the GPU (think of it as a Single Instruction Multiple Data (SIMD) unit), the second parallel foreach loop depicts the parallel execution on all available groups of processors. Note that this is a two-layered approach exploiting two different types of parallelism: the Single Program Multiple Data (SPMD) parallel execution model within groups of processors, and the Multiple Program Multiple Data (MPMD) parallel execution between multiple groups of processors. Although the graphics hardware supports the execution of different programs on different SIMD units, programs written in CUDA and OpenCL contain the same code for all SIMD units.

Algorithm 1: Bilateral filter implementation on the graphics card.
 1 foreach thread block b in grid g do in parallel
 2   foreach thread t in thread block b do in parallel
 3     x ← get_index_x(b, t);
 4     y ← get_index_y(b, t);
 5     p, k ← 0;
 6     for yf = −2 ∗ sigma_d to +2 ∗ sigma_d do
 7       for xf = −2 ∗ sigma_d to +2 ∗ sigma_d do
 8         c ← closeness((x, y), (x + xf, y + yf));
 9         s ← similarity(input[x, y], input[x + xf, y + yf]);
10         k ← k + c ∗ s;
11         p ← p + c ∗ s ∗ input[x + xf, y + yf];
12       end
13     end
14     output[x, y] ← p/k;
15   end
16 end
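The two foreach loops of Algorithm 1 map directly onto the CUDA execution model. As a minimal sketch of this mapping (our own illustration, not the code generated by the framework; c_r and c_d are the precomputed Gaussian factors 1/(2σr²) and 1/(2σd²), and clamping at the border is hard-coded for brevity):

__global__ void bilateral(const float *in, float *out, int width, int height,
                          int sigma_d, float c_r, float c_d) {
    // Lines 1/2 of Algorithm 1: one thread per output pixel.
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // get_index_x(b, t)
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // get_index_y(b, t)
    if (x >= width || y >= height) return;
    float p = 0.0f, k = 0.0f;
    for (int yf = -2*sigma_d; yf <= 2*sigma_d; ++yf) {
        for (int xf = -2*sigma_d; xf <= 2*sigma_d; ++xf) {
            int ix = min(max(x + xf, 0), width  - 1);  // clamp at the border
            int iy = min(max(y + yf, 0), height - 1);  // (cf. Section III)
            float q    = in[iy * width + ix];
            float diff = q - in[y * width + x];
            float c = expf(-c_d * (float)(xf*xf + yf*yf)); // closeness, Eq. (1)
            float s = expf(-c_r * diff * diff);            // similarity, Eq. (2)
            k += c * s;
            p += c * s * q;
        }
    }
    out[y * width + x] = p / k;
}

// Host side: blocks of 128x1 threads form the first (SPMD) level,
// the grid of blocks forms the second (MPMD) level:
//   dim3 block(128, 1);
//   dim3 grid((width + 127) / 128, height);
//   bilateral<<<grid, block>>>(d_in, d_out, width, height, 3, c_r, c_d);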
To express this filter in our framework, the programmer derives a class from the built-in Kernel class and implements the virtual kernel function, as shown in Listing 1. To access the pixels of an input image, the parenthesis operator () is used, taking the column (dx) and row (dy) offsets as optional parameters. The output image as specified by the Iteration Space is accessed using the output() method provided by the built-in Kernel class. The user instantiates the class with input image accessors, one iteration space, and other parameters that are member variables of the class.

 1 class BilateralFilter : public Kernel<float> {
 2   private:
 3     Accessor<float> &Input;
 4     int sigma_d, sigma_r;
 5
 6   public:
 7     BilateralFilter(IterationSpace<float> &IS, Accessor<float> &Input, int sigma_d, int sigma_r) :
 8       Kernel(IS), Input(Input), sigma_d(sigma_d), sigma_r(sigma_r)
 9     { addAccessor(&Input); }
10
11     void kernel() {
12       float c_r = 1.0f/(2.0f*sigma_r*sigma_r);
13       float c_d = 1.0f/(2.0f*sigma_d*sigma_d);
14       float c = 0.0f, d = 0.0f, p = 0.0f, s = 0.0f;
15
16       for (int yf = -2*sigma_d; yf<=2*sigma_d; yf++) {
17         for (int xf = -2*sigma_d; xf<=2*sigma_d; xf++) {
18           float diff = Input(xf, yf) - Input();
19
20           s = exp(-c_r * diff*diff);
21           c = exp(-c_d * xf*xf) * exp(-c_d * yf*yf);
22           d += s*c;
23           p += s*c * Input(xf, yf);
24         }
25       }
26       output() = p/d;
27     }
28 };

Listing 1: Kernel description of the bilateral filter.

In Listing 2, the input and output Image objects IN and OUT are defined as two-dimensional W × H grayscale images, having pixels represented as floating-point numbers (lines 8–9). The Image object IN is initialized with the host_in pointer to a plain C array with raw image data, which invokes the = operator of the Image class (line 12). The region of interest IsOut contains the whole image (line 15). Also the Accessor AccIn on the input image is defined on the whole image (line 18). The kernel is initialized with the iteration space object, accessor objects, and the parameters sigma_d and sigma_r of the bilateral filter (line 21), and executed by a call to the execute() method (line 24). To retrieve the output image, the host_out pointer to a plain C data array is assigned the Image object OUT, which invokes the getData() operator (line 27).

 1 const int width = 1024, height = 1024, sigma_d = 3, sigma_r = 5;
 2
 3 // pointers to raw image data
 4 float *host_in = ...;
 5 float *host_out = ...;
 6
 7 // input and output images
 8 Image<float> IN(width, height);
 9 Image<float> OUT(width, height);
10
11 // initialize input image
12 IN = host_in; // operator=
13
14 // define region of interest
15 IterationSpace<float> IsOut(OUT);
16
17 // accessor used to access image pixels
18 Accessor<float> AccIn(IN);
19
20 // define kernel
21 BilateralFilter BF(IsOut, AccIn, sigma_d, sigma_r);
22
23 // execute kernel
24 BF.execute();
25
26 // retrieve output image
27 host_out = OUT.getData();

Listing 2: Example code that initializes and executes the bilateral filter.

From this code, the source-to-source compiler can finally create CUDA and OpenCL code for execution on GPU accelerators automatically. In [4], we have presented an initial version of our framework, that is, how to generate efficient code for point operators by support for effective device-specific optimizations such as global memory padding for memory coalescing and optimal memory bandwidth utilization. In the remainder of this paper, we describe the extension of the framework, that is, how to utilize access/execute metadata for efficient code generation for local operators in medical imaging, and how to optimally map this code to different GPU accelerator architectures from AMD and NVIDIA.

III. LOCAL OPERATORS

In image processing, and in particular in medical imaging, local operators are widely used. These operators can be applied independently to all pixels of the image and depend only on the pixel itself and the pixels in its neighborhood. The neighborhood read by local operators in medical imaging is typically arranged symmetrically around the pixel to be calculated, with the same diameter in each direction. Furthermore, local operators describe the convolution of an image by a filter mask (e. g., a Gaussian function) with the following properties: a) the filter mask is centered at the pixel it is applied to [0, 0], and b) it is bounded to the neighborhood [−m, +m] × [−n, +n]. The latter implies the window size (2m + 1) × (2n + 1) of local operators to be uneven (e. g., 3 × 3, 5 × 5, 9 × 3). Since the filter mask used for convolution is typically constant, the values for the filter mask can be stored to a lookup table.

Figure 1 shows the filter masks used for a window size of 3 × 3: the pixels in the neighborhood (f) are convolved with the closeness (c) and similarity (s) filter masks. While the filter mask for the closeness component is constant and depends only on the distance to the center pixel, the filter mask for the similarity component depends on the pixel values and has to be calculated for each pixel separately.

Figure 1: Bilateral filter: Convolution of the image (f) with the closeness (c) and similarity (s) filter masks.
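Since the closeness mask is constant for a given σd, it can be precomputed once on the host and stored in such a lookup table. A minimal sketch of this computation (our own illustration, not framework code) for a (4σd + 1) × (4σd + 1) window:

#include <cmath>
#include <vector>

// Precompute the closeness coefficients of Eq. (1) into a lookup table.
// size = 4*sigma_d + 1, so offsets run from -2*sigma_d to +2*sigma_d.
std::vector<float> closeness_mask(int sigma_d) {
    const int radius = 2 * sigma_d, size = 4 * sigma_d + 1;
    const float c_d = 1.0f / (2.0f * sigma_d * sigma_d);
    std::vector<float> mask(size * size);
    for (int yf = -radius; yf <= radius; ++yf)
        for (int xf = -radius; xf <= radius; ++xf)
            mask[(yf + radius) * size + (xf + radius)] =
                std::exp(-c_d * (xf * xf + yf * yf)); // depends on distance only
    return mask;
}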
Figure 1 also indicates a problem when local operators are applied to an image: if the 3 × 3 filter mask is applied to a pixel located at the image border, pixels beyond the image are accessed. Accessing pixels out of bounds leads to erroneous pixel values in the resulting image and may result in program termination. Therefore, boundary handling modes need to be defined that determine the behavior when pixels beyond the image are accessed.

The remainder of this section discusses how properties of local operators are expressed in our framework that relieve developers from low-level programming of GPU accelerators. In the following sections, the benefit of this domain-specific knowledge for efficient low-level code generation for GPU accelerators using source-to-source compilation is shown.

A. Boundary Handling

When an image is accessed out of bounds, there are different possibilities to handle this situation: the image is virtually expanded beyond the border, and the pixel value of the expanded image is returned. This can be realized by a) adding additional pixels at each image boundary, or b) adjusting the index of the accessed pixel to a pixel that resides within the image. In our implementation, we follow the second approach, which is discussed in Section IV.

However, we will show that boundary handling is completely transparent to the programmer in our framework. The programmer specifies only the desired boundary handling mode on images, and the source-to-source compiler will handle the low-level details. Supported boundary handling modes are listed in Table I and visualized in Figure 2. These include the most important boundary handling modes commonly encountered in image processing.¹ In particular, mirroring is important in medical imaging, for example, when a multiresolution filter (see [7]) is applied to an image: the image gets upsampled multiple times, and large unnatural-looking artifacts occur at the border when the border pixel gets replicated repeatedly. In contrast, using mirroring leads to natural-looking images.

¹ However, if necessary, the framework can be easily extended to support further boundary handling modes.

Table I: Supported boundary handling modes.

Mode        Returned pixel value for out of bounds:
Undefined   not specified, undefined
Repeat      pixel value of image repeated at the border
Clamp       last valid pixel within image
Mirror      pixel value of image mirrored at the border
Constant    constant value, user defined

Instead of tying the boundary handling mode to a particular image, the framework provides Accessors to describe the way an image is accessed, as mentioned earlier. That is, no memory is allocated and held by Accessors. In case boundary handling is required, the Accessor defines the view on a BoundaryCondition object rather than on an Image. The BoundaryCondition itself specifies the boundary handling mode on an input image and the size of the local operator. Listing 3 shows the collaboration of BoundaryCondition and Accessor in order to define clamping as the boundary handling mode for a local operator with a window size of (4 ∗ σd + 1) × (4 ∗ σd + 1). The resulting Accessor can be used in the BilateralFilter implementation defined in Listing 1 and instantiated in Listing 2 to add boundary handling support. In case a constant boundary handling is specified, the BoundaryCondition requires the constant value as an additional parameter.

1 // input image and accessor with boundary handling
2 Image<float> IN(width, height);
3 BoundaryCondition<float> BcIn(IN, 4*sigma_d+1, 4*sigma_d+1, BOUNDARY_CLAMP);
4 Accessor<float> AccIn(BcIn);

Listing 3: Usage of a BoundaryCondition to define the boundary handling mode for a local operator.

The source-to-source compiler then creates the appropriate boundary handling support utilizing the access metadata provided by the framework. No further modifications are required. Tying the boundary handling mode to an Accessor instead of an Image has the additional benefit that multiple boundary handling modes can be defined on the same image. This allows different algorithms or kernels to access the same image, but to use different boundary handling modes—the most appropriate one for the current algorithm—without the need to keep separate copies of the image.
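The index-adjustment approach (option b above) can be sketched with a few helper functions; these are our own illustration of the semantics in Table I and Figure 2, not the code emitted by the compiler:

// Clamp: stick to the last valid pixel within the image.
inline int clamp_idx(int i, int n)  { return i < 0 ? 0 : (i >= n ? n - 1 : i); }

// Mirror: reflect the index at the border (valid for offsets up to n-1).
inline int mirror_idx(int i, int n) { return i < 0 ? -i - 1 : (i >= n ? 2*n - i - 1 : i); }

// Repeat: wrap around as if the image were tiled.
inline int repeat_idx(int i, int n) { return ((i % n) + n) % n; }

// Example: a clamped read of pixel (x+xf, y+yf) from a width*height image:
//   float v = in[clamp_idx(y + yf, height) * width + clamp_idx(x + xf, width)];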
[Figure 2 shows five example pixel grids, one per boundary handling mode.]
Figure 2: Boundary handling modes for image processing. By default, the behavior is undefined when the image is accessed out of bounds (a). The framework allows specifying different boundary handling modes like repeating the image (b), clamping to the last valid pixel (c), mirroring the image at the image border (d), and returning a constant value when accessed out of bounds (e).

B. Filter Masks

The framework also provides support for filter masks for local operators: a Mask holds the precalculated values used by the convolution filter function. Since the filter mask is constant for one kernel, this allows the source-to-source compiler to apply optimizations such as constant propagation to avoid redundant calculations. To define a Mask in the framework, the filter mask size has to be provided. Afterwards, the precalculated values can be assigned to the Mask. Listing 4 shows the usage of a Mask for the closeness filter mask.

1 // pre-calculated mask coefficients
2 float mask[] = {0.018316, ... };
3
4 // filter mask
5 Mask<float> CMask(4*sigma_d+1, 4*sigma_d+1);
6 CMask = mask;

Listing 4: Usage of a Mask to store the filter mask for a local operator.

Listing 5 shows the usage of the Mask within the kernel of the BilateralFilter: using the indices xf and yf, the precalculated values are retrieved from the closeness filter mask. Also note that the calculation of c_d is not necessary anymore.

 1 void kernel() {
 2   float c_r = 1.0f/(2.0f*sigma_r*sigma_r);
 3   float c = 0.0f, d = 0.0f, p = 0.0f, s = 0.0f;
 4
 5   for (int yf = -2*sigma_d; yf<=2*sigma_d; yf++) {
 6     for (int xf = -2*sigma_d; xf<=2*sigma_d; xf++) {
 7       float diff = Input(xf, yf) - Input();
 8
 9       s = exp(-c_r * diff*diff);
10       c = CMask(xf, yf);
11       d += s*c;
12       p += s*c * Input(xf, yf);
13     }
14   }
15   output() = p/d;
16 }

Listing 5: Using a Mask within the kernel to retrieve the filter mask coefficients for a local operator.

IV. SOURCE-TO-SOURCE COMPILATION

This section describes the transformations applied by our source-to-source compiler and the steps taken to create CUDA and OpenCL code from a high-level description of local operators and access metadata. Based on the latest Clang compiler framework [5], our source-to-source compiler uses the Clang frontend for C/C++ to parse the input files and to generate an abstract syntax tree (AST) representation of the source code. Our backend uses this AST representation to apply transformations and to generate host and device code in CUDA or OpenCL.
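The backend transformations described in the following subsections are implemented as AST visitors. As a rough sketch of this pattern (assuming Clang's RecursiveASTVisitor interface; the class name and the rewriting steps in the comments are our own, hypothetical illustration):

#include "clang/AST/RecursiveASTVisitor.h"

using namespace clang;

// Walks the kernel's AST; Clang invokes Visit* for every node of that type.
class KernelTranslator : public RecursiveASTVisitor<KernelTranslator> {
public:
    // Called for every overloaded-operator call, e.g. Input(xf, yf); a real
    // backend would check whether the callee belongs to an Image, Accessor,
    // or Mask object and rewrite the access to tex1Dfetch()/read_imagef()/
    // constant-memory reads accordingly.
    bool VisitCXXOperatorCallExpr(CXXOperatorCallExpr *E) {
        // ... inspect E->getDirectCallee(), emit target-specific access ...
        return true;  // continue the traversal
    }
};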
A. Local Operators

Since local operators and convolution functions typically read neighboring pixels to calculate the value of the result pixel, most of the neighboring pixels read for pixel p_{i,j} are also required for pixel p_{i+1,j}. This results in redundant fetches of the same pixels and imposes high pressure on the global memory bandwidth. To preserve global memory bandwidth, the region read by a group of threads can be a) staged into fast on-chip memory (scratchpad memory) and read from there afterwards, or b) accessed by a memory path that traverses a cache. In the latter case, only the first access to a pixel has the long latency of global memory. Subsequent accesses are served by the cache. In the former case, the data is staged into scratchpad memory, and memory accesses of the kernel go to the fast scratchpad memory. However, synchronization is required before the calculation can begin. This separates data transfer and calculation phases within the kernel. The benefit of massive multithreading provided by the underlying hardware is to hide memory transfers when data transfers and calculations are done at the same time. This benefit is lost when data is staged to scratchpad memory. Hence, staging to scratchpad memory only makes sense in case the benefit of data reuse exceeds the multithreading benefit. For local operators with small window sizes, this is rarely the case. Nonetheless, our source-to-source compiler supports both options.

Texturing memory: All graphics cards from AMD and NVIDIA support cached data access, either using texturing hardware or by default (on newer Fermi GPUs from NVIDIA). In CUDA, texturing hardware can be utilized by reading from a texture reference that is bound to global memory. In OpenCL, the texturing hardware is used when data is read from an image object. Therefore, accesses to Image(x, y) objects have to be mapped to the corresponding tex1Dfetch() and read_imagef() functions in CUDA and OpenCL, respectively. However, this is only valid if data is read. When data is written to an Image object, normal global memory array pointers are used in CUDA and the write_imagef() function in OpenCL. That is, prior to the mapping of Image accesses to the low-level equivalents, we perform a read/write analysis of the kernel method. Therefore, a control-flow graph (CFG) of the instructions in the kernel method is created and traversed afterwards. Access information is stored for each Image and Accessor object and used to select the appropriate texturing function call. This results in a mapping of read/write-accesses as shown in Listing 6. The mapping is done using a recursive AST-visitor. Whenever an Image or Accessor node is visited, the above described transformations are applied.

 1 // Read-access to Accessor
 2 IN(xf, yf)
 3 // CUDA read with offset
 4 tex1Dfetch(_texIN, gid_x+xf + (gid_y+yf)*stride)
 5 // OpenCL read with offset
 6 read_imagef(imgIN, Sampler, (int2)(gid_x+xf, gid_y+yf)).x
 7
 8 // Write-access to output() method
 9 output()
10 // CUDA write without offset
11 OUT[gid_x + gid_y*stride]
12 // OpenCL write without offset
13 write_imagef(imgOUT, (int2)(gid_x, gid_y), (float4) val);

Listing 6: Using texturing hardware to read from an Accessor.

The image object access functions in OpenCL always take and return vector elements with a size of four, although only one of the four components is required for the example above. Therefore, the CL_R channel order is used, which maps only one of the four components to memory and populates the remaining three channels with zeros. The corresponding extraction from and packing to vector elements is added by the framework as well. Once the accesses to Image objects are mapped, the low-level code can be emitted. When this is done, the corresponding CUDA texture reference and OpenCL sampler definitions are also created. The OpenCL kernel-function parameters for images are emitted with the corresponding read_only and write_only attributes obtained from the read/write analysis. Texture references are not added as kernel-function parameters in CUDA, since they are static and globally visible in CUDA.

Scratchpad memory: Current graphics cards provide fast on-chip scratchpad memory, also called shared memory in CUDA and local memory in OpenCL, that is shared between all threads of a SIMD unit. Adding the __shared__ (CUDA) and __local (OpenCL) keywords to memory declarations within a kernel allows to use this memory. Since the memory is shared between all threads mapped to one SIMD unit (number of threads » SIMD width), synchronization between threads mapped to a SIMD unit is required so that all threads have a consistent view of the scratchpad memory. This is done by the __syncthreads() (CUDA) and barrier() (OpenCL) functions. Only after all threads have reached this synchronization point does execution continue. Using scratchpad memory includes two phases: first, the data is staged from the GPU memory into scratchpad memory, and second, data accesses are redirected to the scratchpad. In Listing 7, the size of the scratchpad memory depends on BSY/BSX, the size of the 2D image subregion mapped to the SIMD unit, as well as on SY/SX, the image region accessed beyond the subregion within the local operator. A constant of 1 is added to BSX so that different banks of the scratchpad memory are accessed for row-based filters to avoid bank conflicts. The data from global memory is staged into scratchpad memory in multiple steps, depending on the size of the additional pixels required by the kernel. When data is read from the scratchpad memory, the thread identifiers of the threads mapped to one SIMD unit, threadIdx and get_local_id(), are used for the CUDA and OpenCL code, respectively.

 1 // Phase 1: stage data to scratchpad memory
 2 // CUDA
 3 __shared__ float _smemIN[SY + BSY][SX + BSX + 1];
 4 _smemIN[threadIdx.y][threadIdx.x] = IN[..];
 5 if (...) _smemIN[threadIdx.y + ..][threadIdx.x + ..] = IN[..];
 6 ..
 7 __syncthreads();
 8 // OpenCL
 9 __local float _smemIN[SY + BSY][SX + BSX + 1];
10 _smemIN[get_local_id(1)][get_local_id(0)] = IN[..];
11 if (...) _smemIN[get_local_id(1) + ..][get_local_id(0) + ..] = IN[..];
12 ..
13 barrier(CLK_LOCAL_MEM_FENCE);
14
15 // Phase 2: map accesses to scratchpad memory
16 IN(xf, yf)
17 // CUDA read with offset
18 _smemIN[threadIdx.y + yf][threadIdx.x + xf]
19 // OpenCL read with offset
20 _smemIN[get_local_id(1) + yf][get_local_id(0) + xf]

Listing 7: Staging pixels to scratchpad memory before accessing image pixels.
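Putting the two phases together, a complete (simplified) CUDA kernel for a 3 × 3 mean filter might look as follows; the block size, names, and the clamped staging logic are our own choices for illustration, not generated code:

#define BSX 32
#define BSY 8
#define R   1  // filter radius: 3x3 window

__global__ void mean3x3(const float *in, float *out, int width, int height) {
    // Phase 1: stage a (BSY+2R) x (BSX+2R+1) tile; +1 column avoids bank conflicts.
    __shared__ float smem[BSY + 2*R][BSX + 2*R + 1];
    int gx = blockIdx.x * BSX + threadIdx.x;
    int gy = blockIdx.y * BSY + threadIdx.y;
    // Each thread loads several tile elements; the strided loops cover the halo.
    for (int ty = threadIdx.y; ty < BSY + 2*R; ty += BSY)
        for (int tx = threadIdx.x; tx < BSX + 2*R; tx += BSX) {
            int ix = min(max(blockIdx.x*BSX + tx - R, 0), width  - 1);  // clamp
            int iy = min(max(blockIdx.y*BSY + ty - R, 0), height - 1);
            smem[ty][tx] = in[iy * width + ix];
        }
    __syncthreads();  // all staging must finish before any thread computes

    // Phase 2: compute from scratchpad only.
    if (gx < width && gy < height) {
        float sum = 0.0f;
        for (int yf = -R; yf <= R; ++yf)
            for (int xf = -R; xf <= R; ++xf)
                sum += smem[threadIdx.y + R + yf][threadIdx.x + R + xf];
        out[gy * width + gx] = sum / 9.0f;
    }
}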
B. Boundary Handling

Boundary handling for image processing kernels in CUDA/OpenCL is typically done with the help of texturing hardware. When a texture reference/sampler is defined, an address mode can be specified. This mode defines the behavior when textures are accessed out of bounds and can be set to clamp or repeat accesses. For all other modes, boundary handling code has to be added by the programmer. That is, conditional statements have to be added to check if the indices address pixels within the image. However, this goes along with large overheads: since the same code is executed by all threads (i. e., for each pixel), the conditional statements have to be evaluated for each pixel, although they are only required at the image border.

The access metadata obtained from the Accessor and BoundaryCondition definition introduced in Section III specifies the range of pixels for a local operator that require boundary handling code. This access metadata can be combined with the two-layered parallel execution model described in Algorithm 1 to reduce the overhead imposed by boundary handling code: a special boundary handling mode is added for each border—resulting in nine different kernel implementations as depicted in Figure 3. At the top-left image border, only the conditionals are added that check if the indices are less than zero; at the top image border, only the conditionals to check if the y-index is less than zero; and so on. For the largest part of the image, no conditionals have to be added at all. Launching the nine specialized implementations sequentially for the different regions of the image would decrease the total performance due to the launch overhead of the kernels and the relatively small amount of work per kernel at the image borders. Instead, the source-to-source compiler creates one big kernel that hosts all nine implementations, but executes only the required one depending on the currently processed image region.

[Figure 3 shows the image partitioned into nine regions—top-left (TL), top (T), top-right (TR), left (L), interior (NO), right (R), bottom-left (BL), bottom (B), bottom-right (BR)—each handled by its own border handling variant.]
Figure 3: Kernel assignment to image border: border handling for local operators is limited to those regions where the image is accessed out of bounds.

The current region is identified by the blockIdx.x and blockIdx.y built-in variables in CUDA and by the get_group_id() functions in OpenCL. Whether boundary handling is required for these regions depends on the size of the block processed by one SIMD unit (blockDim.x/blockDim.y and get_local_size(), respectively), on the width/height of the image, on the resulting number of blocks per image (gridDim.x/gridDim.y and get_num_groups(), respectively), as well as on the size of the filter mask as specified by the Accessors and BoundaryCondition. In case multiple Accessors are used within one kernel, the largest window size specified is taken. This results in low-level source code as seen in Listing 8 for a local operator reading 13 × 13 pixels and a block size of 128 × 1. The conditional statements are also added during traversal of the AST using a visitor.

 1 __global__ void kernel(...) {
 2   if (blockIdx.x < 1 && blockIdx.y < 6) goto TL_BH;
 3   if (...) goto TR_BH;
 4   if (...) goto T_BH;
 5   ...
 6   goto NO_BH;
 7
 8 TL_BH:
 9   // kernel implementation with border handling for top left border
10   ...
11   return;
12   ...
13 NO_BH:
14   // kernel implementation with no border handling
15   ...
16 }

Listing 8: Using specialized border handling implementations for local operators in CUDA.

C. Filter Masks

Since the same CUDA/OpenCL kernel is executed for all pixels and, hence, the same element of the filter mask is retrieved by all threads within one thread block, it is sufficient to load the coefficient only once for a thread block. For such access patterns, the constant memory is optimized. Constant memory is cached and broadcasts data to all threads if all threads access the same element. The source-to-source compiler stores the filter coefficients of Mask objects to constant memory in CUDA and OpenCL. Constant memory can be declared either globally and initialized statically, or used as dynamically allocated memory. Our framework supports both statically and dynamically initialized memory. Static constant memory can only be utilized if the mask coefficients are constants and can be precomputed at compile time. In this case, the source-to-source compiler can declare a statically initialized constant memory array using the __device__ __constant__ qualifiers in CUDA and the __constant qualifier in OpenCL. For statically initialized constant memory, no kernel-function parameter is added to the kernel function. In case the Mask object uses coefficients that are not constant at compile time, dynamically initialized constant memory is used. In CUDA, the constant memory is only declared and gets initialized at run-time using the cudaMemcpyToSymbol() function. Dynamically initialized constant memory is handled like normal global memory in OpenCL and is stored to a buffer object; only the __constant qualifier is added to the kernel-function parameter. During AST-traversal, accesses to Mask objects are translated to accesses to the constant memory arrays created for the Mask coefficients.
V. DEVICE-SPECIFIC MAPPING

In this section, the selection of target-specific optimizations is described. The decision which optimization should be applied for which target (hardware and backend) is based on our own micro-benchmarks for typical kernel candidates from the medical domain, as well as on other micro-benchmarks available online for CUDA [8] and OpenCL [9].

A. Function Mapping

Since the GPU cannot execute arbitrary functions, the source-to-source compiler has to check every function call in the kernel method. The basic functions supported by CUDA and OpenCL are roughly the same, but differ in an important point: while CUDA preserves the suffix of mathematical functions that denotes the data type the function operates on, OpenCL removes these suffixes and overloads the mathematical functions for different data types. That is, suffixes are not allowed in OpenCL code. Therefore, our source-to-source compiler supports the mapping of function calls to the corresponding functions on different backends. For example, the expf() function gets mapped to exp() when code is generated for OpenCL. The mappings are defined in a file, and by default all mathematical functions supported by CUDA and OpenCL are listed therein. In case a function is not supported, our compiler emits an error message to the user. This technique can also be used to map mathematical functions to faster intrinsic functions on graphics cards; for example, the expf() function can be mapped to the hardware-accelerated __expf() function in CUDA. Although this feature is supported by our source-to-source compiler, we do not use it within this paper to select faster, hardware-accelerated intrinsic functions instead of the standard functions.
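Such a mapping can be as simple as a lookup table consulted during AST traversal; a sketch in C++ (the entries are examples, not the framework's complete table):

#include <map>
#include <string>

// Maps CUDA function names to their OpenCL counterparts (and, optionally,
// to faster CUDA intrinsics). Consulted for every call expression visited.
const std::map<std::string, std::string> cuda_to_opencl = {
    {"expf",  "exp"},   // data-type suffixes are dropped in OpenCL
    {"sqrtf", "sqrt"},
    {"powf",  "pow"},
};

const std::map<std::string, std::string> cuda_to_intrinsic = {
    {"expf", "__expf"},  // hardware-accelerated, lower precision
};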
as described in Algorithm 2: first, all invalid configurations
B. Optimization Selection are removed, and the remaining configurations are sorted
The knowledge we get from our micro-benchmarks as well by descending occupancy. If no border handling code was
as those from external ones are stored in a database that generated, we use the configuration with the highest occupancy
is utilized by the source-to-source compiler to decide what and give precedence to the x-component for tiling, that is,
optimization should be applied for which a) target hardware we get 1D-configurations like 128 × 1 or 256 × 1. In case
and b) backend. This includes the amount of padding required multiple configurations have the same occupancy, the one with
for optimal memory bandwidth utilization, whether texture the lowest number of threads is chosen. Such configurations
memory is beneficial, or whether constant memory should are typically selected by expert programmers and yield good
be initialized statically or dynamically. When the source-to- performance for most kernels. In case border handling code
source compiler is invoked, a compiler flag is used to indicate was generated by our source-to-source compiler, the tiling
if CUDA or OpenCL low-level code should be generated. strategy is different. Here, the heuristic tries to minimize the
The target graphics card hardware is specified the same way. number of threads that execute code with conditionals for
Currently, the compiler database contains information about boundary handling. Since the boundary handling is typically
all available CUDA-capable graphics cards as specified by the symmetric (e. g., 3 × 3 or 5 × 5), the minimal size for the x-
compute capability and AMD GPUs of the Radeon HD 6900 configuration of the SIMD width is in most cases sufficient
and HD 5800 series (VLIW4 and VLIW5 architecture). and the y-configuration is preferred instead. When more than
one configuration is available with high occupancy, we select
C. Automatic Configuration Selection the one that minimizes the number of threads with boundary
One common problem when writing CUDA and OpenCL handling conditionals (e. g., we prefer a configuration of 32×6
programs is to select an appropriate kernel configuration, over 32 × 4 for a window size of 13 × 13, a configuration of
which defines the number of threads that are mapped to 32×3, however, would be preferred to the two aforementioned
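The spirit of this selection can be captured in a few lines of host code; a simplified sketch (the occupancy and border-thread counts stand in for the real resource model):

#include <vector>
#include <algorithm>

struct Config { int x, y; double occupancy; int border_threads; };

// Pick a configuration following the paper's heuristic: highest occupancy
// first; among equals, fewest threads (no border handling) or fewest
// border-handling threads (with border handling).
Config select_config(std::vector<Config> cfgs, bool border_handling) {
    std::sort(cfgs.begin(), cfgs.end(), [&](const Config &a, const Config &b) {
        if (a.occupancy != b.occupancy) return a.occupancy > b.occupancy;
        if (border_handling) return a.border_threads < b.border_threads;
        return a.x * a.y < b.x * b.y;
    });
    return cfgs.front();  // assumes invalid configurations were filtered before
}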
Once an optimal configuration and tiling for a kernel have been determined according to the presented heuristic, the source-to-source compiler can emit the kernel configuration for invocations of the kernel. In case of border handling, the final kernel code is generated after the kernel configuration and tiling are determined. This is necessary since the constants in Listing 8 that determine what code to execute by which block rely on the tiling of the kernel configuration. The initial kernel code that is used to determine the resource usage uses default constants.

Algorithm 2: Heuristic for selecting optimal kernel configuration and tiling depending on resource usage, border handling size, and target graphics card.
Input: Kernel K, set of configurations C for GPU G
Output: Optimal configuration copt and tiling topt for kernel K
 1 C ← configurations of C multiple of SIMD width of G;
 2 C ← configurations of C within resource limitations of G;
 3 C ← sorted configurations of C with descending occupancy and ascending number of threads;
 4 if border handling then
 5   copt ← first configuration of C;
 6   topt ← tiling of copt, prefer y over x;
 7   threads_bhopt ← calculate number of threads for border handling for tiling topt for kernel K;
 8   C′ ← configurations with highest occupancy;
 9   foreach configuration c of C′ do
10     t ← tiling of c, prefer y over x;
11     threads_bh ← calculate number of threads for border handling for tiling t for kernel K;
12     if threads_bh < threads_bhopt then
13       copt ← c;
14       topt ← t;
15       threads_bhopt ← threads_bh;
16     end
17   end
18 else
19   copt ← first configuration of C;
20   topt ← tiling of copt, prefer x over y;
21 end

D. Configuration Exploration

To evaluate our heuristic that selects kernel configurations and tiling, our source-to-source compiler can generate code that explores all possible configurations for a given kernel. This allows to assess the heuristic easily. Therefore, the source-to-source compiler replaces the kernel preparation (setting arguments and configuration, binding textures, etc.) and invocation by two nested loops iterating over all valid configurations for the current kernel. In this mode, the constants that determine what code is executed by which block are replaced by macros. The macros are set at run-time when the kernel source code is compiled just-in-time. Since CUDA does not support just-in-time compilation, our run-time library provides a wrapper to nvcc to allow this.

VI. EVALUATION AND RESULTS

In this section, the source-to-source compiler and the device-specific code generation are evaluated and compared against other approaches. Of particular interest are the support for boundary handling and its efficient implementation in different frameworks, as well as the support for filter masks.

A. Boundary Handling, Filter Masks

First, we compare the code that is generated from our description with manual implementations and then with those from RapidMind [10] and OpenCV [11].

1) Manual Implementation: The basic version of the manual implementations uses straightforward CUDA/OpenCL code. These versions are then subsequently improved to utilize linear texture memory in CUDA (image objects in OpenCL), constant memory to store the filter masks, and combinations of both. There are two versions using texture memory: one using the texturing hardware only to read data (called Tex/Img), and one which uses the texturing hardware for boundary handling (called Tex2D/ImgBH). Both OpenCL versions use image objects to store images. The CUDA version using texturing hardware to read data uses linear memory, while the second version using textures for boundary handling relies on 2D arrays, which support boundary handling in two dimensions. Using texturing hardware for boundary handling, the following boundary handling modes are supported in CUDA: a) Clamp and b) Repeat. Using OpenCL, the following modes are supported: a) Clamp, b) Repeat, and c) Constant. However, the constants can only be floating point values of either 0.0 or 1.0.

The different implementation variants are summarized in Tables II (Tesla C2050, CUDA), III (Tesla C2050, OpenCL), IV (Quadro FX 5800, CUDA), V (Quadro FX 5800, OpenCL), VI (Radeon HD 5870, OpenCL), and VII (Radeon HD 6970, OpenCL). It can be seen in particular that our generated CUDA code outperforms even the code that utilizes texturing hardware for boundary handling. The difference between the boundary handling modes is small, while the performance of the manual implementation varies significantly (up to a factor of two). Note that on the Tesla hardware, the implementations accessing unallocated memory (undefined boundary handling) crash. The benefit of texturing hardware in OpenCL is not present anymore, since no linear memory can be used (which is of higher performance compared to arrays for the considered applications). On GPUs from AMD, the different optimizations have a varying impact on performance that is not predictable. We attribute this to the current implementations, which are scalar and do not utilize the VLIW4 or VLIW5 hardware architecture. We expect to get a more predictable behavior once vector operations are used (see also Section VIII).

2) RapidMind: RapidMind, a multi-core development platform, allows to describe the considered filter in the same way as our framework. Actually, the kernel description differs only by the keywords and data types used in our DSL and in RapidMind [10]. In Tables II and IV, the performance of the RapidMind implementation is listed in addition to the manual implementations and the generated code for the graphics cards supporting CUDA. It can be seen that our generated code outperforms the one of RapidMind by a factor of two, even without boundary handling. Just-in-time compilation overhead is not included in the timing of RapidMind, which was verified using the CUDA profiler. Using Repeat as boundary handling, the program crashes on the Tesla and is by a factor of three slower on the Quadro compared to the variant with undefined boundary handling.
3) OpenCV: The Open Source Computer Vision (OpenCV) library [11] is one widely used library for image processing and provides support for GPU accelerators using CUDA. Previously, we have shown that we can generate efficient code for point operators that outperforms the implementations of OpenCV (which did not support boundary handling and are wrapped calls to the NVIDIA Performance Primitives (NPP) library) [4]. Since then, OpenCV has added low-level CUDA implementations for row-based and column-based (separable) kernels like Gaussian and Sobel filters. Their implementation stages image data to shared memory and utilizes precalculated masks. In addition, OpenCV maps multiple output pixels to the same thread on the GPU in order to minimize scheduling overheads and maximize data reuse. Here, we investigate two kernels implemented in such a way in OpenCV, namely the Gaussian and Sobel filters, and compare them against our generated code with boundary handling. Tables VIII (Tesla C2050) and IX (Quadro FX 5800) show the results for different boundary handling modes for the Gaussian filter (note that the Sobel filter uses the same implementation and has the same performance). For OpenCV, two implementations are added: one with the original implementation, mapping eight output pixels to one thread (PPT=8), and one with a one-to-one mapping (PPT=1). Mapping multiple output pixels to the same thread gives a significant performance boost to the OpenCV implementation, but the performance varies a lot—depending on the boundary handling mode. In contrast, the execution times of the different boundary handling modes are constant for the code generated by our framework. The performance of the generated code is about as fast as the OpenCV implementation using the simple one-to-one mapping. All our implementations use constant memory for the mask coefficients and the automatic kernel configuration as determined by our framework. Texture memory has only marginal benefits for these kernels, and scratchpad memory slows the kernel down for reasons mentioned earlier.

Table II: Execution times in ms for the bilateral filter on a Tesla C2050 using the CUDA backend with manual border handling implementation for an image of 4096 × 4096 pixels and a filter window size of 13 × 13 (σd = 3). Kernel configuration is 128 × 1 for all kernels.

              Undef.   Clamp    Repeat   Mirror   Const.
Manual        crash    302.27   363.96   321.81   568.46
+Tex          260.03   285.61   362.70   310.61   520.25
+2DTex        272.39   272.40   300.56   n/a      n/a
+Mask         crash    214.51   281.89   225.88   481.76
+Mask+Tex     170.79   192.46   259.26   205.29   425.13
+Mask+2DTex   181.19   181.19   203.13   n/a      n/a
Generated     crash    285.29   298.29   289.22   291.26
+Tex          276.76   265.36   285.57   278.04   268.01
+Mask         crash    181.45   200.66   193.16   197.23
+Mask+Tex     172.60   182.80   180.38   173.59   175.52
RapidMind     430.95   489.94   crash    n/a      539.69
+Tex          456.35   514.63   crash    n/a      518.49

Table III: Execution times in ms for the bilateral filter on a Tesla C2050 using the OpenCL backend with manual border handling implementation for an image of 4096 × 4096 pixels and a filter window size of 13 × 13 (σd = 3). Kernel configuration is 128 × 1 for all kernels.

              Undef.   Clamp    Repeat   Mirror   Const.
Manual        449.86   485.60   552.83   504.39   505.11
+Img          465.48   487.80   557.88   501.18   508.28
+ImgBH        452.15   452.39   464.07   n/a      452.24
+Mask         215.23   250.67   331.11   261.05   267.62
+Mask+Img     228.29   251.51   322.61   264.54   288.08
+Mask+ImgBH   214.68   227.74   215.07   n/a      215.07
Generated     453.78   466.49   474.86   455.59   467.05
+Img          463.62   466.61   472.67   468.43   466.62
+Mask         217.95   215.61   222.78   220.27   220.16
+Mask+Img     219.49   219.64   238.81   220.28   232.57

Table IV: Execution times in ms for the bilateral filter on a Quadro FX 5800 using the CUDA backend with manual border handling implementation for an image of 4096 × 4096 pixels and a filter window size of 13 × 13 (σd = 3). Kernel configuration is 128 × 1 for all kernels.

              Undef.   Clamp    Repeat   Mirror   Const.
Manual        319.67   349.32   394.96   393.00   779.68
+Tex          310.22   336.46   369.74   378.47   590.18
+2DTex        330.50   330.49   369.06   n/a      n/a
+Mask         224.56   321.55   323.50   321.46   778.48
+Mask+Tex     199.11   237.60   271.45   278.89   497.75
+Mask+2DTex   214.53   215.53   348.92   n/a      n/a
Generated     321.24   331.36   404.81   332.17   436.77
+Tex          312.71   313.74   356.52   316.08   383.19
+Mask         225.58   227.65   281.82   228.18   290.78
+Mask+Tex     200.55   204.45   218.22   204.53   246.96
RapidMind     737.69   862.86   2352.34  n/a      989.55
+Tex          679.52   734.48   2226.33  n/a      805.62

Table V: Execution times in ms for the bilateral filter on a Quadro FX 5800 using the OpenCL backend with manual border handling implementation for an image of 4096 × 4096 pixels and a filter window size of 13 × 13 (σd = 3). Kernel configuration is 128 × 1 for all kernels.

              Undef.   Clamp    Repeat   Mirror   Const.
Manual        439.55   504.79   537.04   528.47   770.34
+Img          509.95   529.39   560.77   550.43   732.55
+ImgBH        509.82   509.33   509.38   n/a      509.65
+Mask         355.70   455.69   458.90   452.71   775.83
+Mask+Img     468.94   466.67   467.19   464.62   708.93
+Mask+ImgBH   468.00   470.04   468.80   n/a      470.46
Generated     446.24   449.67   514.89   453.68   460.68
+Img          511.38   512.50   553.23   511.78   654.08
+Mask         354.93   357.77   407.01   357.72   384.30
+Mask+Img     466.26   465.70   522.53   461.56   539.77

Table VI: Execution times in ms for the bilateral filter on a Radeon HD 5870 using the OpenCL backend with manual border handling implementation for an image of 4096 × 4096 pixels and a filter window size of 13 × 13 (σd = 3). Kernel configuration is 128 × 1 for all kernels.

              Undef.   Clamp    Repeat   Mirror   Const.
Manual        334.96   408.36   404.83   419.59   440.64
+Img          353.93   385.23   405.81   396.45   484.25
+ImgBH        353.93   353.91   353.96   n/a      353.95
+Mask         311.85   397.40   434.36   408.32   402.59
+Mask+Img     341.23   373.93   400.71   375.48   444.36
+Mask+ImgBH   341.25   341.24   341.24   n/a      341.27
Generated     342.67   354.49   472.20   355.57   351.83
+Img          372.14   376.91   482.28   382.71   446.98
+Mask         326.22   357.96   487.53   359.72   348.77
+Mask+Img     350.56   364.34   481.76   364.39   428.22
Table VII: Execution times in ms for the bilateral filter on
a Radeon HD 6970 using the OpenCL backend with man- Table VIII: Execution times in ms for the Gaussian filters from
ual border handling implementation for an image of 4096 × OpenCV on the Tesla C2050 and our generated implementa-
4096 pixels and a filter window size of 13×13 (σd = 3). Kernel tions using the CUDA and OpenCL backends for an image
configuration is 128 × 1 for all kernels. of 4096 × 4096 pixels and different filter window sizes.

Table VII: Execution times in ms for the bilateral filter on a Radeon HD 6970 using the OpenCL backend with manual border handling implementation for an image of 4096 × 4096 pixels and a filter window size of 13 × 13 (σd = 3). Kernel configuration is 128 × 1 for all kernels.

               Undef.   Clamp    Repeat   Mirror   Const.
Manual         286.29   337.13   375.11   346.18   381.76
 +Img          286.38   319.20   364.59   328.12   435.16
 +ImgBH        286.44   286.44   286.43   n/a      286.46
 +Mask         265.57   332.41   387.81   340.59   349.37
 +Mask+Img     268.26   310.84   349.31   311.42   387.73
 +Mask+ImgBH   268.20   268.23   268.20   n/a      268.24
Generated      291.30   309.52   470.90   322.69   321.19
 +Img          303.36   298.50   465.30   305.38   438.74
 +Mask         289.33   296.20   467.76   332.91   314.05
 +Mask+Img     279.66   291.49   474.60   291.58   414.31

Table VIII: Execution times in ms for the Gaussian filters from OpenCV on the Tesla C2050 and our generated implementations using the CUDA and OpenCL backends for an image of 4096 × 4096 pixels and different filter window sizes.

Gaussian: 3 × 3
                 Clamp   Repeat   Mirror   Const.
OpenCV: PPT=8    5.10    6.36     8.09     6.75
OpenCV: PPT=1    9.44    11.85    15.97    12.36
CUDA(Gen)        7.00    7.53     7.21     7.10
CUDA(+Tex)       7.00    7.44     7.17     7.13
CUDA(+Smem)      7.73    8.09     8.02     8.00
OpenCL(Gen)      9.26    9.70     9.40     9.33
OpenCL(+Tex)     13.41   13.62    13.33    13.16
OpenCL(+Lmem)    11.29   11.46    11.12    11.13

Gaussian: 5 × 5
                 Clamp   Repeat   Mirror   Const.
OpenCV: PPT=8    5.11    6.36     8.10     6.76
OpenCV: PPT=1    9.45    11.88    15.99    12.37
CUDA(Gen)        8.84    9.86     9.47     9.45
CUDA(+Tex)       8.94    9.72     9.35     9.47
CUDA(+Smem)      9.38    9.59     9.44     9.55
OpenCL(Gen)      10.88   11.82    11.13    10.44
OpenCL(+Tex)     14.96   15.87    15.17    15.12
OpenCL(+Lmem)    13.24   13.72    13.35    13.22

Table IX: Execution times in ms for the Gaussian filters from OpenCV on the Quadro FX 5800 and our generated implementations using the CUDA and OpenCL backends for an image of 4096 × 4096 pixels and different filter window sizes.

Gaussian: 3 × 3
                 Clamp   Repeat   Mirror   Const.
OpenCV: PPT=8    4.86    5.82     10.46    6.22
OpenCV: PPT=1    7.63    9.22     20.98    9.79
CUDA(Gen)        8.60    8.63     8.64     8.67
CUDA(+Tex)       8.55    8.58     8.60     8.63
CUDA(+Smem)      11.83   11.83    11.84    11.90
OpenCL(Gen)      13.58   13.47    13.10    13.46
OpenCL(+Img)     15.42   15.47    15.06    15.24
OpenCL(+Lmem)    17.84   17.86    17.91    18.35

Gaussian: 5 × 5
                 Clamp   Repeat   Mirror   Const.
OpenCV: PPT=8    4.90    5.87     10.45    6.22
OpenCV: PPT=1    7.64    9.22     20.98    9.79
CUDA(Gen)        9.88    9.95     9.95     10.12
CUDA(+Tex)       9.91    9.97     9.98     10.20
CUDA(+Smem)      14.36   14.36    14.37    14.43
OpenCL(Gen)      16.14   16.26    16.18    16.60
OpenCL(+Img)     18.38   18.44    18.33    18.65
OpenCL(+Lmem)    23.61   23.62    23.62    24.13

B. Configuration Exploration

To verify that the heuristic presented in Algorithm 2 for automatic kernel configuration selects a good configuration and tiling, we generate code for the bilateral filter using the CUDA backend on the Tesla C2050, explore all valid configurations, and visualize them in Figure 4 (note that the configuration with 32 threads, with an execution time of 425 ms, is not shown). Multiple points with the same number of threads denote a different tiling for that configuration. The configuration selected by our framework, 32 × 6, is in this case also the optimal configuration. This is not always the case, but the configurations selected by our heuristic are typically within 10 % of the best configuration.
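A host-side sketch of such an exhaustive exploration is shown below. The timing routine is a synthetic stand-in; in the real setup it would launch and time the generated kernel for each block size:

#include <cstdio>

// Stand-in cost model for illustration only; in practice this function
// would launch the generated kernel with an x-by-y thread block and
// report the measured execution time in ms.
static float time_kernel(int x, int y) {
    float threads = static_cast<float>(x * y);
    return 160.0f + 8000.0f / threads + 0.05f * threads;
}

int main() {
    float best = 1e30f;
    int best_x = 0, best_y = 0;
    for (int x = 32; x <= 1024; x += 32) {        // warp-aligned widths
        for (int y = 1; x * y <= 1024; ++y) {     // device limit on block size
            float ms = time_kernel(x, y);
            if (ms < best) { best = ms; best_x = x; best_y = y; }
        }
    }
    std::printf("best configuration: %d x %d (%.2f ms)\n", best_x, best_y, best);
    return 0;
}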
[Figure 4: scatter plot of execution time in ms (y-axis, 160-240) over the number of threads per block (x-axis, 128-1024); the annotated optimum of 167.94 ms is reached at the configuration 32 × 6.]

Figure 4: Configuration space exploration for the bilateral filter (filter window size: 13 × 13) for an image of 4096 × 4096 on the Tesla C2050. Shown are the execution times for processing the bilateral filter in dependence on the blocksize.
C. Discussion

Although the presented framework focuses on programmability for domain experts in medical imaging, it offers decent performance on GPUs from different manufacturers. The domain experts can express algorithms in a high-level language tailored to their domain. This allows high productivity and the mapping to different target hardware platforms from the same algorithm description. The development time for the manual implementations is in the range of several days (for non-GPU experts even weeks), while the DSL description takes only a couple of minutes.

For example, the source-to-source compiler generates a CUDA kernel with 317 lines of code for the kernel description shown in Listing 5 (16 lines of code). This stems from nine different kernel implementations for the top, top-right, right, etc. image borders, plus index adjustments for boundary handling. In addition, the generated code depends on the filter window size and image size. Writing such code by hand is often error-prone and tedious.
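The size of the generated code is largely a consequence of removing per-access border checks from the interior. For comparison, a generic clamped access in plain C++ (illustrative names, not the generated code itself) has to guard every single read:

// Generic Clamp border handling: every access pays two min/max range
// checks. The generated specialized kernels restrict these checks to the
// nine border regions and keep the interior kernel check-free.
inline int clamp(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

inline float read_clamped(const float *img, int width, int height, int x, int y) {
    return img[clamp(y, 0, height - 1) * width + clamp(x, 0, width - 1)];
}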
VII. RELATED WORK

The work closest to ours is the RapidMind multi-core development platform [10], targeting standard multi-core processors as well as accelerators like the Cell B.E. and GPUs. The RapidMind technology is based on Sh [12], a high-level metaprogramming language for graphics cards. RapidMind provides its own data types that can be arranged in multi-dimensional arrays. Accessors can be used to define the boundary handling properties of the underlying data. A two-dimensional array in RapidMind corresponds to an Image object in our framework. In addition to the boundary handling modes supported in RapidMind, we also support mirroring at the image border, a widely used boundary handling mode in medical imaging. Programs that operate on arrays are identified by special keywords in RapidMind, while we use compiler-known C++ classes to express image processing kernels. Within a RapidMind program, neighboring elements can be accessed using the shift() method on input data. Since no details on code generation for border handling are publicly available for RapidMind, we can compare our approach only quantitatively with that of RapidMind.

In 2009, Intel acquired RapidMind and incorporated the RapidMind technology into Intel ArBB (Array Building Blocks) [13]. Since then, RapidMind has been discontinued, as has Sh. The focus of Intel's ArBB is on vector parallel programming, and for that reason image processing features of RapidMind like generic boundary handling support were not adopted. In ArBB, only a constant value is returned when arrays are accessed out of bounds. To access neighboring elements, the currently processed element and the offset are passed to the neighbor() function. Using the position() function, the position within the n-dimensional iteration space can be retrieved and used to implement more sophisticated boundary handling modes. However, this incurs large overheads on GPUs, which we remedy by exploiting multiple levels of parallelism in our code-generation backend. When the RapidMind technology was merged with Intel's Ct, the backend for graphics cards was dropped and is no longer supported.

Besides language-based frameworks, there exist library-based frameworks like OpenCV [11] and the NVIDIA Performance Primitives (NPP) library. These libraries allow the use of predefined kernels. However, to offload new algorithms that are not available, low-level code has to be written.

Other compiler-based approaches also allow offloading code to GPU accelerators. The input to such compilers is typically sequential C or basic CUDA, as well as annotations describing transformations applied to the code. Examples are HMPP Workbench [14], PGI Accelerator [15], hiCUDA [16], and CUDA-lite [17], just to name a few. In order to obtain decent performance using these compiler-based approaches, the programmer has to know which compiler transformations can be applied and how to rewrite code to make such transformations possible. Algorithm designers and domain experts, however, have only little knowledge of the underlying hardware and compiler transformations. As a consequence, the full potential of such frameworks is only rarely exploited. In the proposed DSL, however, the required metadata is implicitly given by the DSL syntax and does not have to be specified separately.

Our framework is most similar in spirit to Cornwall et al.'s work on indexed metadata for visual effects [18], but introduces additional device-specific optimizations such as global memory padding for memory coalescing, support for boundary handling, and the heuristic for automatic kernel configuration selection.

The main contributions of this work include a) a domain-specific description of local operators in medical imaging, b) a new code generation framework that utilizes a two-layered parallelization approach exploiting both SPMD and MPMD parallelism on current graphics card architectures, and c) a heuristic for automatic kernel configuration selection and tiling. The presented approach is not limited to algorithms stemming from the medical domain, but can also be utilized for other application domains, in particular the two-layered parallelization approach.
VIII. OUTLOOK

The current compiler optimizations for local operators can be further extended to unroll the loops of convolutions and to propagate the constants of the filter masks. To do so, we defined a syntax using lambda-functions as seen in Listing 9. However, Clang, on which our source-to-source compiler is based, does not yet support lambda-functions. As soon as this support is available, we will also be able to support constant propagation and loop unrolling. For global operators, we are looking for a similar syntax that allows the programmer to define operations that merge/reduce two pixels.

void kernel() {
  output() = convolve(cMask, SUM, [&] () {
    return cMask() * Input(cMask);
  });
}

Listing 9: Using a lambda-function and the framework-provided convolve function to express convolution kernels.
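To sketch the intended effect of these optimizations (this is illustrative C++, not output of the current compiler), fully unrolling a 3 × 3 binomial mask and propagating its constants would reduce the convolution to straight-line code such as:

// Illustrative result of loop unrolling plus constant propagation for a
// 3x3 binomial mask (1 2 1; 2 4 2; 1 2 1)/16: the loop nest disappears
// and all coefficients become literals. Border handling omitted for brevity.
inline float conv3x3_unrolled(const float *in, int width, int x, int y) {
    const float *r0 = in + (y - 1) * width + x;
    const float *r1 = in + y * width + x;
    const float *r2 = in + (y + 1) * width + x;
    return 0.0625f * (       r0[-1] + 2.0f * r0[0] +        r0[1]
                     + 2.0f * r1[-1] + 4.0f * r1[0] + 2.0f * r1[1]
                     +        r2[-1] + 2.0f * r2[0] +        r2[1]);
}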
Furthermore, we are looking into vectorization for graphics cards from AMD so that the impact of the transformations described in this paper becomes visible there as well. First manual vectorization experiments show that performance improves significantly on graphics cards from AMD.

IX. CONCLUSIONS

We presented a domain-specific description for local operators in medical imaging and its efficient mapping to low-level CUDA and OpenCL code. Based on the metadata provided by the programmer, two-layered parallel code utilizing SPMD and MPMD parallelism is generated. Using this approach, we showed that we can generate code for boundary handling that has constant performance independent of the selected boundary handling mode, while the performance of other solutions varies significantly. Filter masks are stored in constant memory to avoid unnecessary recalculations. To determine a good configuration for the generated kernels, we presented a heuristic that takes boundary handling metadata, the resource usage of kernels, as well as hardware capabilities and limitations into account. The resulting kernel configuration and tiling minimizes the number of threads executing code for boundary handling. Moreover, the code generated by our framework is typically even faster than manual implementations and those relying on hardware support for boundary handling. In an experimental analysis, we even outperform implementations from RapidMind, a commercial framework for multi-core and GPU programming, and achieve results similar to the GPU backend of the widely used image processing library OpenCV. The presented HIPAcc framework is available as open source at https://sourceforge.net/projects/hipacc.

ACKNOWLEDGMENTS

We thank Anton Lokhmotov and the anonymous reviewers for their helpful comments.

REFERENCES

[1] S. Ryoo, C. Rodrigues, S. Stone, J. Stratton, S. Ueng, S. Baghsorkhi, and W. Hwu, "Program Optimization Carving for GPU Computing," Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1389-1401, Oct. 2008.
[2] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming," Parallel Computing, 2011.
[3] L. Howes, A. Lokhmotov, A. Donaldson, and P. Kelly, "Towards Metaprogramming for Parallel Systems on a Chip," in Proceedings of the 3rd Workshop on Highly Parallel Processing on a Chip (HPPC). Springer, Aug. 2009.
[4] R. Membarth, A. Lokhmotov, and J. Teich, "Generating GPU Code from a High-level Representation for Image Processing Kernels," in Proceedings of the 5th Workshop on Highly Parallel Processing on a Chip (HPPC). Springer, Aug. 2011.
[5] Clang, "Clang: A C Language Family Frontend for LLVM," http://clang.llvm.org, 2007-2012.
[6] C. Tomasi and R. Manduchi, "Bilateral Filtering for Gray and Color Images," in Proceedings of the Sixth International Conference on Computer Vision (ICCV). IEEE, Jan. 1998, pp. 839-846.
[7] D. Kunz, K. Eck, H. Fillbrandt, and T. Aach, "A Nonlinear Multi-resolution Gradient Adaptive Filter for Medical Images," in Proceedings of SPIE Medical Imaging 2003: Image Processing. SPIE, Feb. 2003, pp. 732-742.
[8] H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU Microarchitecture through Microbenchmarking," in Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2010, pp. 235-246.
[9] P. Thoman, K. Kofler, H. Studt, J. Thomson, and T. Fahringer, "Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design," in Proceedings of the 17th International European Conference on Parallel and Distributed Computing (Euro-Par). Springer, Aug. 2011.
[10] RapidMind Inc., RapidMind Development Platform Documentation, Jun. 2009.
[11] Willow Garage, "Open Source Computer Vision (OpenCV)," http://opencv.willowgarage.com/wiki, 1999-2012.
[12] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule, "Shader Algebra," ACM Transactions on Graphics, vol. 23, no. 3, pp. 787-795, Aug. 2004.
[13] C. Newburn, B. So, Z. Liu, M. McCool, A. Ghuloum, S. Du Toit, Z. Wang, Z. Du, Y. Chen, G. Wu, P. Guo, Z. Liu, and D. Zhang, "Intel's Array Building Blocks: A Retargetable, Dynamic Compiler and Embedded Language," in Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, Apr. 2011, pp. 224-235.
[14] R. Dolbeau, S. Bihan, and F. Bodin, "HMPP: A Hybrid Multi-core Parallel Programming Environment," in Proceedings of the Workshop on General Purpose Processing on Graphics Processing Units (GPGPU), Oct. 2007.
[15] M. Wolfe, "Implementing the PGI Accelerator Model," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU). ACM, Mar. 2010, pp. 43-50.
[16] T. Han and T. Abdelrahman, "hiCUDA: High-level GPGPU Programming," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 78-90, Jan. 2011.
[17] S. Ueng, M. Lathara, S. Baghsorkhi, and W. Hwu, "CUDA-lite: Reducing GPU Programming Complexity," in Languages and Compilers for Parallel Computing (LCPC), LNCS vol. 5335. Springer, 2008, pp. 1-15.
[18] J. Cornwall, L. Howes, P. Kelly, P. Parsonage, and B. Nicoletti, "High-Performance SIMT Code Generation in an Active Visual Effects Library," in Proceedings of the 6th ACM Conference on Computing Frontiers (CF). ACM, May 2009, pp. 175-184.
