Generating Device-Specific GPU Code For Local Operators in Medical Imaging
Richard Membarth, Frank Hannig, and Jürgen Teich
Department of Computer Science,
University of Erlangen-Nuremberg, Germany.
{richard.membarth,hannig,teich}@cs.fau.de

Mario Körner and Wieland Eckert
Siemens Healthcare Sector, H IM AX,
Forchheim, Germany.
{mario.koerner,wieland.eckert}@siemens.com
Abstract—To cope with the complexity of programming GPU accelerators for medical imaging computations, we developed a framework to describe image processing kernels in a domain-specific language, which is embedded into C++. The description uses decoupled access/execute metadata, which allow the programmer to specify both execution constraints and memory access patterns of kernels. A source-to-source compiler translates this high-level description into low-level CUDA and OpenCL code with automatic support for boundary handling and filter masks. Taking the annotated metadata and the characteristics of the parallel GPU execution model into account, two-layered parallel implementations—utilizing SPMD and MPMD parallelism—are generated. An abstract hardware model of graphics card architectures makes it possible to model GPUs of multiple vendors like AMD and NVIDIA, and to generate device-specific code for multiple targets. It is shown that the generated code is faster than manual implementations and those relying on hardware support for boundary handling. Implementations from RapidMind, a commercial framework for GPU programming, are outperformed, and similar results are achieved compared to the GPU backend of the widely used image processing library OpenCV.

Keywords-GPU; CUDA; OpenCL; domain-specific language; code generation; medical imaging; local operators

I. INTRODUCTION

Computer systems are increasingly heterogeneous, as many important computational tasks, such as multimedia processing, can be accelerated by special-purpose processors that outperform general-purpose processors by one or two orders of magnitude, in terms of energy efficiency as well as in terms of execution speed.

Until recently, every accelerator vendor provided their own application programming interface (API), typically based on the C language. For example, NVIDIA's API, called CUDA, targets systems accelerated with Graphics Processing Units (GPUs). In CUDA, the programmer dispatches compute-intensive data-parallel functions (kernels) to the GPU, and manages the interaction between the CPU and the GPU via API calls. Ryoo et al. [1] highlight the complexity of CUDA programming, in particular the need to explore thoroughly the space of possible implementations and configuration options. OpenCL, a newer industry-backed standard API that inherits many traits from CUDA, aims to provide software portability across heterogeneous systems: correct OpenCL programs will run on any standard-compliant implementation. OpenCL per se, however, does not address the problem of performance portability; that is, OpenCL code optimized for one accelerator device may perform dismally on another, since performance may depend significantly on low-level details such as data layout and iteration space mapping [2], [3]. CUDA and OpenCL are low-level programming languages, since the user has to take care of parallelizing the algorithm as well as of the communication between the host and the GPU accelerator. In general, low-level programming increases the cost of software development and maintenance: whilst low-level languages can be robustly compiled into efficient machine code, they effectively lack support for creating portable and composable software.

High-level languages with domain-specific features are more attractive to domain experts, who do not necessarily wish to become target system experts. To compete with low-level languages for programming accelerated systems, however, domain-specific languages should have an acceptable performance penalty.

We present the Heterogeneous Image Processing Acceleration (HIPAcc) framework for medical image processing in the domain of angiography, which allows programmers to concentrate on developing algorithms and applications rather than on mapping them to the target hardware. Specifically, as design entry to our framework, we defined a domain-specific language to describe image processing kernels on an abstract level. We identified three groups of operators that have to be supported by the domain-specific language for medical imaging: a) point operators, b) local operators, and c) global operators. Point operators are applied to the pixels of the image, and solely the pixel the point operator is applied to contributes to the operation (e. g., adding a constant to each pixel in the image). Local operators are similar to point operators, but neighboring pixels also contribute to the operation (e. g., replacing the current pixel value with the average value of its neighbors). In contrast, global operators (e. g., reduction operators) produce one output for the operator applied to all pixels of the image (e. g., computing the sum of all pixels). While the basic framework supporting point operators was presented earlier (see [4]), in this paper we focus on the support of local operators and their efficient mapping to target hardware.

Our framework is implemented as a library of C++ classes and is summarized in Section II. The DSL of our framework is extended to describe local operators (Section III). Our Clang-based compiler translates this description into efficient low-level CUDA and OpenCL code as well as the corresponding code to talk to the GPU accelerator (Section IV). The transformations applied to the target code and the run-time configuration are device-specific and take an abstract architecture model of the target graphics card hardware into account (Section V). This results in efficient code that is evaluated against highly optimized GPU frameworks from RapidMind and OpenCV, as well as against hand-written implementations (Section VI). In Section VII, we compare our work against other approaches. We outline future extensions to our framework for code generation optimizations in Section VIII. Finally, we conclude our paper in Section IX.

II. IMAGE PROCESSING FRAMEWORK

In this paper, we use and extend our previous framework for describing image processing kernels [4]. The proposed framework uses a source-to-source compiler based on Clang [5] in order to generate low-level, optimized CUDA and OpenCL code for execution on GPU accelerators. The framework consists of built-in C++ classes that describe the following four basic components required to express image processing on an abstract level:

• Image: Describes data storage for the image pixels. Each pixel can be stored as an integer number, a floating point number, or in another format such as RGB, depending on the instantiation of this templated class. The data layout is handled internally using multi-dimensional arrays.
• Iteration Space: Describes a rectangular region of interest in the output image, for example the complete image. Each pixel in this region is a point in the iteration space.
• Kernel: Describes an algorithm to be applied to each pixel in the Iteration Space.
• Accessor: Describes which pixels of an Image are seen within the Kernel. Similar to an Iteration Space, the Accessor defines an iteration space on an input image.

These components are an instance of decoupled access/execute metadata [3]: the Iteration Space specification provides ordering and partitioning constraints (execute metadata); the Kernel specification provides a pattern of accesses to uniform memory (access metadata). Currently, the access/execute metadata is mostly implicit: we assume that the iteration space is independent in all dimensions and has a 1:1 mapping to work-items (threads), and that the memory access pattern is obvious from the kernel code.

A. Example: Bilateral Filter

In the following, we illustrate our image processing framework using a bilateral filter that smoothes an image while preserving edges within the image [6]. The bilateral filter applies a local operator to each pixel that consists of two components to determine the weighting of pixels: a) the closeness function, which considers the distance to the center pixel, and b) the similarity function, which takes the difference between pixel values into account. Gaussian functions of the Euclidean distance are used for the closeness and similarity weights, as seen in Equation (1) and (2), respectively.

$\mathrm{closeness}((x,y),(x',y')) = e^{-\frac{1}{2}\left(\frac{\|(x,y)-(x',y')\|}{\sigma_d}\right)^2}$   (1)

$\mathrm{similarity}(p,p') = e^{-\frac{1}{2}\left(\frac{\|p-p'\|}{\sigma_r}\right)^2}$   (2)

The bilateral filter replaces each pixel by a weighted average of geometrically nearby and photometrically similar pixel values. Only pixels within the neighborhood of the relevant pixel are used. The neighborhood, and consequently also the convolution size, is determined by the geometric spread σd. The parameter σr (photometric spread) in the similarity function determines the amount of combination: differences of pixel values below σr are considered more similar than differences above σr.

For each pixel in the output image, the filter kernel is applied to a neighborhood of pixels in the input image. When executed on a GPU, the operations required to produce one pixel in the output image are specified in a program. Such a program is also called a kernel; it is executed on the GPU in parallel and applied to each pixel of the image independently. The pseudo code for the bilateral filter executed on GPUs is shown in Algorithm 1. Here, the kernel itself is described in lines 3–13, defining the bilateral filter for one pixel, taking the pixels in its neighborhood into account. This code is executed sequentially, as can be seen in the doubly nested innermost loops. This is the code that is typically written by a programmer in either CUDA or OpenCL. The parallel execution on the graphics hardware is expressed by the parallel foreach loops in lines 1 and 2. While the first parallel foreach loop (line 1) expresses the parallel execution on all available groups of processors, the second (line 2) expresses the data-parallel execution within a group of processors on the GPU (think of it as a Single Instruction Multiple Data (SIMD) unit). Note that this is a two-layered approach exploiting two different types of parallelism: the Single Program Multiple Data (SPMD) parallel execution model within groups of processors, and the Multiple Program Multiple Data (MPMD) parallel execution between multiple groups of processors. Although the graphics hardware supports the execution of different programs on different SIMD units, programs written in CUDA and OpenCL contain the same code for all SIMD units.

Algorithm 1: Bilateral filter implementation on the graphics card.
 1  foreach thread block b in grid g do in parallel
 2    foreach thread t in thread block b do in parallel
 3      x ← get_index_x(b, t);
 4      y ← get_index_y(b, t);
 5      p, k ← 0;
 6      for yf = −2 · sigma_d to +2 · sigma_d do
 7        for xf = −2 · sigma_d to +2 · sigma_d do
 8          c ← closeness((x, y), (x + xf, y + yf));
 9          s ← similarity(input[x, y], input[x + xf, y + yf]);
10          k ← k + c · s;
11          p ← p + c · s · input[x + xf, y + yf];
12        end
13      end
14      output[x, y] ← p/k;
15    end
16  end

To express this filter in our framework, the programmer
derives a class from the built-in Kernel class and implements the virtual kernel function, as shown in Listing 1. To access the pixels of an input image, the parenthesis operator () is used, taking the column (dx) and row (dy) offsets as optional parameters. The output image, as specified by the Iteration Space, is accessed using the output() method provided by the built-in Kernel class. The user instantiates the class with input image accessors, one iteration space, and other parameters that are member variables of the class.

 1  class BilateralFilter : public Kernel<float> {
 2    private:
 3      Accessor<float> &Input;
 4      int sigma_d, sigma_r;
 5
 6    public:
 7      BilateralFilter(IterationSpace<float> &IS, Accessor<float> &Input, int sigma_d, int sigma_r) :
 8        Kernel(IS), Input(Input), sigma_d(sigma_d), sigma_r(sigma_r)
 9      { addAccessor(&Input); }
10
11      void kernel() {
12        float c_r = 1.0f/(2.0f*sigma_r*sigma_r);
13        float c_d = 1.0f/(2.0f*sigma_d*sigma_d);
14        float c = 0.0f, d = 0.0f, p = 0.0f, s = 0.0f;
15
16        for (int yf = -2*sigma_d; yf<=2*sigma_d; yf++) {
17          for (int xf = -2*sigma_d; xf<=2*sigma_d; xf++) {
18            float diff = Input(xf, yf) - Input();
19
20            s = exp(-c_r * diff*diff);
21            c = exp(-c_d * xf*xf) * exp(-c_d * yf*yf);
22            d += s*c;
23            p += s*c * Input(xf, yf);
24          }
25        }
26        output() = p/d;
27      }
28  };

Listing 1: Kernel description of the bilateral filter.

In Listing 2, the input and output Image objects IN and OUT are defined as two-dimensional W × H grayscale images, having pixels represented as floating-point numbers (lines 8–9). The Image object IN is initialized with the host_in pointer to a plain C array with raw image data, which invokes the = operator of the Image class (line 12). The region of interest IsOut contains the whole image (line 15). The Accessor AccIn on the input image is likewise defined on the whole image (line 18). The kernel is initialized with the iteration space object, accessor objects, and the parameters sigma_d and sigma_r for the bilateral filter (line 21), and executed by a call to the execute() method (line 24). To retrieve the output image, the host_out pointer to a plain C data array is assigned the Image object OUT, which invokes the getData() method (line 27).

 1  const int width = 1024, height = 1024, sigma_d = 3, sigma_r = 5;
 2
 3  // pointers to raw image data
 4  float *host_in = ...;
 5  float *host_out = ...;
 6
 7  // input and output images
 8  Image<float> IN(width, height);
 9  Image<float> OUT(width, height);
10
11  // initialize input image
12  IN = host_in; // operator=
13
14  // define region of interest
15  IterationSpace<float> IsOut(OUT);
16
17  // accessor used to access image pixels
18  Accessor<float> AccIn(IN);
19
20  // define kernel
21  BilateralFilter BF(IsOut, AccIn, sigma_d, sigma_r);
22
23  // execute kernel
24  BF.execute();
25
26  // retrieve output image
27  host_out = OUT.getData();

Listing 2: Example code that initializes and executes the bilateral filter.

From this code, the source-to-source compiler can finally create CUDA and OpenCL code for execution on GPU accelerators automatically. In [4], we have presented an initial version of our framework, that is, how to generate efficient code for point operators with support for effective device-specific optimizations such as global memory padding for memory coalescing and optimal memory bandwidth utilization. In the remainder of this paper, we describe the extension of the framework that utilizes access/execute metadata for efficient code generation for local operators in medical imaging and the optimal mapping of this code to different GPU accelerator architectures from AMD and NVIDIA.

III. LOCAL OPERATORS

In image processing, and in particular in medical imaging, local operators are widely used. These operators can be applied independently to all pixels of the image and depend only on the pixel itself and the pixels in its neighborhood. The neighborhood read by local operators in medical imaging is typically arranged symmetrically around the pixel to be calculated, with the same diameter in each direction. Furthermore, local operators describe the convolution of an image by a filter mask (e. g., a Gaussian function) with the following properties: a) the filter mask is centered at the pixel [0, 0] it is applied to, and b) it is bounded to the neighborhood [−m, +m] × [−n, +n]. The latter implies that the window size (2m + 1) × (2n + 1) of local operators is uneven (e. g., 3 × 3, 5 × 5, 9 × 3). Since the filter mask used for convolution is typically constant, the values of the filter mask can be stored in a lookup table. Figure 1 shows the filter masks used for a window size of 3 × 3: the pixels in the neighborhood (f) are convolved with the closeness (c) and similarity (s) filter masks. While the filter mask for the closeness component is constant and depends only on the distance to the center pixel, the filter mask for the similarity component depends on the pixel values and has to be calculated for each pixel separately.
A. Boundary Handling

Table I: Supported boundary handling modes.

Figure 2: Boundary handling modes for image processing. By default, the behavior is undefined when the image is accessed out of bounds (a). The framework allows the programmer to specify different boundary handling modes like repeating the image (b), clamping to the last valid pixel (c), mirroring the image at the image border (d), and returning a constant value when accessed out of bounds (e).

… most appropriate one for the current algorithm—without the need to keep separate copies of the image.

B. Filter Masks

The framework also provides support for filter masks for local operators: a Mask holds the precalculated values used by the convolution filter function. Since the filter mask is constant for one kernel, this allows the source-to-source compiler to apply optimizations such as constant propagation to avoid redundant calculations. To define a Mask in the framework, the filter mask size has to be provided. Afterwards, the …

…
 4  // filter mask
 5  Mask<float> CMask(4*sigma_d+1, 4*sigma_d+1);
 6  CMask = mask;

Listing 4: Usage of a Mask to store the filter mask for a local operator.

Listing 5 shows the usage of the Mask within the kernel of the BilateralFilter: using the indices xf and yf, the precalculated values are retrieved from the closeness filter mask. Also note that the calculation of c_d is no longer necessary.

 1  void kernel() {
 2    float c_r = 1.0f/(2.0f*sigma_r*sigma_r);
 3    float c = 0.0f, d = 0.0f, p = 0.0f, s = 0.0f;
 4
 5    for (int yf = -2*sigma_d; yf<=2*sigma_d; yf++) {
 6      for (int xf = -2*sigma_d; xf<=2*sigma_d; xf++) {
 7        float diff = Input(xf, yf) - Input();
 8
 9        s = exp(-c_r * diff*diff);
10        c = CMask(xf, yf);
11        d += s*c;
12        p += s*c * Input(xf, yf);
13      }
14    }
15    output() = p/d;
16  }

Listing 5: Using a Mask within the kernel to retrieve the filter mask coefficients for a local operator.

IV. SOURCE-TO-SOURCE COMPILATION

This section describes the transformations applied by our source-to-source compiler and the steps taken to create CUDA and OpenCL code from a high-level description of local operators and access metadata. Based on the latest Clang compiler framework [5], our source-to-source compiler uses the Clang frontend for C/C++ to parse the input files and to generate an abstract syntax tree (AST) representation of the source code. Our backend uses this AST representation to apply transformations and to generate host and device code in CUDA or OpenCL.

A. Local Operators

Since local operators and convolution functions typically read neighboring pixels to calculate the value of the result pixel, most of the neighboring pixels read for pixel p_{i,j} are also required for pixel p_{i+1,j}. This results in redundant fetches of the same pixels and imposes high pressure on the global memory bandwidth. To preserve global memory bandwidth, the region read by a group of threads can be a) staged into fast on-chip memory (scratchpad memory) and read from there afterwards, or b) accessed by a memory path that traverses a cache. In the latter case, only the first access to a pixel has the long latency of global memory. Subsequent accesses are served by the cache. In the former case, the data is staged
into scratchpad memory, and memory accesses of the kernel go to the fast scratchpad memory. However, synchronization is required before the calculation can begin. This separates data transfer and calculation phases within the kernel. The benefit of the massive multithreading provided by the underlying hardware is to hide memory transfers when data transfers and calculations are done at the same time. This benefit is lost when data is staged to scratchpad memory. Hence, staging to scratchpad memory only makes sense in case the benefit of data reuse exceeds the multithreading benefit. For local operators with small window sizes, this is rarely the case. Nonetheless, our source-to-source compiler supports both options.

Texturing memory: All graphics cards from AMD and NVIDIA support cached data access, either using texturing hardware or by default (on newer Fermi GPUs from NVIDIA). In CUDA, texturing hardware can be utilized by reading from a texture reference that is bound to global memory. In OpenCL, the texturing hardware is used when data is read from an image object. Therefore, accesses to Image(x, y) objects have to be mapped to the corresponding tex1Dfetch() and read_imagef() functions in CUDA and OpenCL, respectively. However, this is only valid if data is read. When data is written to an Image object, normal global memory array pointers are used in CUDA and the write_imagef() function in OpenCL. That is, prior to the mapping of Image accesses to the low-level equivalents, we perform a read/write analysis of the kernel method. To this end, a control-flow graph (CFG) of the instructions in the kernel method is created and traversed afterwards. Access information is stored for each Image and Accessor object and used to select the appropriate texturing function call. This results in a mapping of read/write accesses as shown in Listing 6. The mapping is done using a recursive AST visitor. Whenever an Image or Accessor node is visited, the transformations described above are applied.

 1  // Read-access to Accessor
 2  IN(xf, yf)
 3  // CUDA read with offset
 4  tex1Dfetch(_texIN, gid_x+xf + (gid_y+yf)*stride)
 5  // OpenCL read with offset
 6  read_imagef(imgIN, Sampler, (int2)(gid_x+xf, gid_y+yf)).x
 7
 8  // Write-access to output() method
 9  output()
10  // CUDA write without offset
11  OUT[gid_x + gid_y*stride]
12  // OpenCL write without offset
13  write_imagef(imgOUT, (int2)(gid_x, gid_y), (float4) val);

Listing 6: Using texturing hardware to read from an Accessor.

The image object access functions in OpenCL always take and return vector elements of size four, although only one of the four components is required for the example above. Therefore, the CL_R channel order is used, which maps only one of the four components to memory and populates the remaining three channels with zeros. The corresponding extraction from and packing into vector elements is added by the framework as well. Once the accesses to Image objects are mapped, the low-level code can be emitted. When this is done, the corresponding CUDA texture reference and OpenCL sampler definitions are also created. The OpenCL kernel-function parameters for images are emitted with the corresponding read_only and write_only attributes obtained from the read/write analysis. Texture references are not added as kernel-function parameters in CUDA since they are static and globally visible in CUDA.

Scratchpad memory: Current graphics cards provide fast on-chip scratchpad memory, also called shared memory in CUDA and local memory in OpenCL, that is shared between all threads of a SIMD unit. Adding the __shared__ (CUDA) and __local (OpenCL) keywords to memory declarations within a kernel allows this memory to be used. Since the memory is shared between all threads mapped to one SIMD unit (number of threads » SIMD width), synchronization between the threads mapped to a SIMD unit is required so that all threads have a consistent view of the scratchpad memory. This is done by the __syncthreads() (CUDA) and barrier() (OpenCL) functions. Only after all threads have reached this synchronization point does execution continue. Using scratchpad memory includes two phases: first, the data is staged from the GPU memory into scratchpad memory and second, data accesses are redirected to the scratchpad. In Listing 7, the size of the scratchpad memory depends on BSY/BSX, the size of the 2D image subregion mapped to the SIMD unit, as well as on SY/SX, the image region accessed beyond the subregion within the local operator. A constant of 1 is added to BSX so that different banks of the scratchpad memory are accessed for row-based filters to avoid bank conflicts. The data from global memory is staged into scratchpad memory in multiple steps, depending on the size of additional pixels required by the kernel. When data is read from the scratchpad memory, the thread identifiers of the threads mapped to one SIMD unit, threadIdx and get_local_id(), are used for the CUDA and OpenCL code, respectively.

 1  // Phase 1: stage data to scratchpad memory
 2  // CUDA
 3  __shared__ float _smemIN[SY + BSY][SX + BSX + 1];
 4  _smemIN[threadIdx.y][threadIdx.x] = IN[..];
 5  if (...) _smemIN[threadIdx.y + ..][threadIdx.x + ..] = IN[..];
 6  ..
 7  __syncthreads();
 8  // OpenCL
 9  __local float _smemIN[SY + BSY][SX + BSX + 1];
10  _smemIN[get_local_id(1)][get_local_id(0)] = IN[..];
11  if (...) _smemIN[get_local_id(1) + ..][get_local_id(0) + ..] = IN[..];
12  ..
13  barrier(CLK_LOCAL_MEM_FENCE);
14
15  // Phase 2: map accesses to scratchpad memory
16  IN(xf, yf)
17  // CUDA read with offset
18  _smemIN[threadIdx.y + yf][threadIdx.x + xf]
19  // OpenCL read with offset
20  _smemIN[get_local_id(1) + yf][get_local_id(0) + xf]

Listing 7: Staging pixels to scratchpad memory before accessing image pixels.
B. Boundary Handling

Boundary handling for image processing kernels in CUDA/OpenCL is typically done with the help of texturing hardware. When a texture reference/sampler is defined, an … handling code has to be added by the programmer. That is, … address pixels within the image. However, this goes along with …
220 using the shift() method on input data. Since there are no
details on code generation for border handling publicly avail-
able for RapidMind, we can compare our approach only
200 quantitatively with the one of RapidMind.
In 2009, Intel acquired RapidMind and incorporated the
RapidMind technology into Intel ArBB (Array Building
180
Blocks) [13]. Since then, RapidMind is discontinued as is Sh.
The focus of Intel’s ArBB is on vector parallel programming
and for that reason, image processing features of RapidMind
Optimum with 32 × 6: 167.94 ms like generic boundary handling support were not adopted.
160
128 256 384 512 640 768 896 1024 In ArBB, only a constant value is returned when arrays are
number of threads (blocksize) accessed out of bounds. To access neighboring elements, the
current processed element and the offset is passed to the
Figure 4: Configuration space exploration for the bilateral filter neighbor() function. Using the position() function, the position
(filter window size: 13 × 13) for an image of 4096 × 4096 on within the n-dimensional iteration space can be retrieved
the Tesla C2050. Shown are the execution times for processing and used to implement more sophisticated boundary handling
the bilateral filter in dependence on the blocksize. modes. However, this comes along with large overheads on
GPUs that we remedy by exploiting multiple levels of paral-
lelism in our code-generation backend. When merging Rapid-
C. Discussion Mind technology with Intel’s Ct, the backend for graphics
Although the presented framework focuses on programma- cards was dropped and is not supported anymore.
bility for domain experts in medical imaging, it offers decent Beside language based frameworks, there exist library based
performance on GPUs from different manufacturers. The do- frameworks like OpenCV [11] and the NVIDIA Performance
main experts can express algorithms in a high-level language Primitives (NPP) library. These libraries allow to use prede-
tailored to their domain. This allows high productivity and the fined kernels. However, to offload new algorithms that are not
mapping to different target hardware platforms from the same available, low-level code has to be written.
algorithm description. The development time for the manual Other compiler based approaches allow also to offload code
implementations is in the range of several days (for non-GPU to GPU accelerators. The input to such compilers is typically
experts even weeks), while the DSL description takes only a sequential C or basic CUDA as well as annotations describing
couple of minutes. transformations applied to the code. Examples are HMPP
For example, the source-to-source compiler generates a Workbench [14], PGI Accelerator [15], hiCUDA [16], and
CUDA kernel with 317 lines of code for the kernel description CUDA-lite [17], just to name a few. In order to obtain a decent
shown in Listing 5 (16 lines of code). This comes from 9 different kernel implementations for the top, top-right, right, etc. image borders plus index adjustments for boundary handling. In addition, the generated code depends on the filter window size and image size. Writing such code by hand is often error-prone and tedious.

VII. RELATED WORK

The work closest to ours is the RapidMind multi-core development platform [10], targeting standard multi-core processors as well as accelerators like the Cell B.E. and GPUs. The RapidMind technology is based on Sh [12], a high-level metaprogramming language for graphics cards. RapidMind provides its own data types that can be arranged in multi-dimensional arrays. Accessors can be used to define boundary handling properties of the underlying data. A two-dimensional array in RapidMind corresponds to an Image object in our framework. In addition to the boundary handling modes supported in RapidMind, we also support mirroring at the image border, a widely used boundary handling mode in medical imaging.

To achieve good performance using these compiler-based approaches, the programmer has to know which compiler transformations can be applied and how to rewrite code to make such transformations possible. Algorithm designers and domain experts, however, have little knowledge of the underlying hardware and compiler transformations. As a consequence, the full potential of such frameworks is only rarely exploited. In the proposed DSL, by contrast, the required metadata is implicitly given by the DSL syntax and does not have to be specified separately.

Our framework is most similar in spirit to Cornwall et al.'s work on indexed metadata for visual effects [18], but introduces additional device-specific optimizations such as global memory padding for memory coalescing, support for boundary handling, and a heuristic for automatic kernel configuration selection.

The main contributions of this work include a) a domain-specific description of local operators in medical imaging, b) a new code generation framework that utilizes a two-layered parallelization approach exploiting both SPMD and MPMD parallelism on current graphics card architectures, and c)
a heuristic for automatic kernel configuration selection and tiling. The presented approach is not limited to algorithms stemming from the medical domain, but can also be utilized in other application domains, in particular the two-layered parallelization approach.

VIII. OUTLOOK

The current compiler optimizations for local operators can be further extended to unroll the loops of convolutions and to propagate the constants of the filter masks. To do so, we defined a syntax using lambda functions as seen in Listing 9. However, Clang, on which our source-to-source compiler is based, does not yet support lambda functions. As soon as this support is available, we will also be able to support constant propagation and loop unrolling. For global operators, we are looking for a similar syntax that allows the programmer to define operations that merge/reduce two pixels.

void kernel() {
  output() = convolve(cMask, SUM, [&] () {
    return cMask()*Input(cMask);
  });
}

Listing 9: Using a lambda function and the framework-provided convolve function to express convolution kernels.

Furthermore, we are looking into vectorization for graphics cards from AMD so that the impact of the transformations described in this paper becomes visible. First manual vectorization shows that the performance improves significantly on graphics cards from AMD.

IX. CONCLUSIONS

We presented a domain-specific description for local operators in medical imaging and its efficient mapping to low-level CUDA and OpenCL code. Based on the metadata provided by the programmer, two-layered parallel code utilizing SPMD and MPMD parallelism is generated. Using this approach, we showed that we can generate code for boundary handling whose performance is constant regardless of the selected boundary handling mode, while the performance of other solutions varies significantly. Filter masks are stored in constant memory to avoid unnecessary recalculations. To determine a good configuration for the generated kernels, we presented a heuristic that takes boundary handling metadata, the resource usage of kernels, as well as hardware capabilities and limitations into account. The resulting kernel configuration and tiling minimizes the number of threads executing code for boundary handling. Moreover, the code generated by our framework is typically even faster than manual implementations and those relying on hardware support for boundary handling. In an experimental analysis, we outperform even implementations from RapidMind, a commercial framework for multi-core and GPU programming, and achieve results similar to the GPU backend of the widely used image processing library OpenCV. The presented HIPAcc framework is available as open source at https://sourceforge.net/projects/hipacc.

ACKNOWLEDGMENTS

We thank Anton Lokhmotov and the anonymous reviewers for their helpful comments.

REFERENCES

[1] S. Ryoo, C. Rodrigues, S. Stone, J. Stratton, S. Ueng, S. Baghsorkhi, and W. Hwu, "Program Optimization Carving for GPU Computing," Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1389–1401, Oct. 2008.
[2] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming," Parallel Computing, 2011.
[3] L. Howes, A. Lokhmotov, A. Donaldson, and P. Kelly, "Towards Metaprogramming for Parallel Systems on a Chip," in Proceedings of the 3rd Workshop on Highly Parallel Processing on a Chip (HPPC). Springer, Aug. 2009, pp. 36–45.
[4] R. Membarth, A. Lokhmotov, and J. Teich, "Generating GPU Code from a High-level Representation for Image Processing Kernels," in Proceedings of the 5th Workshop on Highly Parallel Processing on a Chip (HPPC). Springer, Aug. 2011.
[5] Clang, "Clang: A C Language Family Frontend for LLVM," http://clang.llvm.org, 2007–2012.
[6] C. Tomasi and R. Manduchi, "Bilateral Filtering for Gray and Color Images," in Proceedings of the Sixth International Conference on Computer Vision (ICCV). IEEE Computer Society, Jan. 1998, pp. 839–846.
[7] D. Kunz, K. Eck, H. Fillbrandt, and T. Aach, "Nonlinear Multiresolution Gradient Adaptive Filter for Medical Images," in Proceedings of SPIE Medical Imaging 2003: Image Processing, vol. 5032. SPIE, Feb. 2003, pp. 732–742.
[8] H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU Microarchitecture through Microbenchmarking," in Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2010, pp. 235–246.
[9] P. Thoman, K. Kofler, H. Studt, J. Thomson, and T. Fahringer, "Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design," in Proceedings of the 17th International European Conference on Parallel and Distributed Computing (Euro-Par). Springer, Aug. 2011, pp. 438–452.
[10] RapidMind, RapidMind Development Platform Documentation, RapidMind Inc., Jun. 2009.
[11] Willow Garage, "Open Source Computer Vision (OpenCV)," http://opencv.willowgarage.com/wiki, 1999–2012.
[12] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule, "Shader Algebra," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 787–795, Aug. 2004.
[13] C. Newburn, B. So, Z. Liu, M. McCool, A. Ghuloum, S. Du Toit, Z. Wang, Z. Du, Y. Chen, G. Wu, P. Guo, Z. Liu, and D. Zhang, "Intel's Array Building Blocks: A Retargetable, Dynamic Compiler and Embedded Language," in Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, Apr. 2011, pp. 224–235.
[14] R. Dolbeau, S. Bihan, and F. Bodin, "HMPP: A Hybrid Multi-core Parallel Programming Environment," in Proceedings of the 1st Workshop on General Purpose Processing on Graphics Processing Units (GPGPU), Oct. 2007.
[15] M. Wolfe, "Implementing the PGI Accelerator Model," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU). ACM, Mar. 2010, pp. 43–50.
[16] T. Han and T. Abdelrahman, "hiCUDA: High-level GPGPU Programming," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 78–90, Jan. 2011.
[17] S. Ueng, M. Lathara, S. Baghsorkhi, and W. Hwu, "CUDA-lite: Reducing GPU Programming Complexity," Languages and Compilers for Parallel Computing, vol. 5335, pp. 1–15, 2008.
[18] J. Cornwall, L. Howes, P. Kelly, P. Parsonage, and B. Nicoletti, "High-Performance SIMT Code Generation in an Active Visual Effects Library," in Proceedings of the 6th ACM Conference on Computing Frontiers (CF). ACM, May 2009, pp. 175–184.