
Computer Methods and Programs in Biomedicine 193 (2020) 105431


Accelerating B-spline interpolation on GPUs: Application to medical image registration

Orestis Zachariadis a,∗, Andrea Teatini b,c, Nitin Satpute a, Juan Gómez-Luna d, Onur Mutlu d, Ole Jakob Elle b,c, Joaquín Olivares a

a Department of Electronics and Computer Engineering, Universidad de Cordoba, Córdoba, Spain
b The Intervention Centre, Oslo University Hospital - Rikshospitalet, Oslo, Norway
c Department of Informatics, University of Oslo, Oslo, Norway
d Department of Computer Science, ETH Zurich, Zurich, Switzerland

Article history: Received 7 November 2019; Revised 14 February 2020; Accepted 2 March 2020

Keywords: Medical image registration; Medical image processing; Parallel computing; GPU; B-splines

Abstract

Background and Objective: B-spline interpolation (BSI) is a popular technique in the context of medical imaging due to its adaptability and robustness in 3D object modeling. A field that utilizes BSI is Image Guided Surgery (IGS). IGS provides navigation using medical images, which can be segmented and reconstructed into 3D models, often through BSI. Image registration tasks also use BSI to transform medical imaging data collected before the surgery and intra-operative data collected during the surgery into a common coordinate space. However, such IGS tasks are computationally demanding, especially when applied to 3D medical images, due to the complexity and amount of data involved. Therefore, optimization of IGS algorithms is greatly desirable, for example, to perform image registration tasks intra-operatively and to enable real-time applications. A traditional CPU does not have sufficient computing power to achieve these goals and, thus, it is preferable to rely on GPUs. In this paper, we introduce a novel GPU implementation of BSI to accelerate the calculation of the deformation field in non-rigid image registration algorithms.

Methods: Our BSI implementation on GPUs minimizes the data that needs to be moved between memory and processing cores during loading of the input grid, and leverages the large on-chip GPU register file for reuse of input values. Moreover, we re-formulate our method as trilinear interpolations to reduce computational complexity and increase accuracy. To provide pre-clinical validation of our method and demonstrate its benefits in medical applications, we integrate our improved BSI into a registration workflow for compensation of liver deformation (caused by pneumoperitoneum, i.e., inflation of the abdomen) and evaluate its performance.

Results: Our approach improves the performance of BSI by an average of 6.5× and interpolation accuracy by 2× compared to three state-of-the-art GPU implementations. Through pre-clinical validation, we demonstrate that our optimized interpolation accelerates a non-rigid image registration algorithm, which is based on the Free Form Deformation (FFD) method, by up to 34%.

Conclusion: Our study shows that we can achieve significant performance and accuracy gains with our novel parallelization scheme that makes effective use of the GPU resources. We show that our method improves the performance of real medical imaging registration applications used in practice today.

© 2020 Published by Elsevier B.V.

1. Introduction

Image Guided Surgery (IGS) aims to provide surgeons with navigation capabilities to perform safer surgeries through better visualization [1]. IGS is created by combining medical images, such as Computed Tomography (CT) or Magnetic Resonance Imaging (MRI) [2], with surgical instrument tracking technologies [3]. However, the accuracy of image guided surgery is often undermined by organ deformations, especially in soft tissue surgeries. These deformations are difficult to account for due to their non-linear behaviour. Non-rigid registration is a technique that has been developed to reproduce and model such non-linear deformations [4].

∗ Corresponding author. E-mail addresses: [email protected] (O. Zachariadis), [email protected] (A. Teatini).

https://doi.org/10.1016/j.cmpb.2020.105431
0169-2607/© 2020 Published by Elsevier B.V.

Non-rigid registration through Free Form Deformation (FFD) [5], based on cubic B-spline interpolation (BSI) [5,6], is a state-of-the-art technique for non-rigid registration. FFD works by manipulating a grid of control points. The shape of a 3D object (e.g., an organ) underlying the control points can be changed by using a smooth and C² continuous transform (i.e., continuous up to second order derivatives). FFD uses BSI in the calculation of the deformation field.

BSI is one of the most computationally demanding parts of FFD [7]. Graphics Processing Units (GPUs) can help achieve the real-time requirements of IGS, namely FFD, as they offer massive computational performance in comparison to Central Processing Units (CPUs). GPUs deploy thousands of execution threads, which operate on large batches of data. GPUs provide higher throughput and power-efficiency than CPUs on multithreaded workloads [8]. The performance of medical imaging applications benefits significantly from GPUs [9–15,45].

For these reasons, several authors have used GPUs for BSI [6,16–19]. Sigg et al. [16] and Ruijters et al. [17] achieve a substantial reduction in the number of input samples by representing the weighted sums as trilinear interpolations. More recently, Ellingwood et al. [6] and Du et al. [18] use GPU implementations of BSI to improve the performance of image registration. They improve input sample loading by aligning the control grid with the voxel grid of the volume [6,18,19]. However, all these works suffer from the intensive data movement of a large number of input samples between the memory and the GPU, which is the main performance bottleneck of BSI implementations on a GPU [16].

Our goal in this work is to accelerate BSI on GPUs by alleviating the data movement bottleneck with optimization techniques that enable a more efficient use of the on-chip memory resources. To this end, we propose a GPU implementation of BSI with three key optimizations: a) a new workload partitioning scheme for GPU execution threads that reduces the number of memory accesses, b) a register-tiling approach that keeps input data close to the execution units, and c) the replacement of weighted summation with linear interpolations, which reduces the computational load and increases the accuracy.

In order to show how our approach affects the performance and accuracy of image registration in a realistic scenario, we integrate our technique (publicly available¹) into the FFD registration of NiftyReg [7]. NiftyReg is a lightweight medical image registration library. Recent works [20,21] use NiftyReg as a reference for registration.

We complete our study with a pre-clinical evaluation of our method. We use FFD with our GPU-accelerated BSI on 1) CT scans of a patient-specific liver phantom [47], and 2) MRI scans of a porcine liver model, to compensate for a non-rigid soft tissue deformation caused by pneumoperitoneum. Pneumoperitoneum is a surgical procedure to inflate the patient's abdomen, which is necessary for any abdominal laparoscopic surgery. Pneumoperitoneum, however, deforms the shape of the organs [22,23,35]. To account for this deformation, we capture new images during the surgery (intra-operative) and use non-rigid image registration to match them with images taken before pneumoperitoneum (pre-operative images). We compute non-rigid image registration for pneumoperitoneum with state-of-the-art implementations and with our BSI implementation. Using our implementation results in a performance increase with the same accuracy as using the state-of-the-art implementations.

¹ https://github.com/oresths/niftyreg_bsi

2. Background

In this section, we first introduce the foundations of B-spline interpolation. Since our GPU implementation of BSI is specific to 3D medical images (CT, MRI, or US volumes), formulations and analysis focus on the 3D case. Second, we review two state-of-the-art implementations of BSI on GPUs.

2.1. B-spline interpolation theory

We introduce B-spline interpolation for 3D images, i.e., the domain of the image volume is in the x, y, z coordinate space. As Eq. (1) shows [5,6], the BSI transformation of FFD for each voxel (i.e., each interpolated point of FFD) with coordinates x, y, z is T(x, y, z). The BSI transformation is a function of control points φ_{i,j,k}, which are arranged into a grid of dimensions n_x × n_y × n_z. The control point grid is uniformly spaced, with δ_x, δ_y, and δ_z being the spacing (in voxels) in the three dimensions.

    T(x, y, z) = Σ_{l=0}^{3} Σ_{m=0}^{3} Σ_{n=0}^{3} B_l(u) B_m(v) B_n(w) φ_{i+l, j+m, k+n}    (1)

where

    i = ⌊x/δ_x⌋ − 1,
    j = ⌊y/δ_y⌋ − 1,
    k = ⌊z/δ_z⌋ − 1,
    u = x/δ_x − ⌊x/δ_x⌋,
    v = y/δ_y − ⌊y/δ_y⌋,
    w = z/δ_z − ⌊z/δ_z⌋,

B are the scalar B-spline coefficients [17] and φ are the control points. Each voxel is affected by four control points in each dimension. Thus, in a 3D space, 4 × 4 × 4 control points, forming a cube (see Fig. 1), affect the inner tile of voxels. In general, in N-dimensional images, 4^N control points affect each voxel.

Fig. 1. The cube of 4 × 4 × 4 control points that affect a voxel/tile in a 3D control point grid. Smaller cubes depict the grouping in trilinear interpolations.

2.1.1. Tiles

Tiles are logical groups of voxels that share common properties. Based on Eq. (1), we define tiles of δ_x × δ_y × δ_z dimensions. Fig. 2 illustrates a tile in a 2D example. We make two observations: 1) the same control points, i.e., the ones surrounding the tile, affect all voxels inside the tile, and 2) control points of neighboring tiles overlap.

Fig. 2. A 2D space divided into tiles.

From the implementation perspective, partitioning a volume into tiles is a way of exploiting data reuse (i.e., reuse of control points) in on-chip memories when calculating the interpolated voxels. Thus, tiling saves memory traffic between off-chip and on-chip memories.

2.2. State-of-the-art GPU implementations of BSI

This section introduces the two state-of-the-art BSI methods and their respective GPU implementations, which we use as comparison points for our work.

Texture Hardware (TH). Ruijters et al. [17,24] provide a texture hardware method for BSI. They base their method on the observation that the weighted additions of Eq. (1) can be replaced by a linear interpolation [16,17]. Linear interpolations are well-suited for the GPU texture unit, which features a hardware interpolation unit. The hardware interpolation unit calculates the interpolation directly and does not require separate accesses to the off-chip global memory of the GPU to load the input control points. Hardware interpolation is fast, but it has two main drawbacks. First, it has only 8 bits of accuracy [8], which limits the resolution of the interpolation. Second, the values that the hardware interpolation unit fetches from the off-chip memory are a function of the absolute position of each voxel. Therefore, TH cannot utilize custom caching schemes to aggregate data transfers for neighboring voxels (Appendix A). Texture Hardware BSI is included in an easy-to-use library by Ruijters et al. [24] and is used in recent works [25,26].

Thread per Voxel (TV). This method assigns one thread per image element, e.g., per voxel in the case of 3D images. Ellingwood et al. [6] present a GPU implementation of this method that applies tiling (Section 2.1.1). They assign one or more thread blocks to each tile, with one thread for each voxel of the tile. Tiling enables the reuse of control points, which are the same for the whole tile, by keeping them in the fast on-chip shared memory.

NiftyReg [7], a lightweight open-source medical image registration library, also uses the thread-per-voxel method. NiftyReg contains optimized implementations of BSI for both CPUs and GPUs. It is open-source and well-maintained, with competitive performance against other state-of-the-art implementations [20,21]. The GPU implementation uses a simple, straightforward TV method, which does not take advantage of tiling. The CPU implementation, however, exploits tiling by applying multi-core and vectorization optimizations.

3. Optimizing B-spline interpolation

This section presents our GPU implementation of BSI, which follows a different approach to the state-of-the-art implementations (i.e., TH and TV). In our approach, we assign one thread per tile of voxels, as we explain in Sections 3.1–3.3. In Section 3.5, we introduce our implementations for CPUs, which follow the GPU approach partially.

3.1. Overview of our GPU implementation of BSI

Our GPU implementation of BSI is based on two key ideas. First, an entire tile of voxels is assigned to a single GPU thread (Thread per Tile, TT), in contrast to the one-thread, one-voxel approach. This TT assignment takes advantage of tiling in both on-chip cache memory and registers: 1) tiling in cache memory minimizes the reads from off-chip memory, by maximizing the overlap of input control points, and 2) tiling in registers minimizes the accesses to cache memory, by reusing the input control points for many voxels.

Second, we replace the weighted sum of the basic formula of BSI with trilinear interpolations, in a similar way as TH does. We calculate these trilinear interpolations using Fused Multiply-Add (FMA) instructions, which the GPU instruction set contains [8]. FMA increases both accuracy and speed in comparison to separate multiplication and addition instructions.

We give an in-depth description of our optimizations in the next sections.


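As a concrete reference for Eq. (1) and for the lerp-fusion idea behind our trilinear reformulation, the following sketch evaluates the 64-term weighted sum directly with the uniform cubic B-spline basis. It is an illustrative Python model, not the paper's optimized kernel; the key property it checks is that the basis weights sum to 1, which is what allows pairs of weighted addends to be folded into linear interpolations (one FMA each).

```python
def bspline_weights(t):
    """Uniform cubic B-spline basis B0..B3 at fractional offset t in [0, 1)."""
    return [
        (1 - t) ** 3 / 6.0,
        (3 * t**3 - 6 * t**2 + 4) / 6.0,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0,
        t**3 / 6.0,
    ]

def bsi_direct(phi, u, v, w):
    """Direct evaluation of Eq. (1): weighted sum over the 4x4x4
    neighborhood of control points phi[l][m][n]."""
    Bu, Bv, Bw = bspline_weights(u), bspline_weights(v), bspline_weights(w)
    return sum(
        Bu[l] * Bv[m] * Bw[n] * phi[l][m][n]
        for l in range(4) for m in range(4) for n in range(4)
    )

# Partition of unity: the four basis weights always sum to 1, so a constant
# control-point field is interpolated exactly.
ones = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
assert abs(sum(bspline_weights(0.3)) - 1.0) < 1e-12
assert abs(bsi_direct(ones, 0.3, 0.7, 0.5) - 1.0) < 1e-12

# Lerp fusion in 1D: for weights a, b, the sum a*x + b*y equals
# (a + b) * lerp(x, y, b / (a + b)) -- a multiply plus one FMA.
a, b = bspline_weights(0.3)[0], bspline_weights(0.3)[1]
x, y = 2.0, 5.0
s = a + b
assert abs((a * x + b * y) - s * (x + (b / s) * (y - x))) < 1e-12
```

Extending the 1D fusion to three dimensions combines eight weighted addends into one trilinear interpolation, which is the grouping depicted by the sub-cubes of Fig. 1.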

Fig. 3. Comparison of input loading and register optimization for Thread per Voxel with tiling (left) and Thread per Tile (right) for two neighboring tiles.
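The savings that Fig. 3 illustrates can be counted with a short sketch. The formulas below are our illustrative reconstruction of the reasoning in Appendix A (which is not reproduced in this excerpt): a block of b × b × b neighboring tiles in TT shares one union of control points, whereas TV-style tiling loads a full 4 × 4 × 4 neighborhood per tile. With the b = 4 thread blocks of Section 3.4, this recovers the roughly 12× reduction quoted in Section 3.2.1.

```python
def loads_tv(b):
    # TV with tiling: each tile's thread block loads its own 4x4x4
    # neighborhood of control points from global memory.
    return b**3 * 4**3

def loads_tt(b):
    # TT: one thread per tile; a block of b^3 tiles shares the union of the
    # overlapping neighborhoods, i.e., (b + 3)^3 control points.
    return (b + 3) ** 3

# 2D example of Fig. 3: two x-adjacent tiles.
assert 4 * 4 + 4 * 4 == 32   # TV: two separate 4x4 neighborhoods
assert 4 * (4 + 1) == 20     # TT: one shared 4x5 neighborhood

# 3D, with 4x4x4 blocks of tiles: 4096 vs. 343 loads, ~11.9x fewer.
ratio = loads_tv(4) / loads_tt(4)
```

This counting model is hypothetical in its exact form, but it matches the "about 12×" figure the paper reports for TT versus TV.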

3.2. Thread per Tile (TT)

In this section, we describe the optimization techniques that we deploy in our TT approach to BSI. We show how the input loading and register optimizations reduce memory accesses.

3.2.1. Input loading optimization

The main idea is to reduce loads from global memory by taking advantage of the overlap of tiles assigned to neighboring threads. Fig. 3 compares the TV approach with tiling (left), explained in Section 2.2, to our TT approach (right).

In TV, each block of threads works on a unique tile of voxels. Thus, each block requires 4^N input control points (Section 2.1.1). Therefore, for each tile, we need to move 4^N control points from global memory to shared memory. Step 1 in Fig. 3 (left) illustrates the required data movement from global memory to shared memory for a 2D example. In this example, we have two tiles and each tile is assigned to one block. The two tiles imply the movement of 4 × 4 + 4 × 4 control points from global memory to shared memory.

In TT, we assign one thread per tile to take advantage of overlapping neighboring tiles. Step 1 of Fig. 3 (right) illustrates the reduction in data movement to cache memory, with the overlap in the x-direction. Two tiles require only 4 × 5 control points. In 3D medical images, the reduction in data movement is more noticeable, because there is overlap in all three directions. As a result, our approach reduces the data movement from global memory dramatically. TT requires about 12× and about 187× (for 5 × 5 × 5 tiles) fewer memory transfers in comparison to TV and TH, respectively (Appendix A).

3.2.2. Register optimization

The second optimization technique that we apply to TT is based on two main ideas: 1) we load the control points for all voxels of the tile from cache memory only once, and 2) we keep the loaded control points in registers, which are the fastest on-chip memory, until thread execution finishes.

In TV, threads belonging to the same block work on individual voxels of the same tile. For every voxel belonging to the tile, the corresponding threads need to access exactly the same control points as all other threads of the block. Step 2 of Fig. 3 (left) illustrates the required data movement from shared memory to registers for a 2D example. In this example, each pixel is assigned to one thread, and for every four pixels the corresponding four threads need to read (from shared memory to registers) sixteen control points each (i.e., 4 × 16 reads for every four pixels).

In TT, the one-thread, one-tile assignment minimizes the data movement between cache memory and registers. For all voxels belonging to the tile, the corresponding thread needs to access from cache memory a unique set of control points that is different from the set accessed by any other thread of the block (there is overlap, though). By utilizing register tiling, the thread keeps the control points in registers, which are faster than cache memory [27], to process every voxel in the tile. Step 2 of Fig. 3 (right) illustrates the reduction in data movement. For every four pixels, the corresponding thread needs to read only sixteen control points (i.e., 1 × 16 reads for every four pixels).

3.3. Thread per Tile with Linear Interpolations (TTLI)

We extend TT by reformulating the triple sum of Eq. (1) to trilinear interpolations. The basic idea is that a linear interpolation

can replace an addition of two weighted addends. We can extend this to three dimensions, where we combine eight addends into a trilinear interpolation [16].

We calculate a trilinear interpolation as a combination of seven linear interpolations (in our implementation, we do not use the hardware interpolation unit, as this would prevent us from increasing input data locality and output data accuracy (Section 2.2)). The linear interpolations are beneficial to the performance of our approach because the compiler maps linear interpolations to FMA instructions. FMA instructions are preferable for two reasons. First, FMA is more accurate because it executes multiplication and addition in the same step, with a single rounding. Second, FMA is faster because it executes both multiplication and addition with a single instruction [28].

Fig. 1 illustrates the 4 × 4 × 4 neighborhood of control points that affect a tile of voxels. Each one of the 2 × 2 × 2 colored sub-cubes of control points corresponds to one trilinear interpolation. For each voxel in the tile, the respective thread calculates each one of the eight trilinear interpolations. The arithmetic operations that are needed for each trilinear interpolation (i.e., colored sub-cube) are independent, thus enabling Instruction Level Parallelism (ILP) [27].

3.4. Implementation details of TT and TTLI

Register tiling, which we employ in our approach, requires careful management of the registers. We explain some of our implementation decisions in the following paragraphs.

Register allocation. The deformation field of a 3D image requires 64 control points, and each control point comprises three values, one for each of the three coordinates (x, y, z). Therefore, we need 3 × 64 = 192 registers for the control points alone. The control point grid is aligned to the voxel grid and uniformly spaced, therefore we store the scalar B-spline coefficients in Look-Up Tables (LUTs). TT requires 235 registers in total, whereas TTLI requires 255 registers.

Thread block configuration. The amount of required registers limits the maximum active threads per Streaming Multiprocessor (SM) to 256 [8]. We arrange threads into blocks of 4 × 4 × 4 threads. We select this arrangement because a cube is the geometrical structure that maximizes overlap and consequently minimizes memory transfers (i.e., minimizes Eq. (A.4) in Appendix A).

Performance at low occupancy. Shared and cache memories are slower than registers, therefore TT keeps the control points in registers permanently. We arrange input data in such a way that there are no spills (although in TTLI we have to store a few control points in shared memory). Due to the large number of registers our approach requires, the occupancy of the GPU falls to 12.5% for CUDA Compute Capabilities (CC) before 7.x and to 25% for newer CC [8]. Despite the low occupancy, we can maximize resource utilization by using ILP and avoiding the use of cache memories. Our register-only approach increases the performance substantially [27].

3.5. Application of our approach to CPUs

We can apply our TTLI approach to the CPU implementation of BSI. Table 1 summarizes the main differences with the GPU implementations. Some optimizations are not fully applicable to the CPU implementation, because they are tailored to the GPU architecture. GPUs allow for more fine-grained parallelism in comparison to CPUs, which makes GPUs more efficient with small 3D groups of tiles with regards to cache and register management. We develop two parallel implementations of BSI on CPUs, which take advantage of the several cores and the SIMD units (SSE/AVX) that CPUs have [29,30]. SIMD units pack many single values, which we call elements, in a special register, called a vector, thus applying a form of register tiling.

Vector per Tile (VT). In this method, we parallelize by using SIMD vectors to simultaneously process many voxels of a tile. Each thread processes δ_x voxels simultaneously. We iterate through the y- and z-dimensions of the tile, δ_x voxels at a time. The drawback of this method is that a SIMD vector is not fully utilized if δ_x, a user-configurable parameter, is not a multiple of the SIMD vector length.

Vector per Voxel (VV). In this method, we parallelize by using SIMD vectors to simultaneously process each of the trilinear interpolations a single voxel requires. This means that, using the SIMD unit, each thread processes simultaneously all colored sub-cubes a voxel requires (Fig. 1). Conveniently, the SIMD vector length is equal to the number of sub-cubes.

Table 1
Differences between GPU and CPU implementations (✓ means that an optimization technique is used in the CPU implementation).

    Optimization            VT                    VV
    Input overlap           Only in x-direction   Only in x-direction
    Register tiling         Partially             ✓
    Linear interpolation    ✓                     ✓

4. Pre-clinical dataset acquisition

In order to test our implementations of BSI in a pre-clinical application scenario, we perform a pre-clinical study where we use FFD. We create a dataset (publicly available) [31] which consists of two sets of subjects and imaging modalities: 1) a patient-specific liver phantom [32] with DynaCT scanning, and 2) a porcine model with MRI scanning, to validate the registration process in-vivo. Table 2 lists the characteristics of the collected dataset. In this section, we describe the dataset in detail. We present evaluation results in Sections 6 and 7.

Patient-specific liver phantom. The patient-specific liver phantom presents a total of five tumors and a blood vessel tree. The liver phantom used in our experiments was produced by the ARTORG centre and Cascination® [32] and has been used by Teatini et al. for registration studies [33]. We performed three intra-operative CT scans (Artis Zeego, Siemens®) (DynaCT) of the liver phantom in the OR. For each scan, we apply non-rigid deformations to the phantom, which we try to correct through FFD (Phantom 1, Phantom 2, Phantom 3). An example of the liver phantom scans is visible in Fig. 4 a and b.

Porcine model. We performed a porcine study to acquire pre-operative (without pneumoperitoneum) and intra-operative (post-pneumoperitoneum) MRI scans. These were used to study the deformation that the liver undergoes due to pneumoperitoneum alone. We performed this study at Oslo University Hospital through the use of a 3T MRI scanner (Ingenia, Philips®) [34]. We performed pneumoperitoneum at 14 mmHg. Both MRI scans were performed with injection of contrast, as done in patients, to improve imaging of the liver parenchyma and blood vessels (flow rate 5.0 and volume 11.0, based on the weight of the animal at 55 kg). The MRI scans are thin-sliced (1.5 mm in Porcine 1 and 1 mm in Porcine 2) enhanced-T1 high-resolution isotropic volume examination (e-THRIVE) scans. The deformation of the liver due to pneumoperitoneum is visible in the differences between images (c) and (d) in Fig. 4 and is further explored in [35].

Table 2
Image characteristics.

    Registration pair   Resolution        Voxel count (millions)   Voxel spacing (mm)
    Phantom 1           512 × 228 × 385   44.94                    0.49 × 0.49 × 0.49
    Phantom 2           294 × 130 × 208   7.95                     0.90 × 0.90 × 0.90
    Phantom 3           294 × 130 × 208   7.95                     0.90 × 0.90 × 0.90
    Porcine 1           303 × 167 × 212   10.73                    0.94 × 0.94 × 1.00
    Porcine 2           267 × 169 × 237   10.70                    0.94 × 0.94 × 1.00

Fig. 4. Medical images used for pre-clinical evaluation of our optimized image registration through FFD. (a) and (b) show two DynaCT scans of the liver phantom, and (c)
and (d) are MRI scans of the porcine model, respectively without (c) and with pneumoperitoneum applied (d).
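As a quick consistency check on Table 2, the reported voxel counts are simply the products of the resolutions (values transcribed from the table; the tolerance accounts for rounding to two decimals):

```python
# (x, y, z) resolution and reported voxel count in millions, from Table 2.
images = {
    "Phantom 1": (512, 228, 385, 44.94),
    "Phantom 2": (294, 130, 208, 7.95),
    "Phantom 3": (294, 130, 208, 7.95),
    "Porcine 1": (303, 167, 212, 10.73),
    "Porcine 2": (267, 169, 237, 10.70),
}
for name, (x, y, z, mvox) in images.items():
    # Each reported count matches x * y * z to within rounding.
    assert abs(x * y * z / 1e6 - mvox) < 0.01, name
```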

5. B-spline interpolation evaluation

In this section, we evaluate our BSI implementations on GPUs and CPUs in terms of performance and accuracy, and compare them to state-of-the-art implementations.

5.1. Evaluation methodology

Configuration. In our evaluation, we use one CPU and two GPUs. The CPU is a quad-core Intel [email protected] GHz with Hyper-Threading. We use the gcc v5.4 compiler. To show the performance and stability among different GPU generations, we use two GPUs of different generations: 1) NVIDIA GeForce GTX 1050 (with Pascal architecture [8]), and 2) NVIDIA GeForce RTX 2070 (with Turing architecture [36]). We use CUDA SDK v9.2 for the first GPU and v10.1 for the second GPU. We use the CUDA event API to acquire the timing results.

Comparison baseline. We compare our approaches to the state-of-the-art BSI implementations (Section 2.2). For TH, we use the library from Ruijters et al. [24]. For TV, we create an implementation that is based on the recent literature [6,7,19]. This implementation of TV uses tiling and is tuned for the GPUs we use. We refer to this implementation as TV-tiling. We also compare to the optimized GPU implementation of the NiftyReg library [7], which does not use tiling, as GPU reference, and to the optimized CPU implementation of NiftyReg [7] as CPU reference. We refer to the NiftyReg implementations as NiftyReg (TV).

Dataset and metrics. We measure the timing information of BSI while applying registration on our dataset. We use two metrics to measure the performance: 1) time per voxel is the execution time necessary to interpolate a single voxel, and 2) speedup is the performance improvement over NiftyReg (TV).

Parameters. We select five different tile sizes to evaluate the behavior of the algorithms under different parameters, namely 3 × 3 × 3, 4 × 4 × 4, 5 × 5 × 5, 6 × 6 × 6, and 7 × 7 × 7. We select these tile sizes because they are centered around 5 × 5 × 5, which is the default tile size for non-rigid registration in NiftyReg.

5.2. GPU performance

Fig. 5 a and b show the average time per voxel for TH, NiftyReg (TV), TV-tiling, TT, and TTLI on the GTX 1050 and the RTX 2070 GPUs, respectively.

We make three main observations. First, TTLI is the fastest implementation in all cases. Second, the time per voxel is almost independent of the tile size for all implementations except TV-tiling, for which the thread block size changes with the tile size. The reasons are three. 1) Bigger tiles leave more threads inactive at the borders of the image. 2) Bigger tiles decrease the coalescence of GPU memory accesses; in our approach, a single thread stores an entire tile in the output (Fig. 3, Step 3). 3) If the number of SMs does not divide the number of blocks exactly, some SMs may remain idle (tail effect). In conclusion, the performance of our approach with regard to different tile sizes is a balance between the acceleration that the reduction of data movement offers and the deceleration that border effects and memory uncoalescence cause. Third, for all implementations the coefficient of variation (error bars show the standard deviation across the images of our dataset)

Fig. 5. Average time per voxel of the five registration pairs for various tile sizes on GTX 1050 GPU (a) and RTX 2070 GPU (b). Error bars depict the standard deviation of
time per voxel.

Fig. 6. Average speedup over NiftyReg (TV) for the five registration pairs with different tile sizes on the GTX 1050 GPU (a) and the RTX 2070 GPU (b). Error bars depict the
standard deviation of the speedup.

is less than 3%, which reflects that the image contents do not affect the performance. The reason is that BSI is regular, i.e., it operates on all voxels uniformly.

Fig. 6 a and b show the average speedup over NiftyReg (TV) for TH, TV-tiling, TT, and TTLI on the GTX 1050 and the RTX 2070 GPUs, respectively.

We make two observations. First, our TTLI approach is 6.5× (up to 7×) faster than NiftyReg (TV), on average. TTLI outperforms the second fastest (TT) by an average of 1.77× on the GTX 1050 and 1.5× on the RTX 2070. Second, TTLI shows similar speedups over NiftyReg (TV) on both Pascal architecture (GTX 1050) and Turing architecture (RTX 2070) GPUs, which demonstrates that our optimizations are widely applicable and performance-portable.

5.2.1. Analysis of performance limitations

This section describes the limitations that define the performance of our approach.

TT does not provide significant speedup over TV-tiling. The reason is that our TT approach reduces data movement significantly, which makes TT compute-bound. We observe with NVIDIA's Visual Profiler [37] that the compute utilization of TT is at about 90% of the peak. Since the amount of computation in TT is not reduced with respect to TV-tiling, the potential improvement is limited.

Reformulating the summation of TT to trilinear interpolations (Section 3.3) reduces the computational complexity of Eq. (1) to half (Appendix B) and increases the usage of FMA instructions. TTLI is 50–80% faster than TT. After removing the computational intensity problem, TTLI is no longer compute-bound. The main bottleneck is the uncoalescence of the output (Fig. 3, Step 3). In our experiments, fixing the uncoalescence proved more computationally costly than the uncoalescence itself.

Thread divergence, caused by the inactive threads at the borders of the image, reduces the computation throughput for both TT and TTLI.

Fig. 7. Average time per voxel (a) and speedup (b) of BSI for various tile sizes using our implementation of BSI on CPUs. Error bars depict the standard deviation.

With a 5 × 5 × 5 tile, TTLI achieves 670 GFLOP/s and 62 GB/s on the GTX 1050². The empirical limits [38] of the GTX 1050 are 2091 GFLOP/s and 95 GB/s. We observe that TTLI is close to the bandwidth limit, but not so close to the computation limit.

5.3. CPU performance

We apply our approach to BSI to our CPU implementations (Section 3.5). Fig. 7 a and b show respectively time per voxel and speedup results of our CPU approaches for different tile sizes.
We make four observations. First, our CPU implementations (VT and VV) outperform the baseline NiftyReg (TV) by an average of 4.12× and 3.30×, respectively. Second, for all implementations, larger tiles result in lower time per voxel, as they can take more advantage of the CPU cache hierarchy. This effect is more pronounced in VT, which achieves a speedup of almost 5× for the largest tiles. Third, the speedup of VT increases as the tile size increases because bigger tiles fill more slots of the SIMD vectors. VT is the fastest option when more than 3 slots are filled. Fourth, the speedup of VV does not increase, as the time per voxel of NiftyReg decreases at a faster rate than the time per voxel of VV. VV is the recommended option only for 3 × 3 × 3 tiles.

5.4. Accuracy

Our implementations employ FMA instructions, which are more accurate than regular multiplications [8], in the calculation of linear interpolations. In this section, we show the accuracy improvements that stem from FMA instructions. We create a high precision CPU implementation by using double precision arithmetic (64-bit floating point numbers) and we use this implementation as reference.
Tables 3 and 4 show respectively the average absolute error of all GPU implementations and all CPU implementations with respect to the high precision CPU implementation.
We draw three conclusions. First, our implementations that employ FMA instructions (i.e., TTLI on GPUs, VT and VV on CPUs) are almost two times more accurate than the rest. Second, TH is significantly less accurate than the rest of the implementations, as expected from the low accuracy of interpolation hardware [8]. TH is 3300× less accurate than TTLI. Third, most GPU implementations show accuracy values in the same order of magnitude as CPU implementations.

Table 5 (sic: Table 3)
Average absolute error of BSI approaches on GPUs with respect to a high precision CPU implementation.

  Implementation             Error (e−6)
  Texture Hardware           9245
  Thread per Voxel (Tiling)  5.5
  NiftyReg (TV) GPU          5.3
  Thread per Tile            5.6
  Thread per Tile (Interp.)  2.8

Table 4
Average absolute difference of BSI approaches on CPUs with respect to a high precision CPU implementation.

  Implementation     Error (e−6)
  NiftyReg (TV) CPU  6.0
  Vector per Tile    3.0
  Vector per Voxel   3.0

6. Registration evaluation

In this section, we evaluate the performance impact of our BSI implementations on the overall registration process.

6.1. Evaluation methodology

To test the contribution of our BSI implementations to the total time required for the registration of medical images, we integrate our TTLI approach into NiftyReg³ [7]. The control points in NiftyReg correspond to a coarse deformation field. We calculate the fine deformation field (i.e., the displacement of all voxels) by interpolating the coarse deformation field using BSI. We compare the total registration time with our BSI to the original NiftyReg registration, on our dataset presented in Section 4. We evaluate the performance of non-rigid registration on two platforms: a) a quad-core Intel [email protected] GHz CPU (with HyperThreading) and a GTX 1050 GPU, and b) a six-core Intel [email protected] GHz CPU (with HyperThreading) and an RTX 2070 GPU.
We set the tile size to 5 × 5 × 5, which is the default setting in NiftyReg.

² NVIDIA profiler (version 2019.4.0) does not provide metrics for counting FLOPs on the RTX 2070.
³ https://github.com/oresths/niftyreg_bsi
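The data-movement benefit of tiling that underlies these design choices can be sanity-checked against the external memory model of Appendix A. A minimal sketch follows; N = 64 control points and T = 125 (a 5 × 5 × 5 tile) come from the paper, while the image size M, transaction width L, and the 2 × 2 × 2 block shape are illustrative assumptions:

```python
# Transfer counts for the four loading strategies of the external memory
# model (Appendix A). M and L below are assumed values, not measurements.

def no_tiles(M, N, L):
    return N * M / L                                   # Eq. (A.1)

def hw_trilinear(M, L):
    return 2**3 * M / L                                # Eq. (A.2)

def block_per_tile(M, N, T, L):
    return N * M / (T * L)                             # Eq. (A.3)

def blocks_of_tiles(M, T, L, l, m, n):
    pts = (4 + l - 1) * (4 + m - 1) * (4 + n - 1)      # control points per block
    return pts * M / (l * m * n * T * L)               # Eq. (A.4)

M, N, T, L = 256**3, 64, 5 * 5 * 5, 32                 # assumed sizes
a = no_tiles(M, N, L)
b = hw_trilinear(M, L)
c = block_per_tile(M, N, T, L)
d = blocks_of_tiles(M, T, L, 2, 2, 2)

# Each strategy strictly reduces transfers: 2^3 < N, N/T < 2^3 for T > 8,
# and (4+l-1)(4+m-1)(4+n-1) < l*m*n*N for blocks with more than one tile.
assert a > b > c > d
```

With these values the tiled variants move orders of magnitude less data than the untiled baseline, which matches the bandwidth-limited behavior observed for TTLI above.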

6.2. Performance evaluation

Figs. 8 and 9 show the total registration time and the speedup of our approach on the two platforms.
We draw two major conclusions. First, registration with our BSI approach is faster in all images on both platforms. The speedup of registration is 1.30×, on average, on the platform with a GTX 1050 GPU, and 1.14× on the platform with an RTX 2070 GPU. Second, although the performance improvement of our BSI approach is almost the same for both GPUs, we do not observe the same results for the entire image registration. The reason resides in Amdahl's law [39]: while BSI represents 27% of the total registration time on the platform with a GTX 1050 GPU, it takes only 15% on the platform with an RTX 2070 GPU. As a result, the overall performance impact on the registration workflow depends on the characteristics of the compute platform.

Fig. 8. Time and speedup of registration with our improved BSI GPU approach on GTX 1050.

Fig. 9. Time and speedup of registration with our improved BSI GPU approach on RTX 2070.

7. Clinical validation of image registration

In this section, we present the validation of our implementation of accelerated FFD on our pre-clinical dataset described in Section 4.
Qualitative assessment. We perform qualitative assessment of the registration using a checkerboard validation procedure [40]. Our method provides accurate registration for the parenchyma (the outer shape of the liver is preserved correctly) for both the liver phantom and porcine model. Tumors and vessel structures of the phantom are consistent between images (Fig. 10), and vessel structures for the porcine model are also approximately correctly registered (Fig. 11).
Quantitative assessment. We create normalized difference images between the output of the registration and the target intra-operative image for three registration approaches: 1) affine, 2) proposed, and 3) original NiftyReg (Figs. 12 and 13). Table 5 shows the mean absolute error (MAE) for all images of our dataset. As expected, the mismatch to the target intra-operative image is greater with affine than with non-rigid registration approaches. The two non-rigid registration approaches perform almost equally (the average MAE across the five image pairs is 0.2164 for affine, 0.1240 for our approach, and 0.1249 for original NiftyReg).
In order to quantify how the different registration approaches affect the accuracy of the registration as output images, we apply the Structured Similarity Index Metric (SSIM) [41] to our dataset. With the SSIM, we measure the similarity between the output of the registration approach and the target intra-operative image (Table 5).
We make three observations. First, the non-rigid registration approaches have much higher similarity than the affine registration approach. Second, our approach and the original NiftyReg have almost equal similarities. Third, our approach gives slightly better similarity than the original NiftyReg approach. Further evaluation of accuracy of the registration can be inferred from the original studies performed by Modat et al. [7].

8. Discussion

In this work we optimize BSI and integrate it into FFD to accelerate the performance of medical image registration. However, our improved BSI can also be used in generic image interpolation applications, e.g., image zooming [42], by using image pixels as the control points.
The performance of image registration can be further improved by merging the other steps of FFD with B-spline interpolation. By optimizing the rest of the registration process, the execution time of the registration further diminishes, enabling new possibilities for fast intra-operative updates without intra-operative CT acquisitions, e.g., through liver models reconstructed with US [43] or through stereo video reconstructions [33].

Fig. 10. Comparison of registration through qualitative checkerboard assessment on liver phantom scans. (Left) shows the registration results using an affine registration.
(Right) shows the results of non-rigid FFD using our BSI implementation.

Fig. 11. Comparison of registration through qualitative checkerboard assessment on porcine liver scans. (Left) shows the registration results using an affine registration.
(Right) shows the results of non-rigid FFD using our BSI implementation.

Fig. 12. Comparison of registration through quantitative difference image assessment on liver phantom scans. (Left) shows results using an affine registration; (Center) shows
the results of non-rigid FFD using our BSI implementation; (Right) shows the results of non-rigid FFD using original NiftyReg.

Fig. 13. Comparison of registration through quantitative difference image assessment on porcine liver scans. (Left) shows results using an affine registration; (Center) shows
the results of non-rigid FFD using our BSI implementation; (Right) shows the results of non-rigid FFD using original NiftyReg.

Table 5
Mean absolute error (left) on normalised outputs of affine registration and non-rigid registration with our approach and original NiftyReg, using the intra-operative image as reference. Structured Similarity Index Metric (right) of the registration output, using the intra-operative image as reference.

  Registration   MAE                            SSIM
  pair           Affine   Proposed  NiftyReg    Affine   Proposed  NiftyReg
  Phantom 1      0.229    0.130     0.131       0.865    0.929     0.934
  Phantom 2      0.234    0.172     0.179       0.916    0.952     0.946
  Phantom 3      0.256    0.174     0.172       0.889    0.952     0.950
  Porcine 1      0.201    0.072     0.072       0.797    0.912     0.911
  Porcine 2      0.162    0.072     0.071       0.716    0.737     0.737
  Average        0.2164   0.1240    0.1249      0.8368   0.8963    0.8956
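As a cross-check of Table 5, the reported averages follow directly from the five per-pair values; the small sketch below recomputes them (tiny discrepancies are rounding in the published table, hence the tolerance):

```python
# Recompute the "Average" row of Table 5 from the per-pair values.

mae = {
    "Affine":   [0.229, 0.234, 0.256, 0.201, 0.162],
    "Proposed": [0.130, 0.172, 0.174, 0.072, 0.072],
    "NiftyReg": [0.131, 0.179, 0.172, 0.072, 0.071],
}
ssim = {
    "Affine":   [0.865, 0.916, 0.889, 0.797, 0.716],
    "Proposed": [0.929, 0.952, 0.952, 0.912, 0.737],
    "NiftyReg": [0.934, 0.946, 0.950, 0.911, 0.737],
}

mean = lambda xs: sum(xs) / len(xs)

reported_mae = {"Affine": 0.2164, "Proposed": 0.1240, "NiftyReg": 0.1249}
reported_ssim = {"Affine": 0.8368, "Proposed": 0.8963, "NiftyReg": 0.8956}

for k in mae:
    assert abs(mean(mae[k]) - reported_mae[k]) < 1e-3   # table rounding margin
    assert abs(mean(ssim[k]) - reported_ssim[k]) < 1e-3
```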

The speedup of image registration through optimized FFD is important not only for pneumoperitoneum compensation, but also for compensation of several other deformations that the liver commonly undergoes during surgery. If real-time registration is possible, FFD can be used in IGS to compensate for deformations that result from lifting the liver with a surgical instrument or resecting liver ligaments (liver mobilization).
A limitation of our current implementations is that they work only with control point grids that are aligned to the voxel grid and uniformly spaced. Uniform spacing is usually sufficient for medical images [6,19]. Support for non-uniform grids is possible with minimal changes (e.g., calculating B-spline basis function weights on-the-fly). We leave this support for future work.

9. Conclusion

This paper presents our approach to B-spline interpolation, which is optimized to reduce data movement. The key idea of our approach is to assign one worker thread per tile of voxels. This has two main advantages. First, data movement during input loading is significantly reduced. Second, the input control points can be kept in registers during the entire computation. To further enhance the performance of our implementation, we rearrange the weighted summation of control points into trilinear interpolations. This results in two key advantages. First, the trilinear interpolations reduce the computational load. Second, they increase the interpolation accuracy.

Our experimental evaluation on two sets of subjects and imaging modalities shows that our BSI approach offers improved performance and accuracy with respect to state-of-the-art implementations. TTLI, our best approach on GPUs, performs up to 7× faster in comparison to the other GPU implementations. Our implementations that use trilinear interpolations perform approximately 2× better than the others in regard to interpolation accuracy.
We integrate our BSI approach into the NiftyReg medical image registration library and validate it in a pre-clinical application scenario. Our approach improves the performance of non-rigid image registration by 30% and 14%, on average, on our two platforms with a GTX 1050 GPU and an RTX 2070 GPU, respectively. The improved performance reduces the computation time of image registration. Therefore, faster updates of the organ and its structures are possible during IGS.
As a result, non-rigid registration of medical images can benefit from our BSI approach on GPUs to greatly enhance the performance and accuracy of registration in time-critical applications (e.g., image guided surgery).

Declaration of Competing Interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Acknowledgments

This work is supported by High Performance Soft-tissue Navigation (HIPERNAV - H2020-MSCA-ITN-2016). HIPERNAV has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 722068. The authors would also like to thank the radiology staff at the Intervention Centre, Oslo University Hospital, who collaborated to perform the animal experiment on the porcine model. Juan Gómez-Luna and professor Onur Mutlu would like to thank VMware and all other industrial partners (especially Facebook, Google, Huawei, Intel, Microsoft) of the SAFARI research group.

Appendix A. Off-chip memory to on-chip memory data movement

We use the external memory model [44] to describe the data movement from off-chip memory to on-chip memory. We consider a 3D image. Let us define M as the total number of voxels, N = 64 as the number of control points, T as the number of voxels inside each tile, and L as the size, in words (words are 32-bits long, a common size for storing integer and real numbers), of transactions into the cache (i.e., transactions between off- and on-chip memory). The L sized memory transfers of the four cases we are interested in are:

a) No tiles: When we do not have tiles, for each of the M voxels, we need to transfer N control points from global memory to shared memory. Transfers happen in L sized chunks. Hence, the total number of transfers required is

  (N × M) / L    (A.1)

b) Hardware trilinear interpolation: Each voxel is affected by the 4³ control points surrounding it. However, if we use the texture unit to get their trilinear interpolations directly, only 2³ loads are required [16]. Therefore, when we utilize the texture hardware for loading the input, for each of the M voxels, we need to transfer 2³ control points from global memory to cache memory. Transfers happen in L sized chunks. Hence, the total number of transfers required is

  (2³ × M) / L    (A.2)

c) A block per tile: When we use a block for each tile, for each tile we need to transfer N control points from global memory to shared memory. Each tile contains T voxels, thus the total number of tiles is M/T. Transfers happen in L sized chunks. Hence, the total number of transfers required is

  (N × M) / (T × L)    (A.3)

d) Blocks of tiles: When we have 3D blocks of tiles, and each block contains l × m × n tiles, for each block we need to transfer (4 + l − 1) × (4 + m − 1) × (4 + n − 1) (Section 3.2.1) control points from global memory to shared memory (or cache). Each block contains l × m × n tiles and each tile contains T voxels, thus the total number of blocks is M/(l × m × n × T). Transfers happen in L sized chunks. Hence, the total number of transfers required is

  ((4 + l − 1) × (4 + m − 1) × (4 + n − 1) × M) / (l × m × n × T × L)    (A.4)

Observations. We make the following four observations. First, a hardware trilinear interpolation implementation requires fewer memory transfers than a no tiles implementation because 2³ < N in all cases. Second, a block per tile implementation requires fewer memory transfers than a hardware trilinear interpolation implementation because N/T < 2³ when T > 8. T ≤ 8 is a rare case (T is 125 by default in NiftyReg). Third, a blocks of tiles implementation requires fewer memory transfers than a block per tile implementation because ((4 + l − 1) × (4 + m − 1) × (4 + n − 1)) / (l × m × n) < N as long as a block contains more than one tile. Fourth, the CPU implementations are a special case of Eq. (A.4), in which l = m = 1, i.e., each thread processes contiguous tiles in the x-axis direction.

Appendix B. Computational complexity

In order to evaluate the arithmetic performance of TTLI and TT, we perform the computational analysis of both implementations in this section.
TT. For every voxel of the output image, we need to calculate the triple sum in Eq. (1). Each operand of the summation requires the multiplication of one control point (φ) with three weights (B). Thus, each voxel requires (64 summands) × (3 multiplications + 1 accumulation) − 1 = 255 vector (φ is a 3D vector in deformation fields) arithmetic operations. The calculation of Eq. (1) requires 4 + 4 + 4 = 12 scalar loads for the weights and 64 vector loads for the control points. If we use one weight for the Bl(u) · Bm(v) · Bn(w) product, instead of three individual weights, the required operations decrease to (64 summands) × (1 multiplication + 1 accumulation) − 1 = 127 (same as a parallel reduction) and the weights to be loaded increase to 4 × 4 × 4 = 64. This is not suitable for our register-only implementations, because there are not enough registers to store the 64 weights and the use of one of the caches would impact the performance substantially (Section 3.4).
TTLI. For every voxel of the output image, we reformulate the summation of the 4 × 4 × 4 weighted control points to trilinear interpolations. We divide the 4 × 4 × 4 cubic neighborhood into eight 2 × 2 × 2 sub-cubes, as in Fig. 1. Each sub-cube corresponds to a trilinear interpolation. A trilinear interpolation requires seven linear interpolations for its calculation. A linear interpolation has the form a + w ∗ (b − a), which equals a subtraction and a fused multiply-accumulate (FMA) operation. Thus, for the eight sub-cubes and the ninth final sub-cube that is formed by the eight results of the eight trilinear interpolations, we have

(9 cubes) × (7 linear interpolations) × (2 operations) = 126 operations for each voxel.
Observations. Without taking into consideration instruction dual-issue, the operation count equals 255 × (number of voxels) and 126 × (number of voxels), respectively.

References

[1] A. Bartoli, T. Collins, N. Bourdel, M. Canis, Computer assisted minimally invasive surgery: is medical computer vision the answer to improving laparosurgery? Med. Hypotheses 79 (6) (2012) 858–863, doi:10.1016/j.mehy.2012.09.007.
[2] S. Bernhardt, S.A. Nicolau, L. Soler, C. Doignon, The status of augmented reality in laparoscopic surgery as of 2016, Med. Image Anal. 37 (2017) 66–90, doi:10.1016/j.media.2017.01.007.
[3] A. Teatini, T. Langø, B. Edwin, O. Elle, et al., Assessment and comparison of target registration accuracy in surgical instrument tracking technologies, in: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2018, pp. 1845–1848.
[4] A. Sotiras, C. Davatzikos, N. Paragios, Deformable medical image registration: a survey, IEEE Trans. Med. Imaging 32 (7) (2013) 1153.
[5] D. Rueckert, L.I. Sonoda, C. Hayes, D.L. Hill, M.O. Leach, D.J. Hawkes, Nonrigid registration using free-form deformations: application to breast MR images, IEEE Trans. Med. Imaging 18 (8) (1999) 712–721, doi:10.1109/42.796284.
[6] N.D. Ellingwood, Y. Yin, M. Smith, C.L. Lin, Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs, Comput. Methods Programs Biomed. 127 (2016) 290–300, doi:10.1016/j.cmpb.2015.12.018.
[7] M. Modat, G.R. Ridgway, Z.A. Taylor, M. Lehmann, J. Barnes, D.J. Hawkes, N.C. Fox, S. Ourselin, Fast free-form deformation using graphics processing units, Comput. Methods Programs Biomed. 98 (3) (2010) 278–284, doi:10.1016/j.cmpb.2009.09.002.
[8] NVIDIA, CUDA C Programming Guide 9.0, 2017.
[9] E. Smistad, T.L. Falch, M. Bozorgi, A.C. Elster, F. Lindseth, Medical image segmentation on GPUs – a comprehensive review, Med. Image Anal. 20 (1) (2015) 1–18.
[10] J. Gai, N. Obeid, J.L. Holtrop, X.-L. Wu, F. Lam, M. Fu, J.P. Haldar, W.-m.W. Hwu, Z.-P. Liang, B.P. Sutton, More IMPATIENT: a gridding-accelerated Toeplitz-based strategy for non-Cartesian high-resolution 3D MRI on GPUs, J. Parallel Distrib. Comput. 73 (5) (2013) 686–697.
[11] S.S. Stone, J.P. Haldar, S.C. Tsao, W.-m.W. Hwu, B.P. Sutton, Z.-P. Liang, Accelerating advanced MRI reconstructions on GPUs, J. Parallel Distrib. Comput. 68 (10) (2008) 1307–1318, doi:10.1016/j.jpdc.2008.05.013.
[12] H. Wang, H. Peng, Y. Chang, D. Liang, A survey of GPU-based acceleration techniques in MRI reconstructions, Quant. Imaging Med. Surg. 8 (2) (2018) 196.
[13] T. Kalaiselvi, P. Sriramakrishnan, K. Somasundaram, Survey of using GPU CUDA programming model in medical image analysis, Inform. Med. Unlocked 9 (2017) 133–144.
[14] R. Palomar, J. Gómez-Luna, F.A. Cheikh, J. Olivares, O.J. Elle, High-performance computation of Bézier surfaces on parallel and heterogeneous platforms, Int. J. Parallel Program. 46 (6) (2018) 1035–1062.
[15] N. Satpute, R. Naseem, E. Pelanis, J. Gomez-Luna, F. Alaya Cheikh, O.J. Elle, J. Olivares, GPU acceleration of liver enhancement for tumor segmentation, Comput. Methods Programs Biomed. 184 (2020) 105285, doi:10.1016/j.cmpb.2019.105285.
[16] C. Sigg, M. Hadwiger, Fast third-order texture filtering, GPU Gems 2 (2005) 313–329.
[17] D. Ruijters, B.M. ter Haar Romeny, P. Suetens, Efficient GPU-based texture interpolation using uniform B-splines, J. Graph. GPU Game Tools 13 (4) (2008) 61–69, doi:10.1080/2151237X.2008.10129269.
[18] X. Du, J. Dang, Y. Wang, S. Wang, T. Lei, A parallel nonrigid registration algorithm based on B-spline for medical images, Comput. Math. Methods Med. (2016), doi:10.1155/2016/7419307.
[19] J.A. Shackleford, N. Kandasamy, G.C. Sharp, On developing B-spline registration algorithms for multi-core processors, Phys. Med. Biol. 55 (21) (2010) 6329–6351, doi:10.1088/0031-9155/55/21/001.
[20] I. Peterlík, H. Courtecuisse, R. Rohling, P. Abolmaesumi, C. Nguan, S. Cotin, S. Salcudean, Fast elastic registration of soft tissues under large deformations, Med. Image Anal. 45 (2018) 24–40.
[21] C.P. Lee, Z. Xu, R.P. Burke, R. Baucom, B.K. Poulose, R.G. Abramson, B.A. Landman, Evaluation of five image registration tools for abdominal CT: pitfalls and opportunities with soft anatomy, in: Medical Imaging 2015: Image Processing, 9413, International Society for Optics and Photonics, 2015, p. 94131N.
[22] J.S. Heiselman, L.W. Clements, J.A. Collins, J.A. Weis, A.L. Simpson, S.K. Geevarghese, T.P. Kingham, W.R. Jarnagin, M.I. Miga, Characterization and correction of soft tissue deformation in laparoscopic image-guided liver surgery, Journal of Medical Imaging (2) (2018), doi:10.1117/1.JMI.5.2.021203. In press.
[23] S.F. Johnsen, S. Thompson, M.J. Clarkson, M. Modat, Y. Song, J. Totz, K. Gurusamy, B. Davidson, Z.A. Taylor, D.J. Hawkes, S. Ourselin, Database-based estimation of liver deformation under pneumoperitoneum for surgical image-guidance and simulation, Lect. Notes Comput. Sci. 9350 (2015) 450–458.
[24] D. Ruijters, P. Thévenaz, GPU prefilter for accurate cubic B-spline interpolation, Comput. J. 55 (1) (2010) 15–20, doi:10.1093/comjnl/bxq086.
[25] F. Andersson, M. Carlsson, V.V. Nikitin, Fast algorithms and efficient GPU implementations for the Radon transform and the back-projection operator represented as convolution operators, SIAM J. Imaging Sci. 9 (2) (2016) 637–664.
[26] J. Carron, A. Lewis, Maximum a posteriori CMB lensing reconstruction, Phys. Rev. D 96 (6) (2017) 63510.
[27] V. Volkov, Better performance at lower occupancy, Proc. GPU Technol. Conf. (2010) 1–75.
[28] N. Whitehead, A. Fit-Florea, Precision & performance: floating point and IEEE 754 compliance for NVIDIA GPUs, NVIDIA White Paper 21 (10) (2011) 767–775, doi:10.1111/j.1468-2982.2005.00972.x.
[29] A. Fog, The Microarchitecture of Intel, AMD and VIA CPUs: an Optimization Guide for Assembly Programmers and Compiler Makers, 2018 ed., Technical University of Denmark, 2018.
[30] Intel, Intel intrinsics guide, 2019 (software.intel.com, retrieved January 17, 2019).
[31] O.J. Elle, A. Teatini, O. Zachariadis, Data for: accelerating B-spline interpolation on GPUs: application to medical image registration, Mendeley Data (2019), doi:10.17632/kj3xcd776k.1.
[32] A. Pacioni, M. Carbone, C. Freschi, R. Viglialoro, V. Ferrari, M. Ferrari, Patient-specific ultrasound liver phantom: materials and fabrication method, Int. J. Comput. Assist. Radiol. Surg. 10 (7) (2015) 1065–1075, doi:10.1007/s11548-014-1120-y.
[33] A. Teatini, W. Congcong, P. Rafael, A.C. Faouzi, B. Azeddine, E. Bjørn, E.O. Jakob, Validation of stereo vision based liver surface reconstruction for image guided surgery, in: Colour and Visual Computing Symposium (CVCS), IEEE, 2018, pp. 1–6.
[34] PHILIPS, Ingenia: instructions for use, 2014.
[35] A. Teatini, E. Pelanis, D.L. Aghayan, R.P. Kumar, R. Palomar, Å.A. Fretland, B. Edwin, O.J. Elle, The effect of intraoperative imaging on surgical navigation for laparoscopic liver resection surgery, Sci. Rep. 9 (1) (2019) 1–11.
[36] NVIDIA, NVIDIA Turing GPU Architecture Whitepaper, 2018.
[37] NVIDIA, Profiler User's Guide, September 2017.
[38] C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo, B. Friesen, B. Cook, D. Doerfler, L. Oliker, et al., An empirical roofline methodology for quantitatively assessing performance portability, in: 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), IEEE, 2018, pp. 14–23.
[39] G.M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, ACM, 1967, pp. 483–485.
[40] J.P. Pluim, S.E. Muenzing, K.A. Eppenhof, K. Murphy, The truth is hard to make: validation of medical image registration, in: International Conference on Pattern Recognition (ICPR), IEEE, 2016, pp. 2294–2300.
[41] A. Hore, D. Ziou, Image quality metrics: PSNR vs. SSIM, in: 2010 20th International Conference on Pattern Recognition, IEEE, 2010, pp. 2366–2369.
[42] M. Unser, Splines: a perfect fit for signal and image processing, IEEE Signal Process. Mag. 16 (6) (1999) 22–38.
[43] L.W. Clements, J.A. Collins, Y. Wu, A.L. Simpson, W.R. Jarnagin, M.I. Miga, Validation of model-based deformation correction in image-guided liver surgery via tracked intraoperative ultrasound: preliminary method and results, in: Medical Imaging 2015: Image-Guided Procedures, Robotic Interventions, and Modeling, 9415, International Society for Optics and Photonics, 2015, p. 94150T.
[44] H. Kim, R. Vuduc, S. Baghsorkhi, J. Choi, W.-m. Hwu, Performance analysis and tuning for general purpose Graphics Processing Units (GPGPU), Synth. Lect. Comput. Archit. 7 (2012) 1–96, doi:10.2200/S00451ED1V01Y201209CAC020.
[45] N. Satpute, R. Naseem, R. Palomar, O. Zachariadis, J. Gómez-Luna, F. Alaya Cheikh, J. Olivares, Fast parallel vessel segmentation, Comput. Methods Programs Biomed. (2020), doi:10.1016/j.cmpb.2020.105430.
[47] A. Teatini, J. Pérez de Frutos, B. Eigl, E. Pelanis, D.L. Aghayan, M. Lai, R.P. Kumar, R. Palomar, B. Edwin, O.J. Elle, Influence of sampling accuracy on augmented reality for laparoscopic image-guided surgery, Minimally Invasive Therapy & Allied Technologies (2020), doi:10.1080/13645706.2020.1727524.
