Accelerating B-spline interpolation on GPUs: Application to medical image registration
Computer Methods and Programs in Biomedicine 193 (2020) 105431
Article history: Received 7 November 2019; Revised 14 February 2020; Accepted 2 March 2020

Keywords: Medical image registration; Medical image processing; Parallel computing; GPU; B-splines

Abstract

Background and Objective: B-spline interpolation (BSI) is a popular technique in the context of medical imaging due to its adaptability and robustness in 3D object modeling. A field that utilizes BSI is Image Guided Surgery (IGS). IGS provides navigation using medical images, which can be segmented and reconstructed into 3D models, often through BSI. Image registration tasks also use BSI to transform medical imaging data collected before the surgery and intra-operative data collected during the surgery into a common coordinate space. However, such IGS tasks are computationally demanding, especially when applied to 3D medical images, due to the complexity and amount of data involved. Therefore, optimization of IGS algorithms is greatly desirable, for example, to perform image registration tasks intra-operatively and to enable real-time applications. A traditional CPU does not have sufficient computing power to achieve these goals and, thus, it is preferable to rely on GPUs. In this paper, we introduce a novel GPU implementation of BSI to accelerate the calculation of the deformation field in non-rigid image registration algorithms.

Methods: Our BSI implementation on GPUs minimizes the data that needs to be moved between memory and processing cores during loading of the input grid, and leverages the large on-chip GPU register file for reuse of input values. Moreover, we re-formulate our method as trilinear interpolations to reduce computational complexity and increase accuracy. To provide pre-clinical validation of our method and demonstrate its benefits in medical applications, we integrate our improved BSI into a registration workflow for compensation of liver deformation (caused by pneumoperitoneum, i.e., inflation of the abdomen) and evaluate its performance.

Results: Our approach improves the performance of BSI by an average of 6.5× and interpolation accuracy by 2× compared to three state-of-the-art GPU implementations. Through pre-clinical validation, we demonstrate that our optimized interpolation accelerates a non-rigid image registration algorithm, which is based on the Free Form Deformation (FFD) method, by up to 34%.

Conclusion: Our study shows that we can achieve significant performance and accuracy gains with our novel parallelization scheme that makes effective use of the GPU resources. We show that our method improves the performance of real medical imaging registration applications used in practice today.

© 2020 Published by Elsevier B.V.
https://doi.org/10.1016/j.cmpb.2020.105431
2 O. Zachariadis, A. Teatini and N. Satpute et al. / Computer Methods and Programs in Biomedicine 193 (2020) 105431
thread blocks to each tile, with one thread for each voxel of the tile.
Tiling enables the reuse of control points, which are the same for
the whole tile, by keeping them in the fast on-chip shared memory.
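The saving that tiling provides can be illustrated with a short sketch. For cubic BSI, each tile needs a 4 × 4 × 4 control-point neighborhood, and a block of l × m × n adjacent tiles needs only (4 + l − 1) × (4 + m − 1) × (4 + n − 1) control points in total because neighboring tiles overlap (this is the count the paper derives in Section 3.2.1 and Appendix A); the function name below is our own:

```python
def control_points_for_block(l, m, n):
    """Control points a block of l x m x n tiles needs for cubic BSI.

    Each tile alone needs a 4 x 4 x 4 neighborhood; adjacent tiles in a
    block overlap, so the block needs (4+l-1)(4+m-1)(4+n-1) points
    rather than 64 per tile.
    """
    return (4 + l - 1) * (4 + m - 1) * (4 + n - 1)

# Two neighboring tiles loaded independently: 2 x 64 = 128 control points.
independent = 2 * control_points_for_block(1, 1, 1)
# The same two tiles loaded as one 2 x 1 x 1 block share a 3 x 4 x 4 slab.
shared = control_points_for_block(2, 1, 1)
print(independent, shared)  # 128 80
```

Merging even two tiles into one block already cuts the control-point loads from 128 to 80, which is the reuse that keeping the points in shared memory (or registers) exploits.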
NiftyReg [7], a lightweight open-source medical image registration library, also uses the thread-per-voxel method. NiftyReg contains optimized implementations of BSI for both CPUs and GPUs. It is well-maintained, with competitive performance against other state-of-the-art implementations [20,21]. The GPU implementation uses a simple, straightforward TV method, which does not take advantage of tiling. The CPU implementation, however, exploits tiling by applying multi-core and vectorization optimizations.
Fig. 3. Comparison of input loading and register optimization for Thread per Voxel with tiling (left) and Thread per Tile (right) for two neighboring tiles.
Table 2
Image characteristics.
Fig. 4. Medical images used for pre-clinical evaluation of our optimized image registration through FFD. (a) and (b) show two DynaCT scans of the liver phantom, and (c)
and (d) are MRI scans of the porcine model, respectively without (c) and with pneumoperitoneum applied (d).
5. B-spline interpolation evaluation

In this section, we evaluate our BSI implementations on GPUs and CPUs in terms of performance and accuracy, and compare them to state-of-the-art implementations.

5.1. Evaluation methodology

Configuration In our evaluation, we use one CPU and two GPUs. The CPU is a quad-core Intel [email protected] GHz with Hyper-Threading. We use the gcc v5.4 compiler. To show the performance and stability among different GPU generations, we use two GPUs of different generations: 1) NVIDIA GeForce GTX 1050 (with Pascal architecture [8]), and 2) NVIDIA GeForce RTX 2070 (with Turing architecture [36]). We use CUDA SDK v9.2 for the first GPU and v10.1 for the second GPU. We use the CUDA event API to acquire the timing results.

Comparison baseline We compare our approaches to the state-of-the-art BSI implementations (Section 2.2). For TH, we use the library from Ruijters et al. [24]. For TV, we create an implementation that is based on the recent literature [6,7,19]. This implementation of TV uses tiling and is tuned for the GPUs we use. We refer to this implementation as TV-tiling. We also compare to the optimized GPU implementation of the NiftyReg library [7], which does not use tiling, as GPU reference, and the optimized CPU implementation of NiftyReg [7] as CPU reference. We refer to the NiftyReg implementations as NiftyReg (TV).

Dataset and metrics We measure the timing information of BSI while applying registration on our dataset. We use two metrics to measure the performance: 1) time per voxel is the execution time necessary to interpolate a single voxel, and 2) speedup is the performance improvement over NiftyReg (TV).

Parameters We select five different tile sizes to evaluate the behavior of the algorithms under different parameters, namely 3 × 3 × 3, 4 × 4 × 4, 5 × 5 × 5, 6 × 6 × 6, and 7 × 7 × 7. We select these tile sizes because they are centered around 5 × 5 × 5, which is the default tile size for non-rigid registration in NiftyReg.

5.2. GPU performance

Fig. 5a and b show the average time per voxel for TH, NiftyReg (TV), TV-tiling, TT, and TTLI on the GTX 1050 and the RTX 2070 GPUs, respectively.

Fig. 5. Average time per voxel of the five registration pairs for various tile sizes on GTX 1050 GPU (a) and RTX 2070 GPU (b). Error bars depict the standard deviation of time per voxel.

We make three main observations. First, TTLI is the fastest implementation in all cases. Second, the time per voxel is almost independent of the tile size for all implementations except TV-tiling, for which the thread block size changes with the tile size. There are three reasons for this. 1) Bigger tiles leave more threads inactive at the borders of the image. 2) Bigger tiles decrease the coalescence of GPU memory accesses. In our approach, a single thread stores an entire tile in the output (Fig. 3, Step 3). 3) If the number of SMs does not divide the number of blocks exactly, some SMs may remain idle (tail effect). In conclusion, the performance of our approach with regard to different tile sizes is a balance between the acceleration that the reduction of data movement offers and the deceleration that border effects and memory uncoalescence cause. Third, for all implementations the coefficient of variation (error bars show the standard deviation across the images of our dataset) is less than 3%, which reflects that the image contents do not affect the performance. The reason is that BSI is regular, i.e., it operates on all voxels uniformly.

Fig. 6a and b show the average speedup over NiftyReg (TV) for TH, TV-tiling, TT, and TTLI on the GTX 1050 and the RTX 2070 GPUs, respectively.

Fig. 6. Average speedup over NiftyReg (TV) for the five registration pairs with different tile sizes on the GTX 1050 GPU (a) and the RTX 2070 GPU (b). Error bars depict the standard deviation of the speedup.

We make two observations. First, our TTLI approach is 6.5× (up to 7×) faster than NiftyReg (TV), on average. TTLI outperforms the second fastest (TT) by an average of 1.77× on the GTX 1050 and 1.5× on the RTX 2070. Second, TTLI shows similar speedups over NiftyReg (TV) on both Pascal architecture (GTX 1050) and Turing architecture (RTX 2070) GPUs, which demonstrates that our optimizations are widely applicable and performance-portable.

5.2.1. Analysis of performance limitations

This section describes the limitations that define the performance of our approach.

TT does not provide significant speedup over TV-tiling. The reason is that our TT approach reduces data movement significantly, which makes TT compute-bound. We observe with NVIDIA's Visual Profiler [37] that the compute utilization of TT is at about 90% of the peak. Since the amount of computation in TT is not reduced with respect to TV-tiling, the potential improvement is limited.

Reformulating the summation of TT as trilinear interpolations (Section 3.3) reduces the computational complexity of Eq. (1) to half (Appendix B) and increases the usage of FMA instructions. TTLI is 50–80% faster than TT. After removing the computational intensity problem, TTLI is no longer compute-bound. The main bottleneck is the uncoalescence of the output (Fig. 3, Step 3). In our experiments, fixing the uncoalescence proved more computationally costly than the uncoalescence itself.

Thread divergence, caused by the inactive threads at the borders of the image, reduces the computation throughput for both TT and TTLI.
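The tail effect mentioned above can be sketched with a simplified model. The sketch below is our own illustration, not the paper's method: it assumes, hypothetically, that each SM runs a fixed number of thread blocks concurrently, so the grid executes in "waves" and the last partial wave leaves some SMs idle:

```python
import math

def waves_and_tail(num_blocks, num_sms, blocks_per_sm=1):
    """Estimate full waves and the tail for a grid of thread blocks.

    Simplifying assumption: each SM runs `blocks_per_sm` blocks at a
    time. When num_blocks is not a multiple of the concurrent capacity,
    the final (partial) wave leaves some SMs idle (the tail effect).
    """
    capacity = num_sms * blocks_per_sm
    full_waves = num_blocks // capacity
    tail_blocks = num_blocks % capacity
    idle_sms = 0 if tail_blocks == 0 else num_sms - math.ceil(tail_blocks / blocks_per_sm)
    return full_waves, tail_blocks, idle_sms

# Bigger tiles mean fewer, larger blocks, so the partial last wave
# becomes a larger fraction of the total runtime.
print(waves_and_tail(num_blocks=100, num_sms=36))  # (2, 28, 8)
```

With 100 blocks on 36 SMs, the last wave occupies only 28 SMs and leaves 8 idle; with fewer blocks overall (larger tiles), that idle fraction weighs more heavily.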
Fig. 7. Average time per voxel (a) and speedup (b) of BSI for various tile sizes using our implementation of BSI on CPUs. Error bars depict the standard deviation.
5.4. Accuracy

Our implementations employ FMA instructions, which are more accurate than regular multiplications [8], in the calculation of linear interpolations. In this section, we show the accuracy improvements that stem from FMA instructions. We create a high-precision CPU implementation by using double-precision arithmetic (64-bit floating-point numbers) and we use this implementation as reference.

Tables 3 and 4 show the average absolute error of all GPU implementations and all CPU implementations, respectively, with respect to the high-precision CPU implementation.

We draw three conclusions. First, our implementations that employ FMA instructions (i.e., TTLI on GPUs, VT and VV on CPUs) are almost two times more accurate than the rest. Second, TH is significantly less accurate than the rest of the implementations, as expected from the low accuracy of interpolation hardware [8]. TH is 3300× less accurate than TTLI. Third, most GPU implementations

In this section, we evaluate the performance impact of our BSI implementations on the overall registration process.

6.1. Evaluation methodology

To test the contribution of our BSI implementations to the total time required for the registration of medical images, we integrate our TTLI approach into NiftyReg³ [7]. The control points in NiftyReg correspond to a coarse deformation field. We calculate the fine deformation field (i.e., the displacement of all voxels) by interpolating the coarse deformation field using BSI. We compare the total registration time with our BSI to the original NiftyReg registration, on our dataset presented in Section 4. We evaluate the performance of non-rigid registration on two platforms: a) a quad-core Intel [email protected] GHz CPU (with Hyper-Threading) and a GTX 1050 GPU, and b) a six-core Intel [email protected] GHz CPU (with Hyper-Threading) and an RTX 2070 GPU.

We set the tile size to 5 × 5 × 5, which is the default setting in NiftyReg.

² NVIDIA profiler (version 2019.4.0) does not provide metrics for counting FLOPs on the RTX 2070.
³ https://github.com/oresths/niftyreg_bsi
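The double-precision reference methodology of Section 5.4 can be sketched in a few lines: evaluate the linear-interpolation form a + w · (b − a) in single precision and measure its average absolute error against a float64 reference. This is an illustrative sketch on random operands, not the paper's actual measurement:

```python
import numpy as np

rng = np.random.default_rng(0)
a64 = rng.random(100_000)  # float64 operands serve as the reference
b64 = rng.random(100_000)
w64 = rng.random(100_000)

ref = a64 + w64 * (b64 - a64)  # double-precision reference result

# Same computation in single precision, then compare to the reference.
a32, b32, w32 = (x.astype(np.float32) for x in (a64, b64, w64))
out32 = (a32 + w32 * (b32 - a32)).astype(np.float64)

avg_abs_err = float(np.mean(np.abs(out32 - ref)))
print(f"average absolute error vs. float64 reference: {avg_abs_err:.2e}")
```

The same comparison against a float64 reference is what Tables 3 and 4 report for the actual GPU and CPU implementations.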
Figs. 8 and 9 show the total registration time and the speedup of our approach on the two platforms.

We draw two major conclusions. First, registration with our BSI approach is faster for all images on both platforms. The speedup of registration is 1.30×, on average, on the platform with a GTX 1050 GPU, and 1.14× on the platform with an RTX 2070 GPU. Second, although the performance improvement of our BSI approach is almost the same for both GPUs, we do not observe the same results for the entire image registration. The reason resides in Amdahl's law [39]: while BSI represents 27% of the total registration time on the platform with a GTX 1050 GPU, it takes only 15% on the platform with an RTX 2070 GPU. As a result, the overall performance impact on the registration workflow depends on the characteristics of the compute platform.

8. Discussion

In this work, we optimize BSI and integrate it into FFD to accelerate the performance of medical image registration. However, our improved BSI can also be used in generic image interpolation applications, e.g., image zooming [42], by using image pixels as the control points.

The performance of image registration can be further improved by merging the other steps of FFD with B-spline interpolation. By optimizing the rest of the registration process, the execution time of the registration further diminishes, enabling new possibilities for fast intra-operative updates without intra-operative CT acquisitions, e.g., through liver models reconstructed with US [43] or through stereo video reconstructions [33].
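The Amdahl's law argument for the registration speedups can be checked numerically. Assuming the measured ~6.5× average BSI speedup applies to the BSI share of the runtime (27% on the GTX 1050 platform, 15% on the RTX 2070 platform), the predicted overall speedups match the measured 1.30× and 1.14× closely:

```python
def amdahl_speedup(accelerated_fraction, part_speedup):
    """Overall speedup when only a fraction of the runtime is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / part_speedup)

# BSI share of total registration time, with the ~6.5x average BSI speedup:
print(round(amdahl_speedup(0.27, 6.5), 2))  # ~1.3  (GTX 1050 platform)
print(round(amdahl_speedup(0.15, 6.5), 2))  # ~1.15 (RTX 2070 platform)
```

The smaller BSI share on the RTX 2070 platform is exactly why the same BSI speedup yields a smaller end-to-end gain there.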
Fig. 10. Comparison of registration through qualitative checkerboard assessment on liver phantom scans. (Left) shows the registration results using an affine registration.
(Right) shows the results of non-rigid FFD using our BSI implementation.
Fig. 11. Comparison of registration through qualitative checkerboard assessment on porcine liver scans. (Left) shows the registration results using an affine registration.
(Right) shows the results of non-rigid FFD using our BSI implementation.
Fig. 12. Comparison of registration through quantitative difference image assessment on liver phantom scans. (Left) shows results using an affine registration; (Center) shows
the results of non-rigid FFD using our BSI implementation; (Right) shows the results of non-rigid FFD using original NiftyReg.
Fig. 13. Comparison of registration through quantitative difference image assessment on porcine liver scans. (Left) shows results using an affine registration; (Center) shows
the results of non-rigid FFD using our BSI implementation; (Right) shows the results of non-rigid FFD using original NiftyReg.
Table 5
Mean absolute error (left) on normalised outputs of affine registration and non-rigid registration with our approach and original NiftyReg, using the intra-operative image as reference. Structural Similarity Index Metric (right) of the registration output, using the intra-operative image as reference.
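The two metrics in Table 5 can be sketched as follows. This is an illustrative implementation: the mean absolute error is standard, while the SSIM here uses the common global (single-window) form with the usual default constants [41]; the paper's exact SSIM settings are not stated in this section:

```python
import numpy as np

def mean_absolute_error(img, ref):
    """Mean absolute error between a registered image and the reference."""
    return float(np.mean(np.abs(img.astype(np.float64) - ref.astype(np.float64))))

def global_ssim(img, ref, data_range=1.0):
    """Single-window (global) SSIM; libraries typically use local windows."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    x, y = img.astype(np.float64), ref.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(1)
ref = rng.random((64, 64))
print(mean_absolute_error(ref, ref), global_ssim(ref, ref))  # 0.0 1.0
```

An identical image scores an MAE of 0 and an SSIM of 1; registration quality is reflected in how close the registered output gets to these ideals against the intra-operative reference.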
Our experimental evaluation on two sets of subjects and imaging modalities shows that our BSI approach offers improved performance and accuracy with respect to state-of-the-art implementations. TTLI, our best approach on GPUs, performs up to 7× faster in comparison to the other GPU implementations. Our implementations that use trilinear interpolations perform approximately 2× better than the others with regard to interpolation accuracy.

We integrate our BSI approach into the NiftyReg medical image registration library and validate it in a pre-clinical application scenario. Our approach improves the performance of non-rigid image registration by 30% and 14%, on average, on our two platforms with a GTX 1050 GPU and an RTX 2070 GPU, respectively. The improved performance reduces the computation time of image registration. Therefore, faster updates of the organ and its structures are possible during IGS.

As a result, non-rigid registration of medical images can benefit from our BSI approach on GPUs to greatly enhance the performance and accuracy of registration in time-critical applications (e.g., image guided surgery).

Declaration of Competing Interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Acknowledgments

This work is supported by High Performance Soft-tissue Navigation (HIPERNAV - H2020-MSCA-ITN-2016). HIPERNAV has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 722068. The authors would also like to thank the radiology staff at the Intervention Centre, Oslo University Hospital, who collaborated to perform the animal experiment on the porcine model. Juan Gómez-Luna and Professor Onur Mutlu would like to thank VMware and all other industrial partners (especially Facebook, Google, Huawei, Intel, Microsoft) of the SAFARI research group.

Appendix A. Off-chip memory to on-chip memory data movement

We use the external memory model [44] to describe the data movement from off-chip memory to on-chip memory. We consider a 3D image. Let us define M as the total number of voxels, N = 64 as the number of control points, T as the number of voxels inside each tile, and L as the size, in words (words are 32 bits long, a common size for storing integer and real numbers), of transactions into the cache (i.e., transactions between off- and on-chip memory). The L-sized memory transfers of the three cases we are interested in are:

a) No tiles: When we do not have tiles, for each of the M voxels, we need to transfer N control points from global memory to shared memory. Transfers happen in L-sized chunks. Hence, the total number of transfers required is

(N × M) / L    (A.1)

b) Hardware trilinear interpolation: Each voxel is affected by the 4³ control points surrounding it. However, if we use the texture unit to get their trilinear interpolations directly, only 2³ loads are required [16]. Therefore, when we utilize the texture hardware for loading the input, for each of the M voxels, we need to transfer 2³ control points from global memory to cache memory. Transfers happen in L-sized chunks. Hence, the total number of transfers required is

(2³ × M) / L    (A.2)

c) A block per tile: When we use a block for each tile, for each tile we need to transfer N control points from global memory to shared memory. Each tile contains T voxels, thus the total number of tiles is M/T. Transfers happen in L-sized chunks. Hence, the total number of transfers required is

(N × M) / (T × L)    (A.3)

d) Blocks of tiles: When we have 3D blocks of tiles, and each block contains l × m × n tiles, for each block we need to transfer (4 + l − 1) × (4 + m − 1) × (4 + n − 1) (Section 3.2.1) control points from global memory to shared memory (or cache). Each block contains l × m × n tiles and each tile contains T voxels, thus the total number of blocks is M/(l × m × n × T). Transfers happen in L-sized chunks. Hence, the total number of transfers required is

((4 + l − 1) × (4 + m − 1) × (4 + n − 1) × M) / (l × m × n × T × L)    (A.4)

Observations We make the following four observations. First, a hardware trilinear interpolation implementation requires fewer memory transfers than a no tiles implementation because 2³ < N in all cases. Second, a block per tile implementation requires fewer memory transfers than a hardware trilinear interpolation implementation because N/T < 2³ when T > 8. T ≤ 8 is a rare case (T is 125 by default in NiftyReg). Third, a blocks of tiles implementation requires fewer memory transfers than a block per tile implementation because ((4 + l − 1) × (4 + m − 1) × (4 + n − 1)) / (l × m × n) < N as long as a block contains more than one tile. Fourth, the CPU implementations are a special case of Eq. (A.4), in which l = m = 1, i.e., each thread processes contiguous tiles in the x-axis direction.

Appendix B. Computational complexity

In order to evaluate the arithmetic performance of TTLI and TT, we perform the computational analysis of both implementations in this section.

TT For every voxel of the output image, we need to calculate the triple sum in Eq. (1). Each operand of the summation requires the multiplication of one control point (φ) with three weights (B). Thus, each voxel requires

(64 summands) × (3 multiplications + 1 accumulation) − 1 = 255

vector (φ is a 3D vector in deformation fields) arithmetic operations. The calculation of Eq. (1) requires 4 + 4 + 4 = 12 scalar loads for the weights and 64 vector loads for the control points. If we use one weight for the Bl(u) · Bm(v) · Bn(w) product, instead of three individual weights, the required operations decrease to

(64 summands) × (1 multiplication + 1 accumulation) − 1 = 127

(same as a parallel reduction) and the weights to be loaded increase to 4 × 4 × 4 = 64. This is not suitable for our register-only implementations, because there are not enough registers to store the 64 weights and the use of one of the caches would impact the performance substantially (Section 3.4).

TTLI For every voxel of the output image, we reformulate the summation of the 4 × 4 × 4 weighted control points to trilinear interpolations. We divide the 4 × 4 × 4 cubic neighborhood into eight 2 × 2 × 2 sub-cubes, as in Fig. 1. Each sub-cube corresponds to a trilinear interpolation. A trilinear interpolation requires seven linear interpolations for its calculation. A linear interpolation has the form a + w ∗ (b − a), which equals a subtraction and a fused multiply-accumulate (FMA) operation. Thus, for the eight sub-cubes and the ninth final sub-cube that is formed by the eight results of the eight trilinear interpolations, we have
(9 cubes) × (7 linear interpolations) × (2 operations) = 126 operations for each voxel.

Observations Without taking into consideration instruction dual-issue, the operation count equals 255 × (number of voxels) and 126 × (number of voxels), respectively.

References

[1] A. Bartoli, T. Collins, N. Bourdel, M. Canis, Computer assisted minimally invasive surgery: is medical computer vision the answer to improving laparosurgery? Med. Hypotheses 79 (6) (2012) 858–863, doi:10.1016/j.mehy.2012.09.007.
[2] S. Bernhardt, S.A. Nicolau, L. Soler, C. Doignon, The status of augmented reality in laparoscopic surgery as of 2016, Med. Image Anal. 37 (2017) 66–90, doi:10.1016/j.media.2017.01.007.
[3] A. Teatini, T. Langø, B. Edwin, O. Elle, et al., Assessment and comparison of target registration accuracy in surgical instrument tracking technologies, in: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2018, pp. 1845–1848.
[4] A. Sotiras, C. Davatzikos, N. Paragios, Deformable medical image registration: a survey, IEEE Trans. Med. Imaging 32 (7) (2013) 1153.
[5] D. Rueckert, L.I. Sonoda, C. Hayes, D.L. Hill, M.O. Leach, D.J. Hawkes, Nonrigid registration using free-form deformations: application to breast MR images, IEEE Trans. Med. Imaging 18 (8) (1999) 712–721, doi:10.1109/42.796284.
[6] N.D. Ellingwood, Y. Yin, M. Smith, C.L. Lin, Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs, Comput. Methods Programs Biomed. 127 (2016) 290–300, doi:10.1016/j.cmpb.2015.12.018.
[7] M. Modat, G.R. Ridgway, Z.A. Taylor, M. Lehmann, J. Barnes, D.J. Hawkes, N.C. Fox, S. Ourselin, Fast free-form deformation using graphics processing units, Comput. Methods Programs Biomed. 98 (3) (2010) 278–284, doi:10.1016/j.cmpb.2009.09.002.
[8] NVIDIA, CUDA C Programming Guide 9.0 (2017).
[9] E. Smistad, T.L. Falch, M. Bozorgi, A.C. Elster, F. Lindseth, Medical image segmentation on GPUs – a comprehensive review, Med. Image Anal. 20 (1) (2015) 1–18.
[10] J. Gai, N. Obeid, J.L. Holtrop, X.-L. Wu, F. Lam, M. Fu, J.P. Haldar, W.H. Wen-mei, Z.-P. Liang, B.P. Sutton, More IMPATIENT: a gridding-accelerated Toeplitz-based strategy for non-Cartesian high-resolution 3D MRI on GPUs, J. Parallel Distrib. Comput. 73 (5) (2013) 686–697.
[11] S.S. Stone, J.P. Haldar, S.C. Tsao, W.-m.W. Hwu, B.P. Sutton, Z.-P. Liang, Accelerating advanced MRI reconstructions on GPUs, J. Parallel Distrib. Comput. 68 (10) (2008) 1307–1318, doi:10.1016/j.jpdc.2008.05.013.
[12] H. Wang, H. Peng, Y. Chang, D. Liang, A survey of GPU-based acceleration techniques in MRI reconstructions, Quant. Imaging Med. Surg. 8 (2) (2018) 196.
[13] T. Kalaiselvi, P. Sriramakrishnan, K. Somasundaram, Survey of using GPU CUDA programming model in medical image analysis, Inform. Med. Unlocked 9 (2017) 133–144.
[14] R. Palomar, J. Gómez-Luna, F.A. Cheikh, J. Olivares, O.J. Elle, High-performance computation of Bézier surfaces on parallel and heterogeneous platforms, Int. J. Parallel Program. 46 (6) (2018) 1035–1062.
[15] N. Satpute, R. Naseem, E. Pelanis, J. Gomez-Luna, F. Alaya Cheikh, O.J. Elle, J. Olivares, GPU acceleration of liver enhancement for tumor segmentation, Comput. Methods Programs Biomed. 184 (2020) 105285, doi:10.1016/j.cmpb.2019.105285.
[16] C. Sigg, M. Hadwiger, Fast third-order texture filtering, GPU Gems 2 (2005) 313–329.
[17] D. Ruijters, B.M. ter Haar Romeny, P. Suetens, Efficient GPU-based texture interpolation using uniform B-splines, J. Graph. GPU Game Tools 13 (4) (2008) 61–69, doi:10.1080/2151237X.2008.10129269.
[18] X. Du, J. Dang, Y. Wang, S. Wang, T. Lei, A parallel nonrigid registration algorithm based on B-spline for medical images, Comput. Math. Methods Med. (2016), doi:10.1155/2016/7419307.
[19] J.A. Shackleford, N. Kandasamy, G.C. Sharp, On developing B-spline registration algorithms for multi-core processors, Phys. Med. Biol. 55 (21) (2010) 6329–6351, doi:10.1088/0031-9155/55/21/001.
[20] I. Peterlík, H. Courtecuisse, R. Rohling, P. Abolmaesumi, C. Nguan, S. Cotin, S. Salcudean, Fast elastic registration of soft tissues under large deformations, Med. Image Anal. 45 (2018) 24–40.
[21] C.P. Lee, Z. Xu, R.P. Burke, R. Baucom, B.K. Poulose, R.G. Abramson, B.A. Landman, Evaluation of five image registration tools for abdominal CT: pitfalls and opportunities with soft anatomy, in: Medical Imaging 2015: Image Processing, 9413, International Society for Optics and Photonics, 2015, p. 94131N.
[22] J.S. Heiselman, L.W. Clements, J.A. Collins, J.A. Weis, A.L. Simpson, S.K. Geevarghese, T.P. Kingham, W.R. Jarnagin, M.I. Miga, Characterization and correction of soft tissue deformation in laparoscopic image-guided liver surgery, J. Med. Imaging 5 (2) (2018), doi:10.1117/1.JMI.5.2.021203.
[23] S.F. Johnsen, S. Thompson, M.J. Clarkson, M. Modat, Y. Song, J. Totz, K. Gurusamy, B. Davidson, Z.A. Taylor, D.J. Hawkes, S. Ourselin, Database-based estimation of liver deformation under pneumoperitoneum for surgical image-guidance and simulation, Lect. Notes Comput. Sci. 9350 (2015) 450–458.
[24] D. Ruijters, P. Thévenaz, GPU prefilter for accurate cubic B-spline interpolation, Comput. J. 55 (1) (2010) 15–20, doi:10.1093/comjnl/bxq086.
[25] F. Andersson, M. Carlsson, V.V. Nikitin, Fast algorithms and efficient GPU implementations for the Radon transform and the back-projection operator represented as convolution operators, SIAM J. Imaging Sci. 9 (2) (2016) 637–664.
[26] J. Carron, A. Lewis, Maximum a posteriori CMB lensing reconstruction, Phys. Rev. D 96 (6) (2017) 63510.
[27] V. Volkov, Better performance at lower occupancy, Proc. GPU Technol. Conf. (2010) 1–75.
[28] N. Whitehead, A. Fit-Florea, Precision & performance: floating point and IEEE 754 compliance for NVIDIA GPUs, NVIDIA White Paper (2011).
[29] A. Fog, The Microarchitecture of Intel, AMD and VIA CPUs: An Optimization Guide for Assembly Programmers and Compiler Makers, 2018 ed., Technical University of Denmark, 2018.
[30] Intel, Intel intrinsics guide, 2019 (software.intel.com, retrieved January 17, 2019).
[31] O. Jakob Elle, A. Teatini, O. Zachariadis, Data for: accelerating B-spline interpolation on GPUs: application to medical image registration, Mendeley Data (2019), doi:10.17632/kj3xcd776k.1.
[32] A. Pacioni, M. Carbone, C. Freschi, R. Viglialoro, V. Ferrari, M. Ferrari, Patient-specific ultrasound liver phantom: materials and fabrication method, Int. J. Comput. Assist. Radiol. Surg. 10 (7) (2015) 1065–1075, doi:10.1007/s11548-014-1120-y.
[33] A. Teatini, W. Congcong, P. Rafael, A.C. Faouzi, B. Azeddine, E. Bjørn, E.O. Jakob, Validation of stereo vision based liver surface reconstruction for image guided surgery, in: Colour and Visual Computing Symposium (CVCS), IEEE, 2018, pp. 1–6.
[34] PHILIPS, Ingenia: instructions for use, 2014.
[35] A. Teatini, E. Pelanis, D.L. Aghayan, R.P. Kumar, R. Palomar, Å.A. Fretland, B. Edwin, O.J. Elle, The effect of intraoperative imaging on surgical navigation for laparoscopic liver resection surgery, Sci. Rep. 9 (1) (2019) 1–11.
[36] NVIDIA, NVIDIA Turing GPU Architecture Whitepaper (2018).
[37] NVIDIA, Profiler User's Guide (September 2017).
[38] C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo, B. Friesen, B. Cook, D. Doerfler, L. Oliker, et al., An empirical roofline methodology for quantitatively assessing performance portability, in: 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), IEEE, 2018, pp. 14–23.
[39] G.M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, ACM, 1967, pp. 483–485.
[40] J.P. Pluim, S.E. Muenzing, K.A. Eppenhof, K. Murphy, The truth is hard to make: validation of medical image registration, in: International Conference on Pattern Recognition (ICPR), IEEE, 2016, pp. 2294–2300.
[41] A. Hore, D. Ziou, Image quality metrics: PSNR vs. SSIM, in: 2010 20th International Conference on Pattern Recognition, IEEE, 2010, pp. 2366–2369.
[42] M. Unser, Splines: a perfect fit for signal and image processing, IEEE Signal Process. Mag. 16 (6) (1999) 22–38.
[43] L.W. Clements, J.A. Collins, Y. Wu, A.L. Simpson, W.R. Jarnagin, M.I. Miga, Validation of model-based deformation correction in image-guided liver surgery via tracked intraoperative ultrasound: preliminary method and results, in: Medical Imaging 2015: Image-Guided Procedures, Robotic Interventions, and Modeling, 9415, International Society for Optics and Photonics, 2015, p. 94150T.
[44] H. Kim, R. Vuduc, S. Baghsorkhi, J. Choi, W.-m. Hwu, Performance analysis and tuning for general purpose Graphics Processing Units (GPGPU), Synth. Lect. Comput. Archit. 7 (2012) 1–96, doi:10.2200/S00451ED1V01Y201209CAC020.
[45] N. Satpute, R. Naseem, R. Palomar, O. Zachariadis, J. Gómez-Luna, F. Alaya Cheikh, J. Olivares, Fast parallel vessel segmentation, Comput. Methods Programs Biomed. (2020), doi:10.1016/j.cmpb.2020.105430.
[47] A. Teatini, J. Pérez de Frutos, B. Eigl, E. Pelanis, D.L. Aghayan, M. Lai, R.P. Kumar, R. Palomar, B. Edwin, O.J. Elle, Influence of sampling accuracy on augmented reality for laparoscopic image-guided surgery, Minim. Invasive Ther. Allied Technol. (2020), doi:10.1080/13645706.2020.1727524.