A Survey of GPU-based Acceleration Techniques in MRI Reconstructions
Correspondence to: Prof. Dong Liang. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.
Email: [email protected].
Abstract: Image reconstruction in clinical applications of magnetic resonance imaging (MRI) has become increasingly complicated, yet diagnosis and treatment require very fast computation. Modern graphics processing unit (GPU) platforms make high-performance parallel computing available, and attractive to common consumers, for massively parallel reconstruction problems at commodity prices. GPUs have also become more and more important for reconstruction computation, especially as deep learning begins to be applied to MRI reconstruction. The motivation of this survey is to review GPU computing schemes for MRI image reconstruction and to provide a summary reference for researchers in the MRI community.
Keywords: Graphics processing unit (GPU); magnetic resonance imaging (MRI); reconstruction
Submitted Nov 28, 2017. Accepted for publication Mar 05, 2018.
doi: 10.21037/qims.2018.03.07
View this article at: http://dx.doi.org/10.21037/qims.2018.03.07
Figure 1 Comparison of the speed of calculation (FLOPS) and the speed of data movement (bandwidth, GB/s) between GPUs and CPUs over the years 2006 to 2018. Figure taken with kind permission from Ref. (10).
languages such as Sh/RapidMind, Brook and Accelerator (5,6). However, these languages are hard for common programmers to apply without corresponding programming training. To provide real convenience to GPU programmers, three frameworks, NVIDIA's Compute Unified Device Architecture (CUDA), Microsoft's DirectCompute and Apple/Khronos Group's OpenCL, provided more feasible GPU programming models that allow programmers to skip the full and explicit conversion of their data into graphical forms while still taking advantage of the high-performance computing speed of GPUs (7). Actually, a group at SGI had already implemented GPU computing for image reconstruction processing on an early Onyx workstation using the RealityEngine2.5 in 1994 (8). Because of the graphics-hardware limitations at that time, the SGI graphics-hardware implementation was about 100 times slower than a single-core CPU processor of 2004 (8). However, the performance of recent single-core CPU processors has developed much more slowly than that of multi-core GPU processors. Today, GPUs are a standard hardware component of current computers for graphics processing and are further designed as relatively independent frameworks for processing data-parallel problems, in which individual data elements can be assigned to separate logical cores for complex processing [as seen in (9)]. Figure 1 presents the evolution of the bandwidth and computation abilities of GPUs and CPUs in GB/s and GFLOP/s (i.e., billions of bytes of data moved per second and billions of floating point operations per second, under single- and double-precision situations) (10). Compared on a chip-to-chip basis against CPUs, GPUs have much better capability on both key indexes, the speed of calculation (FLOPS) and the speed of data movement (GB/s) (10,11). Therefore, this development shift between GPUs and CPUs gives researchers a new motivation to reconsider parallelizing the computations of medical imaging applications on GPU frameworks. By directly offloading the data-parallel part of a computation onto GPUs, the number of physical computers required can be greatly reduced to a minimum. The benefits are not only a reduced computer cost, but also less maintenance, space, power and cooling for whole-system operation inside any institute, school or hospital.

GPU computing

The physical architectures and processing models of GPUs and CPUs are very different, which is the main reason the computing throughput of a GPU is much higher than that of a CPU. As seen in Figure 2 (12), a GPU provides many data-parallel, high-memory-bandwidth and deeply multi-threaded cores for large numbers of simple computation tasks, whereas a CPU provides only a limited number of cores for highly complex computation tasks. Architecturally, as also seen in Figure 2, the difference between CPUs and GPUs in GFLOP/s computation capability comes from GPUs being highly specialized for compute-intensive, highly parallel computation: over 80% of their transistors are devoted to data processing rather than to data caching and flow control. On the contrary, CPUs are designed with a few cores and large cache memories so that complex software threads are easy to handle at one time. For instance, a typical GPU can have 100+ processing cores which can handle thousands of software threads simultaneously. In theory, the GPU's performance when processing thousands of software threads can therefore be 100x higher than processing on the CPU alone.
Figure 2 Comparison of CPU and GPU architectures: the GPU devotes more transistors to data processing (12). ALU, arithmetic logical unit; GPU, graphics processing unit; DRAM, dynamic random access memory.
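To make this thread model concrete, the following minimal CUDA sketch (an illustration with assumed array names and sizes, not code from any of the surveyed papers) launches one lightweight thread per k-space sample to apply a density-compensation weight:

#include <cuda_runtime.h>

// One thread per k-space sample: multiply each complex sample by its
// density-compensation weight. float2 keeps the sketch self-contained.
__global__ void weightSamples(float2 *kspace, const float *weights, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global sample index
    if (i < n) {
        kspace[i].x *= weights[i];                   // real part
        kspace[i].y *= weights[i];                   // imaginary part
    }
}

int main(void)
{
    const int n = 1 << 20;                           // 1M samples (assumed)
    float2 *d_kspace;  float *d_weights;
    cudaMalloc(&d_kspace,  n * sizeof(float2));
    cudaMalloc(&d_weights, n * sizeof(float));
    // ... copy acquired samples and weights to the device here ...

    int threads = 256;                               // threads per block
    int blocks  = (n + threads - 1) / threads;       // enough blocks to cover n
    weightSamples<<<blocks, threads>>>(d_kspace, d_weights, n);
    cudaDeviceSynchronize();

    cudaFree(d_kspace);
    cudaFree(d_weights);
    return 0;
}

Each thread does only a few arithmetic operations, but many thousands of them run concurrently, which is exactly the workload shape that the GPU architecture favors.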
Because of these special GPU architectures, which differ from general CPU architectures, GPU code can easily run algorithms in parallel. However, most general CPU-based algorithms are not directly suitable for GPUs, because they were designed around general CPU architectures. They therefore need to be redesigned into new, more power- and cost-efficient parallel algorithms that match the GPUs' special features. Such parallel code certainly brings higher performance for the developed algorithms, but it simultaneously brings more difficult debugging problems than general code.

As we know, GPUs were originally developed to accelerate the image processing speed of graphics cards. When the graphics mode is turned on, programmers can use GPU APIs such as OpenGL or DirectX to implement shading programs that customize the graphics pipelines of GPUs at run time, using high-level shading languages such as NVIDIA C for Graphics (Cg), the OpenGL Shading Language (GLSL), the Microsoft High-Level Shading Language (HLSL), the Adobe Graphics Assembly Language (AGAL), the Sony PlayStation Shader Language (PSSL), etc., which were originally designed for real-time rendering (2). Although early GPU computing programs achieved impressive accelerations in medical image processing (13-16), they suffered from several drawbacks. Firstly, GPU computing code was very difficult for entry-level programmers to develop to a qualified standard, because problems had to be expressed in terms of graphics concepts such as vertices, texture coordinates and fragments; secondly, the acceleration achieved by GPU computing code was compromised by the lack of access to all the capabilities of GPUs, such as shared memory and scattered writes; thirdly, code portability was constrained by the specific hardware features of some graphics extensions (8).

To solve these drawbacks, four major commercial framework solutions, CUDA, OpenCL, Stream and DirectCompute, have been deployed to generate parallel, higher-performance code for GPUs. Among them, CUDA was developed by NVIDIA; OpenCL is an open standard library developed by the Khronos Group; Stream was developed by AMD (ATI chips); and DirectCompute was developed by Microsoft (2). Among these solutions, CUDA is the one most widely used by programmers in computer graphics, image processing, computer vision, computational fluid dynamics (CFD) and many other fields to rewrite algorithms to be GPU-enabled and efficient. The primary advantage of the CUDA framework is that it brings a C/C++-like development environment and the parallel capabilities of GPU acceleration to programmers, without requiring them to have detailed knowledge of GPU hardware architectures. Although these frameworks are very helpful for employing GPUs in applications, there are also more consumable software packages, built on the CUDA and OpenCL libraries, for programmers who are not familiar with GPU programming and have limited C/C++ parallel programming experience. Here, several popular libraries should be mentioned: Thrust, cuFFT, cuSOLVER, cuSPARSE and cuDNN, which are widely used in applications ranging across signal processing and image processing (17,18).
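To give a flavor of how compact such library-based GPU code can be, here is a minimal Thrust sketch (an illustration with assumed sizes, not code taken from the cited references) that computes element-wise magnitudes and a parallel sum directly on the device:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <cmath>

// Functor applied element-wise on the GPU by thrust::transform.
struct Magnitude
{
    __host__ __device__ float operator()(float x) const { return fabsf(x); }
};

int main(void)
{
    thrust::device_vector<float> d_data(1024, -1.5f);   // data lives on the GPU
    // Apply the functor to every element in parallel, in place.
    thrust::transform(d_data.begin(), d_data.end(), d_data.begin(), Magnitude());
    // Parallel reduction (sum) on the device.
    float sum = thrust::reduce(d_data.begin(), d_data.end(), 0.0f);
    return sum > 0.0f ? 0 : 1;
}

No explicit kernel launch or thread indexing appears; the library chooses the launch configuration, which is exactly the convenience these packages are meant to provide.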
Thrust itself is a C++ template library for parallel GPU platforms modeled on the well-known CPU-based Standard Template Library (STL). It gives programmers a shortcut for prototyping high-performance CUDA applications with minimal programming effort through high-level interfaces that are fully interoperable with technologies such as C/C++ and CUDA (18). The cuFFT library provides a simple software interface, based on the well-known Cooley-Tukey and Bluestein algorithms, to obtain accurate Fourier transform (FT) results quickly, and its speed when computing fast Fourier transforms (FFTs) is up to about 10x faster than computing discrete Fourier transforms directly, for any complex or real-valued data sets. The cuSOLVER library provides a collection of dense and sparse direct solvers which can deliver significant accelerations for computer vision, CFD and linear optimization applications. The cuSPARSE library, which includes a sparse triangular solver, provides a collection of basic linear algebra subroutines for sparse matrices and can deliver up to about 8x faster performance than the well-known Intel Math Kernel Library (MKL); as a GPU-accelerated version of a complete standard library, it can deliver 6x to 17x faster performance than the Intel MKL. The cuDNN library is a GPU-accelerated library for deep neural networks (NNs), which provides highly tuned implementations of standard routines such as forward and backward convolution, pooling, normalization and activation layers (18). The cuDNN library allows researchers to focus on designing and training NNs and on developing software applications rather than spending much time on low-level GPU performance tuning. cuDNN has now been widely adopted by many deep learning frameworks, such as Caffe, TensorFlow, Theano, Torch and so on. These libraries are able to support most of the computational operations that arise in magnetic resonance image reconstruction. Therefore, backed by the competitive high-performance parallel computing that these libraries provide, GPU-based computing algorithms have been applied comprehensively in magnetic resonance image reconstruction, thanks to the GPUs' powerful parallel computing ability with multi-thread capabilities and multi-core architectures (19).

Magnetic resonance imaging (MRI) reconstruction

In clinical applications, MRI reconstruction calculations have become more and more complex for computers, while doctors and scientists urgently need to review patients' images without waiting too long for reconstruction processing. Currently, GPU computing is being increasingly investigated for clinical MRI reconstruction applications (Table 1). Many papers on GPU, MRI and reconstruction were published from 2005 to 2016, as seen in Figure 3; the plot illustrates the prevalence of GPU-based methods in the field of MRI reconstruction. It is clear that GPU-accelerated MRI reconstructions became much more widely applicable after the release of NVIDIA's CUDA in 2007 (47). The number of publications related to GPU and MRI keeps increasing quickly, but recently the publications on GPU, MRI and reconstruction have not grown at the same pace. The growth is slowing down because most GPU-accelerated algorithms for typical MRI reconstruction have already been well studied and implemented on GPUs. These GPU-accelerated methods can roughly be divided into four categories of MRI reconstruction with GPU computing, namely FT, parallel imaging (PI), compressed sensing (CS) and deep learning, which are introduced below; a summary of GPU-based MRI reconstruction methods is presented in Table 1. It is taken for granted here that GPUs are applied to accelerate deep learning applications, because GPUs are naturally suited to deep learning calculations. Furthermore, the CUDA library and NVIDIA hardware have come to dominate the field of GPU computation. It may seem that NVIDIA Tesla cards are no faster than NVIDIA GeForce cards, but this is an illusion: Tesla cards are more powerful than GeForce cards. In practice, the speed-up factors of GPU-based MRI reconstruction depend on the system platform, the GPU implementation and the reconstruction algorithm.

FT

Most MR imaging methods are designed around Fourier encoding, so MRI reconstruction methods include a basic FFT. The FFT implementation on CPUs is already quite efficient, but its GPU version can be accelerated further. Sumanaweera et al. implemented the Cartesian FFT as a multi-pass decimation-in-time butterfly algorithm on GPUs in the book chapter of Ref. (19). They presented several specific approaches for obtaining higher performance, such as using two pbuffers and balancing some of the computing load from the fragment processors onto the vertex processors and rasterizers.
Table 1 A summary of GPU-based MRI reconstruction methods

Ref. | Reconstruction method | Platform | Library | Speed-up
(21) | Non-uniform FFT (nonequispaced FFT) | NVIDIA GeForce GTX 8800 | NFFT library | 21–85×
(22) | Non-uniform FFT (conjugate gradient solver) | NVIDIA GeForce GTX 8800 | CUDA library | 10×
(25) | Gridding (reverse gridding of PROPELLER) | NVIDIA GeForce GTX 8800 | CUDA library | 8×
(26) | Gridding (reverse gridding optimization) | NVIDIA Tesla C2050 | CUDA library | 6–30×
(27,28) | Gridding (conjugate gradient linear solver) | NVIDIA Tesla M2070 | CUDA library | 26×
(29) | Parallel imaging (Cartesian SENSE, k-t SENSE) | NVIDIA GeForce GTX 8800 | CUDA library | 3–108×
(30) | Parallel imaging (radial SENSE) | NVIDIA GeForce GTX 280 | CUDA library | 10–12×
(31) | Parallel imaging (radial iterative SENSE) | NVIDIA GeForce GTX 280 | CUDA library | 2×
(32) | Parallel imaging (radial GRAPPA) | NVIDIA Tesla M2090 | CUDA library | –
(33) | Parallel imaging (GRAPPA operator gridding) | NVIDIA GeForce GTX 780 | CUDA library | 6–30×
(34) | Parallel imaging (radial ART) | NVIDIA GeForce GTX 580 | CUDA library | 15×
(35) | Compressed sensing (conjugate gradient solver) | NVIDIA GeForce GTX 280 | CUDA library | 200×
(36) | Compressed sensing (split Bregman regularization) | NVIDIA Tesla C2050 | CUDA library | 10×
(37) | Compressed sensing (3D radial cardiac MRI) | NVIDIA GeForce GTX 480 | CUDA library | 34–54×
(38) | Compressed sensing (ADMM algorithm) | NVIDIA GeForce GTX 650 | CUDA library | 30×
(39) | Compressed sensing (SENSE-type acquisition) | NVIDIA Tesla C2050 | CUDA library | 3×
(40) | Compressed sensing (L1-ESPIRiT algorithm) | NVIDIA Tesla K20m | CUDA library | 3–15×
(41) | Compressed sensing (cloud computing) | Amazon Elastic Compute Cloud | Gadgetron | 2–10×
(42) | Compressed sensing (field-compensated recon) | NVIDIA GeForce GTX 280 | CUDA library | 81–284×
(43) | Deep learning (convolutional neural network) | NVIDIA GeForce GTX TITAN | CUDA library | –
(44) | Deep learning (variational network) | NVIDIA Tesla M40 | CUDA library | –
(45) | Deep learning (residual regression, deep CNN) | NVIDIA GeForce GTX 1080 | CUDA library | –
(46) | Deep learning (manifold approximation, DNN) | NVIDIA Tesla P100 | CUDA library | –

MRI, magnetic resonance imaging; GPU, graphics processing unit; FFT, fast Fourier transform; CNN, convolutional neural network; DNN, deep neural network; HLSL, High Level Shading Language; CUDA, Compute Unified Device Architecture.
They also briefly illustrated high-performance experiments on MRI reconstruction and ultrasonic imaging. Then, Schiwietz et al. described an efficient GPU-based implementation of the non-Cartesian FFT (20), written directly on Microsoft's DirectX with C/C++ and HLSL. They implemented a look-up-table (LUT)-based Kaiser-Bessel window gridding algorithm and a Ram-Lak filtered back-projection method, and their preliminary results showed reconstruction times and image quality for the two GPU-based reconstruction algorithms that were comparable with CPU-based implementations for radial trajectories. In addition, Sørensen et al. presented a fast parallel GPU-accelerated algorithm to compute the nonequispaced FFT (21), in which the key step is the convolution (gridding) step of the transform, previously the most time-consuming part. The authors claimed that their GPU-accelerated convolution was up to 85 times faster than the open-source NUFFT library (48) when two MRI data sets sampled along radial and spiral trajectories were used to evaluate the algorithm's performance.
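To make the gridding (convolution interpolation) step concrete, the following CUDA sketch assigns one thread per acquired sample and scatters it onto its Cartesian neighborhood using a precomputed kernel look-up table. The kernel width, LUT layout and variable names are assumptions for illustration; this is not the implementation of Refs. (20,21).

#include <cuda_runtime.h>

#define WIDTH 4          // assumed convolution kernel width in grid cells

// One thread per acquired sample: scatter it onto the WIDTH x WIDTH
// neighborhood of Cartesian grid cells, weighting by a kernel value read
// from a precomputed look-up table (LUT). The output grid is assumed to
// be zero-initialized before the kernel is launched.
__global__ void gridSamples(const float2 *samples,   // complex k-space data
                            const float2 *coords,    // kx, ky in grid units
                            float2 *grid,            // output Cartesian grid
                            const float *lut,        // kernel LUT, lutSize bins
                            int nSamples, int gridSize, int lutSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nSamples) return;

    float kx = coords[i].x, ky = coords[i].y;
    int x0 = (int)floorf(kx) - WIDTH / 2;
    int y0 = (int)floorf(ky) - WIDTH / 2;

    for (int dy = 0; dy < WIDTH; ++dy) {
        for (int dx = 0; dx < WIDTH; ++dx) {
            int gx = x0 + dx, gy = y0 + dy;
            if (gx < 0 || gy < 0 || gx >= gridSize || gy >= gridSize) continue;
            // Distance from the sample to this grid cell, mapped to a LUT bin.
            float dist = hypotf(kx - gx, ky - gy);
            int bin = min((int)(dist / (0.5f * WIDTH) * (lutSize - 1)), lutSize - 1);
            float w = lut[bin];                       // e.g., a Kaiser-Bessel value
            // atomicAdd resolves races between samples hitting the same cell.
            atomicAdd(&grid[gy * gridSize + gx].x, w * samples[i].x);
            atomicAdd(&grid[gy * gridSize + gx].y, w * samples[i].y);
        }
    }
}

The atomic additions handle the race between samples that land on the same grid cell, a point that several of the GPU gridding papers address explicitly in their designs.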
Figure 3 Cumulative number of articles published on GPU, MRI and reconstruction from 2005 to 2016. GPU, graphics processing unit; MRI, magnetic resonance imaging.

Actually, before the NVIDIA CUDA library appeared in June 2007, any such GPU implementation had to be expressed through graphics APIs and shading languages. Currently, NVIDIA has released its easy-to-use CUDA framework, in which it provides the cuFFT library (49), an optimized GPU-based implementation of the FFT. There are two separate libraries, cuFFT and cuFFTW. The cuFFT library is designed to provide easy-to-use, high-performance FFT computation on NVIDIA GPU cards only, while the cuFFTW library is a porting tool provided so that existing FFTW-based projects can be moved to the GPU with a minimum amount of effort. Both libraries provide the following features (18): an O(nlogn) algorithm for different input data sizes; single-precision (i.e., 32-bit floating point) and double-precision (i.e., 64-bit floating point) computation; complex and real-valued input and output; execution of multiple 1D, 2D and 3D transforms simultaneously; in-place and out-of-place FFTs; and arbitrary intra- and inter-dimension element strides. Figure 4 shows an example of using CUDA's cuFFT library to calculate a two-dimensional FFT, similar to Ref. (49); such calls can simply be embedded in general code, bringing GPU-accelerated computation to arbitrary projects.

Stone et al. presented an anatomically constrained MR reconstruction algorithm based on NVIDIA's CUDA library for non-Cartesian MR data (22). Their algorithm finds the solution of a quasi-Bayesian estimation problem that is typical in MRI reconstruction. Their results showed that the algorithm could reduce the reconstruction time of an advanced non-uniform reconstruction of in vivo data from 23 minutes on a quad-core CPU to about 1 minute on a Quadro Plex cluster, which makes it applicable to accelerating MR reconstruction in many clinical applications. Besides, Yang et al. presented optimized interpolators to approximate the non-uniform FT of a finitely supported function in the inversion of non-Cartesian data (23). According to their simulations, the interpolators lead to iterative non-Cartesian inversion algorithms with reduced memory demands on the memory-limited early GPU systems. Guo et al. improved a grid-driven interpolation algorithm for the PROPELLER trajectory in real-time non-Cartesian applications (24); their GPU-based method was about 9 times faster than their CPU implementation while achieving comparable motion-correction accuracy and image quality.

Figure 4 Computing a 2D FFT of size NX × NY using CUDA's cuFFT library (49). FFT, fast Fourier transform; NX, the number of points along the X axis; NY, the number of points along the Y axis.
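In the spirit of the listing summarized by the Figure 4 caption above, the sketch below shows a minimal 2D complex-to-complex transform with cuFFT (assumed sizes, no error checking; a generic illustration rather than the exact code of Ref. (49)):

#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int NX = 256, NY = 256;              // assumed matrix size

    cufftComplex *d_data;                      // k-space / image buffer on the GPU
    cudaMalloc(&d_data, sizeof(cufftComplex) * NX * NY);
    // ... copy NX x NY complex k-space samples into d_data here ...

    cufftHandle plan;
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);     // plan a 2D complex-to-complex FFT

    // Inverse transform: k-space -> image (use CUFFT_FORWARD for the other direction).
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
    cudaDeviceSynchronize();                   // wait for the transform to finish
    // Note: cuFFT transforms are unnormalized; scale by 1/(NX*NY) if needed.

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}

A plan is created once and can be reused for every frame or coil, which is how such calls are typically embedded into a reconstruction pipeline.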
Moreover, Yang et al. presented a CUDA-based algorithm to raise the reconstruction efficiency of the conventional PROPELLER trajectory (25). They developed a reverse gridding algorithm to reduce the computational complexity: unlike the conventional gridding algorithm, which generates a grid window for every trajectory point, their algorithm calculates a trajectory window for every grid point, and the contribution of each k-space point inside the convolution window is accumulated for that grid point. Their experiments showed a reconstruction speed 7.5 times faster than that of the conventional gridding algorithm. Besides, Obeid et al. proposed a modified GPU-based gridding method that achieves up to 29x acceleration for three-dimensional gridding (26); their solution was to allow bins to contain a variable number of sample points without sacrificing rapid access. Furthermore, a GPU-accelerated software toolkit for reconstructing data from arbitrary 3D trajectories has been released in Refs. (27,28). It is named the Illinois Massively Parallel Acquisition Toolkit for Image reconstruction with ENhanced Throughput in MRI (IMPATIENT MRI). In this toolkit, the authors removed computational bottlenecks by using a gridding approach to accelerate the computation of the data structures used by the previous routines. They further enhanced the toolkit with off-resonance correction and multi-sensor PI reconstruction, with a speed-up of about 200 times compared with their previous routines (27), and obtained efficient trajectories for high spatial and temporal resolution in their applications (28).

PI

PI (PMRI) techniques reconstruct under-sampled k-space data by exploiting the complementary image information obtained from multiple receive coils. A large number of PMRI reconstruction techniques have been proposed (50); currently, the most well-known are SMASH (51), SENSE (52) and GRAPPA (53). Because the acquired data are under-sampled in k-space, most of these techniques require additional coil sensitivity maps to remove the aliasing artifacts. Broadly, current PMRI methods can be roughly classified into two types (50): one reconstructs in image space and involves an unfolding or inversion procedure, for example SENSE (52); the other reconstructs in k-space and involves kernel calibration followed by recovery of the missing k-space data, for instance SMASH (51) and GRAPPA (53).

SENSE and SENSE-derived methods have been implemented on GPUs (29-33). For example, GPU-based implementations of Cartesian SENSE and k-t SENSE were presented by Hansen et al. in Ref. (29). They focused on the inversion problems of SENSE reconstruction and solved them for each set of aliased pixels in image space or x-f space, since these inversions are generally the most time-consuming steps in SENSE and SENSE-derived reconstructions. Sørensen et al. presented a GPU-based reconstruction algorithm that enables real-time reconstruction of sensitivity-encoded non-Cartesian radial imaging (i.e., radial SENSE) (30); they claimed their algorithm could be used in real-time reconstruction applications because a moving buffer scheme bridges the interval between data acquisition and image display. In addition, Sørensen et al. further described a real-time iterative SENSE GPU-based reconstruction that reduces the reconstruction time of isotropic whole-heart imaging, an important protocol for simplifying cardiac MRI (31); they showed that 3D datasets (256 slices) could be reconstructed in 5-6 minutes. As an important PMRI method, GRAPPA has also been implemented on GPUs. For example, Saybasili et al. presented an automatically distributed, hybrid (multi-node, multi-GPU), low-latency through-time radial GRAPPA reconstruction pipeline in Ref. (32). They proposed a combined CPU- and GPU-based computation framework that uses multi-threaded CPU and GPU programming on multiple nodes (32). In their implementation, the master node forwards raw data to each node for partial processing, because GRAPPA generally requires all coil data to reconstruct each coil separately; each node distributes its task to its local GPUs and sends its partial results back to the master node after reconstruction, after which all image results are combined and sent to the scanner for display. They reported that their implementation could achieve, for 32 coils, a 42-ms acquisition time and an 11.2-ms reconstruction time for under-sampled radial datasets, and that the method could be applied to more challenging reconstruction scenarios with larger numbers of acquisition coils, higher acceleration rates, or more GPUs.
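For reference, the image-space (SENSE-type) unfolding that the implementations above accelerate can be written per group of aliased pixels in the standard form of Ref. (52); the notation here is generic rather than taken from any single GPU paper. For an R-fold undersampled Cartesian acquisition with N_c coils, each vector a of aliased pixel values (one entry per coil) is unfolded by solving

\mathbf{a} = \mathbf{E}\,\boldsymbol{\rho}, \qquad E_{c,r} = S_c(\mathbf{x}_r), \qquad \hat{\boldsymbol{\rho}} = \left(\mathbf{E}^{H}\boldsymbol{\Psi}^{-1}\mathbf{E}\right)^{-1}\mathbf{E}^{H}\boldsymbol{\Psi}^{-1}\,\mathbf{a},

where S_c(x_r) is the sensitivity of coil c at the r-th of the R superimposed pixel locations and \Psi is the receiver noise covariance. Each of these small N_c x R systems is independent of the others, which is exactly why assigning one GPU thread or thread block per aliased pixel set parallelizes so well.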
Along the same GRAPPA line, Inam et al. proposed an acceleration method for self-calibrating GRAPPA operator gridding that uses the massively parallel architecture of GPUs (33). LUTs were used to pre-calculate all possible combinations of gridding weights and to avoid race conditions among the CUDA kernel threads: they first used LUT-based optimized CUDA kernels to pre-calculate all possible combinations of 2D gridding weight sets, and then applied the appropriate weight sets to shift the radial acquisitions onto the nearest Cartesian grid locations. They claimed that their GPU-based method typically achieves a 6x to 30x speed-up without compromising image quality.

The methods above mainly attempt to transfer classical CPU-based PMRI methods onto GPUs. However, PMRI algorithms for parallel GPUs should really be re-designed according to the GPUs' features. For example, Li et al. implemented a GPU-accelerated algebraic reconstruction technique (ART) in Ref. (34) to recover images from radial cardiac cine acquisitions. They mainly compared the cine images reconstructed by their radial ART method with filtered back-projection at multiple under-sampling levels. Their results showed that GPU-accelerated ART obtains results comparable with conjugate gradient SENSE in parallel radial MR imaging, while also reducing artifacts and maintaining image sharpness compared with general filtered back-projection methods. In fact, the classical PMRI methods are not costly iterative methods, so GPU-based versions of classical PMRI reconstruction bring no huge improvement unless the scenario is extreme. However, for nonlinear problems or complex iterative reconstructions in PMRI applications, GPU-based reconstruction methods can bring large improvements if they are designed according to the structural features of GPUs.

CS

Recently, CS has been studied and applied in MR imaging to solve a minimization problem in MRI reconstruction (54). Because the NVIDIA CUDA library has become better and better at supporting GPU computing, complex sparse reconstruction methods are now more easily implemented on GPUs without worrying about the hardware constraints of GPUs. There are a number of recent papers studying CS MR reconstruction methods on GPU architectures (35-42). For example, Zhuo et al. presented a GPU-accelerated regularized reconstruction method, incorporating a quadratic regularization term, with compensation for susceptibility-induced field inhomogeneity effects (35). In their experiments, they realized GPU-based spatial regularization with sparse matrices so that the entire procedure could be performed on GPUs, avoiding the memory bandwidth bottlenecks associated with frequent communication between GPUs and CPUs. With the popularity of CS, many studies have applied GPU-accelerated computing to fast CS MRI reconstruction, which seems ideally suited to GPU acceleration (54). For instance, Smith et al. presented a GPU-accelerated split Bregman solver to accelerate 2D CS reconstruction in Ref. (36). They demonstrated that the combination of the split Bregman method and GPU computing achieves the rapid convergence and massively parallel computation needed for real-time CS reconstruction of small-to-moderate size images; their GPU-accelerated iterative reconstruction could reconstruct two-dimensional 1,024^2 data matrices with a speed-up factor of up to 27, in about 0.3 seconds or less, even with limited available GPU VRAM. Nam et al. proposed a parallelized GPU-accelerated implementation of an iterative CS reconstruction algorithm for 3D radial data acquisitions, evaluated on both phantom and in vivo whole-heart coronary data sets (37). To reduce the time-consuming gridding and regridding operations, these operations were performed in a parallel manner for every measured radial point, which is well suited to a CUDA implementation. Compared with a general CPU implementation, their GPU-implemented CS reconstruction improved image quality in terms of vessel sharpness, suppressed noise-like artifacts, and reduced the running time of the CS reconstruction by a factor of 34.3-53.9 relative to a CPU-based C/C++ implementation. In addition, Chang et al. presented an efficient GPU-based method for CS-MRI reconstruction of 3D multichannel data (38). They built a highly parallelized framework that computes the CS-MRI reconstructions of multiple channels of a 3D-CS acquisition simultaneously. Results on simulated and in vivo data showed that the proposed method can shorten the reconstruction run time by a factor of 30; in some clinical applications, the 3D multi-slice CS reconstruction could even be performed in less than 1 second.

There are also other papers studying CS MR reconstruction methods on several parallel architectures of multi-core CPUs and multi-core GPUs (39-41), and they demonstrate the huge potential speed-up on these architectures.
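Before turning to these multi-architecture comparisons, it is worth recalling the optimization problem that all of the above CS implementations target. Following the formulation of Ref. (54), with F_u the undersampled Fourier encoding operator, y the acquired k-space data and \Psi a sparsifying transform, the reconstruction solves

\hat{x} = \arg\min_{x} \; \tfrac{1}{2}\,\lVert F_u x - y \rVert_2^2 + \lambda\,\lVert \Psi x \rVert_1 ,

and the repeated applications of F_u, its adjoint and \Psi inside the iterative solvers (conjugate gradient, split Bregman, ADMM, etc.) are the operations that map most naturally onto GPU hardware.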
Among these, Kim et al. investigated an inexact quasi-Newton CS reconstruction algorithm on several parallel processing architectures, including CPUs, GPUs and Intel's Many Integrated Core (MIC) architecture (39). Among their experiments (39), the reference implementation on a 4-core Core i7 was able to reconstruct a 256x160x80, 8-channel, 10x under-sampled data set of the neurovasculature in 56 seconds; the reconstruction time was reduced further to 32 seconds on a 6-core Core i7; the CUDA-based implementation reduced the reconstruction time to 16 seconds on an NVIDIA GTX480; and the time dropped to 12 seconds on Intel's Knights Ferry (KNF) of the MIC architecture. All these experiments showed that their CS algorithm can draw huge benefits from such throughput-oriented architectures. Apart from that, Sabbagh et al. studied how to accelerate the non-linear CS reconstruction problem in cardiac MRI, solved by iterative optimization algorithms, in order to facilitate the migration of CS reconstruction into clinical applications (40). Their experiments used 3D steady-state free precession MRI images from five patients and compared the speed and image quality of the reconstruction on different parallel platforms, namely CPU, CPU with OpenMP, and GPU. Their results showed that the mean reconstruction time was 13.1+/-3.8 minutes on the CPU platform, 11.6+/-3.6 minutes on the CPU platform with OpenMP, and 2.5+/-0.3 minutes on the CPU platform with OpenMP plus GPU (40), while the image quality, estimated by image subtraction, was very similar and comparable across the different parallel architectures. The modern cloud-computing concept has also been applied to time-consuming MR reconstructions; cloud computing generally needs to support most modern parallel architectures, and GPUs are one of them. For example, Xue et al. utilized the open-source Gadgetron framework to support distributed computing for image reconstruction and demonstrated a multi-node version of the Gadgetron that provides nonlinear image reconstruction with clinically acceptable latency (41). Their framework is a cloud-enabled version of the Gadgetron running on three different distributed computing platforms, ranging from a heterogeneous collection of commodity computers to the commercial Amazon Elastic Compute Cloud (41); they claimed it could provide nonlinear CS reconstructions of cardiac and neuro imaging applications with low reconstruction latency. Besides, Zhuo et al. proposed a GPU-implemented reconstruction algorithm with MR field inhomogeneity compensation, in which magnetic field maps and their gradients are calculated for iterative conjugate gradient (CG) reconstruction algorithms on NVIDIA CUDA-enabled GPUs (42). Compared with their CPU-based implementations, their GPU-based implementations hugely reduced the calculation time while still guaranteeing acceptable accuracy in compensating the MR field inhomogeneity.

Deep learning (DL)

Recent developments of DL in NNs have brought breakthrough improvements in many areas (55-58). Because of the time-consuming training of multi-layer NNs, GPUs are very well suited to the massive calculation problems of DL (55). Although there have been several attempts at creating fast NN-specific hardware, GPUs have provided a genuinely cheap way to implement DL in many applications; they can be employed not only for fast matrix and vector multiplications but also for NN training, speeding up DL by a factor of 50 or more (55). Currently, GPU-based DL has started to be applied in several MR applications (43-46), described below, to solve the problems of MRI reconstruction.

Wang et al. first proposed a DL method to accelerate MR reconstruction (43). They built a big dataset of existing high-quality images and trained, off-line, a 3-layer convolutional neural network (CNN) as the complex mapping between MR images reconstructed from zero-filled and from fully-sampled k-space data. The trained network then provides a prediction for the under-sampled data when an online constrained reconstruction problem is solved. Although the off-line training can take roughly 3 days, each online GPU-based reconstruction took less than 1 second. The in vivo results illustrated that the proposed method can restore fine details and has great potential for effective MR imaging.

Hammernik et al. presented an efficient approach to learn a variational network which can remove typical under-sampling artifacts and restore important image details, such as the natural appearance of anatomical structures (44). They considered their trained models to be highly efficient and, because of their structural simplicity, also well suited for parallel computation on GPUs. Their approach achieved superior results compared with many commonly used reconstruction methods.
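One common way to fold such an off-line trained network into the on-line reconstruction, in the spirit of the CNN prior of Wang et al. (43) (this is a generic, illustrative formulation; the exact objective used in that work may differ), is to add the network output as a regularization term:

\hat{x} = \arg\min_{x}\; \lVert F_u x - y \rVert_2^2 + \lambda\,\lVert x - f_{\mathrm{CNN}}(x_u;\theta) \rVert_2^2 ,

where x_u denotes the zero-filled reconstruction of the under-sampled data and f_CNN(.;theta) is the trained network; the data-consistency term keeps the solution faithful to the measured k-space while the learned prior fills in the missing detail.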
Lee et al. proposed a novel deep residual learning algorithm to recover images from highly under-sampled k-space data (45). They formulated the traditional CS MR problem as a residual regression problem and designed a deep CNN to learn the aliasing artifacts. They trained the NN on the magnitude of MR images with a stochastic gradient descent method with momentum, based on the MatConvNet toolbox (59) and NVIDIA GTX 1080 GPUs. They reported that, once the deep CNN has been trained, their algorithm takes only about 30 ms, with much better reconstruction performance compared with many existing GRAPPA and CS algorithms.

Zhu et al. proposed an automated, robust NN as a generalized reconstruction framework which exploits the universal function approximation of multi-layer perceptron regression and the manifold learning properties demonstrated by auto-encoders (46). They implemented a unified reconstruction framework with a deep neural network (DNN) feed-forward architecture composed of fully-connected layers followed by a sparse convolutional auto-encoder. The NN parameters were trained to minimize a squared loss and updated by stochastic gradient descent, computed with the TensorFlow toolbox (60) on 2 NVIDIA Tesla P100 GPUs. Their results generalize over a wide range of acquisition strategies and show excellent immunity to noise and artifacts.

With the fast development of GPUs and of DL in NNs, an exciting epoch of MRI reconstruction has started. Although it is still too early to say that DL reconstruction approaches will replace currently used clinical methods, their development has illustrated a huge potential to promote the technological development of MRI reconstruction and to change this community.

Conclusions

Besides the GPU-accelerated MR reconstructions above, there are also a few related applications of less classical reconstructions that have been attempted as GPU implementations. For example, Johnson et al. proposed a GPU-based iterative decomposition of water and fat with echo asymmetry and least-squares (IDEAL) reconstruction scheme (61). They estimated the fat-water parameters and compared Brent's method with a golden section search for optimizing the unknown MR field inhomogeneity parameter (psi) in the IDEAL equations. They claimed that their algorithm was made more robust to fat-water ambiguities using a modified planar extrapolation of psi method (61), and their experiments showed that the fat-water reconstruction time of their GPU implementation was quickly and robustly reduced by a factor of 11.6 on a GPU in comparison with a CPU-based reconstruction.

Nowadays, the GPU has become one of the standard tools in high-performance computing (2). More and more GPUs are being applied in more and more applications because of their parallel computing ability and low cost. Among them, GPU-based applications of MRI reconstruction have been gradually recognized and widely adopted. Although early GPU programming was constrained and unfriendly, the development of GPU programming has provided ever more easy-to-use libraries and frameworks for programmers, and GPUs have played increasingly important roles in medical imaging, image reconstruction and image analysis in clinical applications. Despite many successful applications in the GPU-based medical imaging reconstruction community, some long-standing problems remain unsolved.

Firstly, GPUs' parallel architectures require re-designing the pipeline of the reconstruction algorithms. Although there are many libraries that help people employ GPUs, algorithms whose parallel structure is optimized before GPU programming begins can still bring larger improvements than any easy-to-use library; it is better to consider parallel structures in any custom-designed algorithm intended for GPU computing. In addition, hybrid architectures that combine GPU computing with traditional x86 CPU-based high-performance computing clusters are more and more popular, and cloud computing has appeared in industry. While software and hardware trends are not the primary problems of medical image computing, the ability to efficiently employ more sophisticated algorithms as faster technology emerges is still an important driving force, largely precluding any kind of convergence in algorithms (47).

In the future, the computing efficiency of custom-designed optimized algorithms, especially for MRI reconstruction based on GPUs and DL, should be considered jointly across their sequential and parallel parts, and low-cost Internet computation and storage services should also be seriously considered.

Acknowledgements

Funding: This work was partially supported by the National Natural Science Foundation of China (No. 61471350, 81729003), the Basic Research Program of Shenzhen (JCYJ20150831154213680), and the Key Laboratory for Magnetic Resonance and Multimodality Imaging of Guangdong Province (2014B030301013).

Footnote

Conflicts of Interest: The authors have no conflicts of interest to declare.
30. Sørensen TS, Atkinson D, Schaeffter T, Hansen MS. Real-time reconstruction of sensitivity encoded radial magnetic resonance imaging using a graphics processing unit. IEEE Trans Med Imaging 2009;28:1974-85.
31. Sørensen TS, Prieto C, Atkinson D, Hansen MS, Schaeffter T. GPU accelerated iterative SENSE reconstruction of radial phase encoded whole-heart MRI. Proc. ISMRM, Stockholm, Sweden, 2010:2869.
32. Saybasili H, Herzka DA, Barkauskas K, Seiberlich N, Griswold MA. A generic, multi-node, multi-GPU reconstruction framework for online, real-time, low-latency MRI. Proc. 21st Meet Int Soc Magn Reson Med, Salt Lake City, Utah, USA, 2013:838.
33. Inam O, Qureshi M, Malik SA, Omer H. GPU-accelerated self-calibrating GRAPPA operator gridding for rapid reconstruction of non-Cartesian MRI data. Applied Magnetic Resonance 2017;48:1055-74.
34. Li S, Chan C, Stockmann JP, Tagare H, Adluru G, Tam LK, Galiana G, Constable RT, Kozerke S, Peters DC. Algebraic reconstruction technique for parallel imaging reconstruction of undersampled radial data: application to cardiac cine. Magn Reson Med 2015;73:1643-53.
35. Zhuo Y, Sutton B, Wu XL, Haldar J, Hwu WM, Liang ZP. Sparse regularization in MRI iterative reconstruction using GPUs. Proc. International Conference on Biomedical Engineering and Informatics (BMEI), 2010:578-82.
36. Smith D, Gore J, Yankeelov T, Welch E. Real-time compressive sensing MRI reconstruction using GPU computing and split Bregman methods. International Journal of Biomedical Imaging 2012:864827.
37. Nam S, Akçakaya M, Basha T, Stehning C, Manning WJ, Tarokh V, Nezafat R. Compressed sensing reconstruction for whole-heart imaging with 3D radial trajectories: a graphics processing unit implementation. Magn Reson Med 2013;69:91-102.
38. Chang CH, Yu X, Ji JX. Compressed sensing MRI reconstruction from 3D multichannel data using GPUs. Magn Reson Med 2017;78:2265-74.
39. Kim D, Trzasko JD, Smelyanskiy M, Haider CR, Manduca A, Dubey P. High-performance 3D compressive sensing MRI reconstruction. Conf Proc IEEE Eng Med Biol Soc 2010;2010:3321-4.
40. Sabbagh M, Uecker M, Powell A, Leeser M, Moghari M. Cardiac MRI compressed sensing image reconstruction with a graphics processing unit. Proc. International Symposium on Medical Information and Communication Technology (ISMICT), Worcester, MA, 2016.
41. Xue H, Inati S, Sørensen TS, Kellman P, Hansen MS. Distributed MRI reconstruction using Gadgetron-based cloud computing. Magn Reson Med 2015;73:1015-25.
42. Zhuo Y, Wu XL, Haldar JP, Hwu WM, Liang ZP, Sutton BP. Accelerating iterative field-compensated MR image reconstruction on GPUs. Proc. IEEE International Symposium on Biomedical Imaging (ISBI), 2010:820-3.
43. Wang S, Su Z, Ying L, Peng X, Zhu S, Liang F, Feng D, Liang D. Accelerating magnetic resonance imaging via deep learning. Proc. IEEE International Symposium on Biomedical Imaging (ISBI), 2016:514-7.
44. Hammernik K, Knoll F, Sodickson D, Pock T. Learning a variational model for compressed sensing MRI reconstruction. Proc. the International Society of Magnetic Resonance in Medicine (ISMRM), 2016.
45. Lee D, Yoo J, Ye JC. Deep residual learning for compressed sensing MRI. Proc. IEEE International Symposium on Biomedical Imaging (ISBI), 2017.
46. Zhu B, Liu JZ, Rosen BR, Rosen MS. Neural network MR image reconstruction with AUTOMAP: automated transform by manifold approximation. Proc. the International Society of Magnetic Resonance in Medicine (ISMRM), 2017.
47. Eklund A, Dufort P, Forsberg D, LaConte SM. Medical image processing on the GPU - past, present and future. Med Image Anal 2013;17:1073-94.
48. Fessler J, Sutton B. Nonuniform fast Fourier transforms using min-max interpolation. IEEE Trans Signal Process 2003;51:560-74.
49. cuFFT User Guide. Available online: http://docs.nvidia.com/cuda/cufft/index.html
50. Blaimer M, Breuer F, Mueller M, Heidemann RM, Griswold MA, Jakob PM. SMASH, SENSE, PILS, GRAPPA: how to choose the optimal method. Top Magn Reson Imaging 2004;15:223-36.
51. Sodickson DK, Manning WJ. Simultaneous acquisition of spatial harmonics (SMASH): fast imaging with radiofrequency coil arrays. Magn Reson Med 1997;38:591-603.
52. Pruessmann KP, Weiger M, Scheidegger MB, Boesiger P. SENSE: sensitivity encoding for fast MRI. Magn Reson Med 1999;42:952-62.
53. Griswold MA, Jakob PM, Heidemann RM, Nittka M, Jellus V, Wang J, Kiefer B, Haase A. Generalized autocalibrating partially parallel acquisitions (GRAPPA). Magn Reson Med 2002;47:1202-10.
54. Lustig M, Donoho D, Pauly JM. Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn Reson Med 2007;58:1182-95.
55. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436-44.
56. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2015;61:85-117.
57. Knoll F. Leveraging the potential of neural networks for image reconstruction. Proc. the International Society of Magnetic Resonance in Medicine (ISMRM), 2017.
58. Després P, Jia X. A review of GPU-based medical image reconstruction. Phys Med 2017;42:76-92.
59. Vedaldi A, Lenc K. MatConvNet: convolutional neural networks for MATLAB. Proc. of the 23rd ACM International Conference on Multimedia, 2015:689-92.
60. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
61. Johnson DH, Narayan S, Flask CA, Wilson DL. Improved fat-water reconstruction algorithm with graphics hardware acceleration. J Magn Reson Imaging 2010;31:457-65.