Accelerating XD-GRASP MR Image Reconstruction

Author: Thomas Yung (ty516, CID: 01203300)
Supervisors: Prof. Wayne Luk, Dr. Andreas Wetscherek, Marco Barbone
Acknowledgments
I give my thanks to my supervisors Wayne Luk (ICL) and Andreas
Wetscherek (ICR) for their support and guidance through the project.
Their knowledge and expertise has been invaluable.
I would also like to thank my close friends and family for their continuous
support and belief in me. At times it felt like you carried me through.
Lastly, I thank my sister, Louisa, and Nana, Diane, who have stuck by me
through the best and worst of times and are the people I treasure most in
my life.
Contents
1 Introduction 2
2 Background 5
2.1 MRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Non-Uniform Fast Fourier Transform (NUFFT) . . . . 8
2.3 Supporting Mathematical Processes . . . . . . . . . . . . . . . 9
2.3.1 L2-Norm . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Gradient Descent . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Conjugate Gradient Method . . . . . . . . . . . . . . . 10
2.3.4 Backtracking Line Search . . . . . . . . . . . . . . . . 11
2.4 XD-GRASP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Image Similarity Measures . . . . . . . . . . . . . . . . . . . . 14
2.5.1 Mean Squared Error . . . . . . . . . . . . . . . . . . . 14
2.5.2 Structural Similarity . . . . . . . . . . . . . . . . . . . 15
2.6 Field Programmable Gate Arrays . . . . . . . . . . . . . . . . 16
2.6.1 Components . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.2 Optimisation Targets . . . . . . . . . . . . . . . . . . . 18
2.6.3 Optimisation Strategies . . . . . . . . . . . . . . . . . . 18
2.6.4 Xilinx® VU9P . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Optimisation 30
4.1 Data Precision . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Maximising CPU Utilisation . . . . . . . . . . . . . . . . . . . 30
4.3 Explicit Actions . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Flatten, Merge, Repeat . . . . . . . . . . . . . . . . . . . . . . 32
4.5 Data Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.6 Replicating Hardware . . . . . . . . . . . . . . . . . . . . . . . 35
4.7 Optimal Parameters for GPU . . . . . . . . . . . . . . . . . . 36
5 Implementation 39
5.1 Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Coordination . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 Kernel Breakdown . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3.1 CombineAcrossCoilsKernel . . . . . . . . . . . . . . . 41
5.3.2 TransposedMultiplicationKernel . . . . . . . . . . . 43
5.4 Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Performance Evaluation 49
6.1 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2 Image Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.3 Performance Improvements . . . . . . . . . . . . . . . . . . . . 53
6.3.1 Amdahl’s Law . . . . . . . . . . . . . . . . . . . . . . 54
6.4 Kernel Performance . . . . . . . . . . . . . . . . . . . . . . . . 56
6.5 FPGA Utilisation . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.6 GPU Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.7 Bridging the Gap . . . . . . . . . . . . . . . . . . . . . . . . . 59
Appendices 62
1 Introduction
Britons once voted cancer as being the most feared disease [1], with statistics
showing that ‘every 2 minutes’ a person in the UK is diagnosed with a form
of cancer [2]. The need for a fast, efficient way of treating cancer has never
been more important.
In radiotherapy, a cancer treatment in which doses of radiation are tar-
geted at an area of cancerous cells in order to kill them, physicists and
dosimetrists plan a radiation dose using a three-dimensional image of the
patient’s interior, which reveals the size, shape and location of the tumour.
However, during the delay between scanning the patient and executing the
plan, which is typically in the order of weeks, the tumour may grow and the
anatomy of the patient may change (for example, due to weight loss). Doses
are therefore planned on the side of safety: they are weaker and, consequently,
multiple doses are given throughout a therapy course.
In order to address the problems presented by the time delay, adaptive
radiotherapy aims to provide a real-time dose recommendation, requiring cal-
culations, in this case the image reconstruction, to complete fast enough for
the suggestion to be applied whilst the patient is in treatment. Hence, the
objective of this project is to accelerate MR image reconstruction and to
learn whether it is possible for MRI reconstruction to execute fast enough
for adaptive radiotherapy.
For the patient, alongside speed, the comfort of the treatment is also
important. Physical movement in the patient, from breathing, causes re-
constructed MR images to be blurred when a time-indifferent reconstruction
algorithm is used and, whilst the patient could be instructed to hold their
breath, this is not always a practical or possible solution. In addition, a
scan taken during a breath-hold produces images only valid for that specific
breath-hold, meaning the patient would have to be instructed to reproduce
the same breath-hold during the treatment if the images were to be used
reliably – a further impracticality. Physiological motion can also cause vari-
ation in the location of the tumour, motivating the need for MR images with
temporal resolution (an additional dimension to 3D images). XD-GRASP
[3] offers the ability to reconstruct 4D MRI scan images from data obtained
under free-breathing, which allows the patient to be as comfortable as possi-
ble during an emotionally stressful process, making the algorithm a suitable
choice for optimisation.
Thus, this project accelerates XD-GRASP by using an FPGA and a GPU
in order to provide the best treatment possible to the patient. To achieve
this, this report outlines the following work:
• Section 4: Optimisation
– The limits of the hardware used are assessed for 4D MRI recon-
struction
– Hardware requirements for real-time 4D MRI reconstruction are
discussed
(equivalently around 159 minutes or 2.66 hours). This time is an approxima-
tion of the total running time of the existing XD-GRASP implementation, as
the algorithm may take more or less time depending on the image data and
whether or not it requires a larger number of iterations (determined by the
algorithm) to construct clear images. Acceleration of this algorithm should
aim to achieve a similar reconstruction in the order of minutes, to be used in
adaptive radiotherapy.
Ultimately, producing a system capable of executing XD-GRASP in sat-
isfactory time (approaching real-time) for adaptive radiotherapy is the goal.
This project aims for iteratively reconstructed, respiratory-resolved (4D) MR
images in the order of seconds. Once achieved, adaptive radiotherapy may
allow cancer treatment to be completed during fewer treatment sessions and
thus reduce the turmoil endured by the patient.
2 Background
This section introduces the key concepts and algorithms used in present-day
MRI data processing. It begins by outlining the basic principles behind MRI
data collection and transformation to images usable by doctors, physicists
and dosimetrists. Following this, detail is provided on the underlying math-
ematical concepts which help achieve such transformations. The algorithm
to be accelerated, Golden-angle Radial MRI with Reconstruction of Extra
Motion-State Dimensions Using Compressed Sensing (XD-GRASP) [3], is
outlined along with the basic components of FPGAs and the performance
benefits they bring.
2.1 MRI
The invention of magnetic resonance imaging (MRI) allowed doctors to per-
form non-invasive observations of their patients’ internal organs and tissues
[4]. The scanner performs the scan (the mechanics of which go beyond the
scope of this project) to produce data that can be reconstructed into a 3D
image.
An MRI scanner obtains its signal from hydrogen atoms bound in biological
tissue [5]. This signal is measured using receiver coils and reflects the interior
of the patient.
The data from MRI scanners is obtained in the frequency domain, re-
ferred to as k-space. A location in k-space is represented by its distance
from the origin and its angle (anticlockwise from the positive real axis).
The intensity found at a location in k-space reflects the corresponding
wave’s (in image space) contribution to the image [6] (see Figure 1).
Figure 1: A location in k-space with distance d from the origin and angle θ.
Figure 2: Image space realisations of one data location with amplitude d from the origin
(a) and with amplitude 2d from the origin (b).
The intensities of the k-space locations are used to perform a weighted
sum of the waves, producing an image, as can be seen in Figure 3.
Figure 3: Corresponding representations of the same MRI data in image space (a) and
k-space (b).
execute non-uniform Fast Fourier Transform, a variant of the aforementioned
Fourier Transform.
• Type 1:

    f[k₁, k₂] = Σ_{j=0}^{N−1} c[j] · e^{i(k₁·x[j] + k₂·y[j])}        (2)

• Type 2:

    c[j] = Σ_{k₁,k₂} f[k₁, k₂] · e^{−i(k₁·x[j] + k₂·y[j])}        (3)
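As a sketch of what Equation 3 computes, the type-2 transform can be evaluated directly as a non-uniform discrete Fourier transform. The NUFFT libraries used in this project approximate this sum far more efficiently; the centred grid-index convention below is an assumption for illustration only.

```cpp
#include <complex>
#include <vector>
#include <cmath>

// Direct evaluation of the type-2 transform (a NUDFT): for each
// non-uniform point (x[j], y[j]), sum the uniform-grid coefficients
// f[k1][k2] weighted by e^{-i(k1*x + k2*y)}. This is O(N^2 * M);
// NUFFT libraries approximate it far faster.
// Grid indices k1, k2 run over -N/2 .. N/2-1 (an assumed convention).
std::vector<std::complex<double>> nudft_type2(
    const std::vector<std::vector<std::complex<double>>>& f,  // N x N grid
    const std::vector<double>& x, const std::vector<double>& y) {
  const int N = static_cast<int>(f.size());
  std::vector<std::complex<double>> c(x.size());
  for (std::size_t j = 0; j < x.size(); ++j) {
    std::complex<double> acc(0.0, 0.0);
    for (int i1 = 0; i1 < N; ++i1) {
      for (int i2 = 0; i2 < N; ++i2) {
        const double k1 = i1 - N / 2, k2 = i2 - N / 2;
        const double phase = -(k1 * x[j] + k2 * y[j]);
        acc += f[i1][i2] * std::complex<double>(std::cos(phase), std::sin(phase));
      }
    }
    c[j] = acc;
  }
  return c;
}
```

With only the k₁ = k₂ = 0 coefficient set, every non-uniform sample equals that coefficient, which gives a quick sanity check of the sign and index conventions.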
2.3 Supporting Mathematical Processes
This section introduces mathematical processes which will be of use and
describes their role in XD-GRASP.
2.3.1 L2-Norm
The L2-Norm [15] provides a reduction of a complex vector x to a single
value based upon the norms of its coefficients:
    ‖x‖₂ = √( Σ_{k=1}^{n} |xₖ|² )        (4)
The L2-Norm is used in XD-GRASP for the objective function during the
conjugate gradient method (Section 2.3.3) and in the decision to early-return
from the conjugate gradient method.
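As a minimal sketch, Equation 4 maps to a single reduction over squared magnitudes (std::norm returns |x|² for a std::complex):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// L2-norm of a complex vector, as in Equation (4): the square root of
// the sum of squared magnitudes of its coefficients.
double l2_norm(const std::vector<std::complex<double>>& x) {
  double sum = 0.0;
  for (const auto& xk : x) sum += std::norm(xk);  // std::norm(z) = |z|^2
  return std::sqrt(sum);
}
```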
Figure 4: (a) Contour view. (b) Surface view.
When the initial guess is in a region with relatively neutral gradient (for
example as in Figure 4), gradient descent is slow to converge on a minimum.
However, since this drawback is data-dependent, it is not possible to gener-
alise about the number of iterations to limit the algorithm to. In addition,
gradient descent only indicates the direction of a parameter update – not
its magnitude. For this reason gradient descent is used in conjunction
with other algorithms to determine the magnitude of the parameter update
(see Section 2.3.4).
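The basic update described above can be sketched as follows. The fixed step size and iteration cap here are assumed, illustrative values; as noted, XD-GRASP instead determines the update magnitude with a line search (Section 2.3.4).

```cpp
#include <functional>
#include <vector>

// A minimal gradient descent sketch: repeatedly step against the
// gradient with a fixed step size. The step size and iteration cap are
// assumed values, not taken from the XD-GRASP implementation.
std::vector<double> gradient_descent(
    const std::function<std::vector<double>(const std::vector<double>&)>& grad,
    std::vector<double> x, double step = 0.1, int max_iters = 100) {
  for (int it = 0; it < max_iters; ++it) {
    const auto g = grad(x);
    for (std::size_t i = 0; i < x.size(); ++i) x[i] -= step * g[i];
  }
  return x;
}
```

For f(x) = x², where ∇f(x) = 2x, the iterates shrink geometrically towards the minimum at the origin.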
Figure 5: The conjugate gradient method (red) is more direct than gradient descent (green)
in finding a minimum.
Where:

    m = ∇f(x)ᵀ p
The key benefit that backtracking line search brings is alleviating the load
on computational resources. When searching for the optimal parameters with
respect to a cost function, backtracking line search avoids a large number of
small steps by beginning with larger steps and then reducing the step size.
The cost function for this project is given in Section 2.4, Equation 6, as the
task at hand is posed as an optimisation problem.
As with gradient descent and the non-linear conjugate gradient method,
this algorithm may be terminated early by constraining it to a maximum
number of iterations.
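The shrinking-step behaviour described above can be sketched with the Armijo sufficient-decrease condition. The control parameters alpha and beta, the starting step t0, and the iteration cap are assumed values, not those of the XD-GRASP implementation.

```cpp
#include <functional>

// A minimal backtracking line search sketch (Armijo condition).
// f_along(t) evaluates the cost at x + t*p; f_x is f(x); m is the
// directional derivative grad f(x)^T p (negative for a descent
// direction). alpha, beta and t0 are assumed control parameters.
double backtracking_line_search(
    const std::function<double(double)>& f_along,
    double f_x, double m,
    double t0 = 1.0, double alpha = 0.5, double beta = 0.5,
    int max_iters = 50) {
  double t = t0;
  // Start with a large step and shrink it until the sufficient-decrease
  // condition holds, or the iteration cap (an early-termination bound,
  // as described in the text) is reached.
  for (int i = 0; i < max_iters && f_along(t) > f_x + alpha * t * m; ++i)
    t *= beta;
  return t;
}
```

For f(x) = x² at x = 1 with search direction p = −3 (so m = −6), the search halves the step twice before the Armijo condition holds, returning t = 0.25.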
2.4 XD-GRASP
Feng et al. proposed the use of golden-angle radial sampling, compressed
sensing (GRASP) [21] (a technique for reconstructing a signal from fewer
samples), and parallel imaging¹ to produce an algorithm capable of processing
an MRI signal into an image under free-breathing.
Building on the algorithm proposed by iGRASP [23], Feng et al. refined
their approach further by exploiting sparsity along other eXtra Dimensions
(XD), such as the total difference/variation between different phases of the
respiratory cycle [3].
Data is obtained as radial samples, or ‘spokes’ – vectors along which mea-
surements are taken – on a stream of k-space data in three dimensions (2D
radial samples are stacked at even intervals in the z-dimension). Consequently
each spoke corresponds to a different time. In order to group these spokes
by the stage of the respiratory cycle in which they were taken, XD-GRASP
includes a pre-processing step, in which a respiratory signal is obtained from
the observed k-space data.
As per the golden-angle sampling scheme, the spokes intersect where kx =
ky = 0 (k-space coordinates) – this point is the central profile of the spoke
(Figure 6). The central profiles are taken for every spoke at each z-dimension
interval and used to determine the respiratory motion signal using principal
component analysis [24].
¹ Using multiple receiver coils in the MRI scan [22].
Figure 6: Golden-angle sampling of a 2D area. The placement of each spoke depends on
the previous spoke, with the new orientation being that of the previous spoke plus 111.25°.
Samples are taken along a spoke.
With this signal, a sample spoke can be categorised by the stage of res-
piratory motion it was taken at, thus grouping the data. Here, sorting is
performed according to the phase of the respiratory cycle instead of the time
of acquisition. As seen in Figure 7, the goal is to group the spokes that were
sampled in the same respiratory phase – denoted by the coloured dashed
boxes.
Figure 7: Respiratory motion over time with the amplitude gating denoted by the dashed
boxes.
XD-GRASP then aims to find d which minimises the following:

    d = arg min_d ‖F · C · d − m‖₂² + λ ‖S · d‖₁        (6)

Where d is the image series (one image reconstructed from each temporal
frame), S is the sparsifying transform applied along the extra respiratory-
state dimension, m is the k-space data, C is the coil sensitivities and F is the
non-uniform Fourier Transform operator (Section 2.2.1). λ is a search
parameter which controls the weighting given to the sparsifying transform.
As before, the conjugate gradient method (Section 2.3.3) is used to achieve
this minimisation.
Note that in the paper [3], the minimisation function stated is for the
cardiac imaging use case and thus includes a further sparsifying transform
in the cardiac motion dimension, which is not the focus of this project and
therefore omitted.
(a) nMSE = 0 (b) nMSE = 0.0151 (c) nMSE = 0.0056
Figure 8: Normalised mean squared error, rounded to four decimal places, ranks image
(c) (a pixelated version of the reference image (a)) above (b) (a brighter version of the
reference image).
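The normalised MSE used in the figure above can be sketched as follows. The normalisation chosen here – dividing the MSE by the mean squared intensity of the reference image – is one common convention and is an assumption; the report's exact definition may differ.

```cpp
#include <vector>

// Normalised mean squared error between a reconstruction b and a
// reference a. The normalisation (dividing by the reference's mean
// squared intensity) is an assumed convention, not taken verbatim
// from the report.
double nmse(const std::vector<double>& a, const std::vector<double>& b) {
  double err = 0.0, ref = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double d = a[i] - b[i];
    err += d * d;        // accumulated squared error
    ref += a[i] * a[i];  // accumulated reference energy
  }
  return err / ref;
}
```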
    Q = (4 · σ_ab · ā · b̄) / ((σ_a² + σ_b²) · ((ā)² + (b̄)²))        (10)

Where:

    ā = (1/N) Σ_{i=1}^{N} aᵢ ,    b̄ = (1/N) Σ_{i=1}^{N} bᵢ

    σ_a² = (1/(N−1)) Σ_{i=1}^{N} (aᵢ − ā)² ,    σ_b² = (1/(N−1)) Σ_{i=1}^{N} (bᵢ − b̄)²

    σ_ab = (1/(N−1)) Σ_{i=1}^{N} (aᵢ − ā)(bᵢ − b̄)
In images the output of this quality index (and likewise the relationship
between pixels) is space-variant – pixels in a similar region are more likely
to be similar. Thus, in order to apply this metric to an image, rather than
processing the whole image in one calculation, it is more appropriate to
obtain local quality indexes through a sliding window, producing M local
quality indexes. The overall quality index is an average of the local indexes:
    Q = (1/M) Σ_{j=1}^{M} Qⱼ        (11)
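Equations 10 and 11 can be sketched as follows. This computes the single-window quality index of Equation 10; a sliding-window version would average it over M local windows as in Equation 11.

```cpp
#include <vector>

// Wang's universal quality index (Equation 10) over two equal-length
// pixel vectors a and b. Identical inputs score 1.
double quality_index(const std::vector<double>& a,
                     const std::vector<double>& b) {
  const double N = static_cast<double>(a.size());
  double abar = 0.0, bbar = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) { abar += a[i]; bbar += b[i]; }
  abar /= N; bbar /= N;
  double va = 0.0, vb = 0.0, cov = 0.0;  // variances and covariance
  for (std::size_t i = 0; i < a.size(); ++i) {
    va  += (a[i] - abar) * (a[i] - abar);
    vb  += (b[i] - bbar) * (b[i] - bbar);
    cov += (a[i] - abar) * (b[i] - bbar);
  }
  va /= N - 1.0; vb /= N - 1.0; cov /= N - 1.0;
  return (4.0 * cov * abar * bbar) /
         ((va + vb) * (abar * abar + bbar * bbar));
}
```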
Wang’s findings showed that the quality index ordering of distorted images
was equivalent to the mean ranking of the same images when shown, alongside
the original, to human subjects. Whilst the subjects’ ordering is subjective, it
demonstrates that the quality index can characterise the features humans
consider when analysing images.
For the same images as tested for the mean squared error, the results for
SSIM accurately reflect the similarity between the images:
Figure 9: Structural similarity ranks image (b) above (c) when either is compared to the
reference image (a).
to easy debugging. To this end, Maxeler [29] provide MaxCompiler, a tool
which abstracts hardware details into a high-level description language and,
further, a simulator which allows hardware designs to be validated without
the need for synthesis.
In this project an FPGA is used to accelerate the non-NUFFT parts of
the algorithm, as it allows a custom architecture to be designed which can
take advantage of parallelism in the algorithm design [30], something less
possible on general-purpose hardware such as CPUs and GPUs.
2.6.1 Components
There are four key hardware components on an FPGA, according to Maxeler
terminology.
• DSP DSPs are the hardware for multiplication. They are rectan-
gular multipliers, accepting two inputs of 18 bits and 27 bits on
the VU9P, with pre- and post-adders to allow additions with no fur-
ther computational burden. If multiplication of wider operands is
required, DSPs are used in combination with one another.
2.6.2 Optimisation Targets
When optimising an algorithm for an FPGA, there are two main bottlenecks
to consider: communication and computation.
Accesses to on-board memory are fast, regardless of sequential or random
locations. However, accesses to DRAM are relatively slow. The memory
controller generated by MaxCompiler defines a unit of access to DRAM,
referred to as the ‘burst size’.
In accessing DRAM, attention must be given to the size of data being
accessed (preferably utilising a whole burst at a time) and access pattern
(sequential vs random). Should fewer bits than a burst be required, a full
burst is read and the excess bits are discarded, an inefficiency.
In order to run efficiently, the algorithm should aim to maximise the
utilised hardware resources. In the case of DSPs, for example, if there are
100 DSPs on the board, it is possible to achieve 100 multiplications in one
clock cycle.
2.6.4 Xilinx® VU9P
For this project, the architecture is implemented using MaxCompiler version
2019.2 and Vivado 2018.3. The FPGA device targeted is Maxeler’s MAX5C
Dataflow Engine (DFE). The MAX5C DFE is based on the Xilinx VU9P
14nm/16nm FinFET FPGA, consisting of:
• LUTs: 1,182,240
• FFs: 2,364,480
• BRAMs: 2,160
• DSPs: 6,840
Whilst this work could be used for adaptive radiotherapy, the reconstruc-
tion algorithms implemented do not support radial sampling schemes, such
as that used in XD-GRASP, and further have limited temporal resolution.
3 Design Flow and Performance Modelling
Before optimisations can be made, it is important to identify the areas of
the algorithm which consume the most time and thus give an indication of
the bottlenecks. To this end, a software model and a performance model
were produced to provide a framework for strategies to be evaluated upon,
assisting the planning phase of this project. The model also maximises the
speed-up whilst minimising the code written to achieve said speed-ups, and
is parametric, making it easily extended to other (even future) platforms.
• Implement on the FPGA: The optimised design can be implemented in
MaxJ (Maxeler’s high-level description language) and synthesised onto
the FPGA.
3.2 XD-GRASP
Excluding the pre-processing step of the algorithm, which sorts the data into
respiratory phases, the following steps of XD-GRASP were to be accelerated:
3.4 Modelling the Algorithm
The performance model reflects the dataflow of the algorithm through para-
metric equations and the orchestration of parallel computation on the hard-
ware used.
With an initial C++ implementation, a first performance model could be
produced to model the algorithm steps (steps which would be ported onto
an FPGA). This brought the ability to observe the time cost of different
stages of the algorithm and therefore highlight those which held back the
execution time the most. Once these bottlenecks were identified, the algo-
rithm design was changed, with validation from a modified software model,
to mitigate the bottlenecks.
The optimisations made to the algorithm, for acceleration on an FPGA,
do not necessarily accelerate a CPU implementation. Reducing the num-
ber of multiplications does not map to the same performance increment on
different platforms. Therefore, rather than implementing changes in the soft-
ware model and measuring its execution time, the performance model was
used to predict the performance increase or decrease of changes to both the
algorithm and the parameters of execution (such as the data-type and hard-
ware resources). The software model only checked the correctness of
optimisations. Unlike a C++ implementation, the performance model could
also anticipate how the hardware characteristics impact the execution time,
as the FPGA execution time is predictable and runs for a manually set, fixed
number of clock cycles.
Following the model of a C++ implementation of XD-GRASP, the mod-
elled FPGA kernels (Section 3.5) were incorporated along with their compute
times. Thus the orchestration of the FPGA kernels and NUFFT had to be
emulated in the performance model to arrive at an overall algorithm execu-
tion time estimate, which took into account concurrent execution of the CPU
and FPGA (discussed further in Section 4).
When determining a predicted execution time for the algorithm, bench-
marks for the NUFFT had to be measured for both the CPU and GPU
implementations. These are discussed further in Section 6.1. These times
highlighted the NUFFT as a large bottleneck, with a comparison of the total
NUFFT time (= #invocations × benchmark) and the measured C++ im-
plementation performance revealing that 64.96% of the total execution time
is spent on the NUFFT.
3.4.1 Modelling Example
For example, to model the initial reconstruction, the following algorithm
steps are considered:
Figure 10: A comparison of a naive (a) and optimised (b) implementation to compute the
initial reconstruction.
using:

    T_compute ≈ E / (C · E_c) + Fill Time        (12)
Where E is the number of elements to be processed, Ec is the number
of elements processed per clock cycle and C is the clock frequency, in Hertz.
The fill time, the time taken for the FPGA pipeline to fill with data, is
assumed to be negligible in this calculation. It remains constant for various
input sizes, so becomes negligible as the input size grows.
Since, on an FPGA for this algorithm’s workload, the number of DSPs is
a more constraining resource than LUTs or FFs, the compute time equation,
Equation 12, can be simplified. This calculation can be performed by con-
sidering the number of (effective) multiplications performed during a kernel
execution and the number of DSPs allocated to the kernel (as this reflects
how many multiplications can occur simultaneously in one clock cycle). Sim-
plifying Equation 12 gives:

    T_compute ≈ #Multiplications / (#DSPs Allocated × Clock Frequency)        (13)
This model setup provided insight into which kernels demanded the most
time and hardware resources and further allowed the quick evaluation of
FPGA configurations which allocated fewer or greater resources to particu-
lar kernels. Thus, in planning the design of the algorithm components on
the FPGA, the bottlenecks could be mitigated by re-allocating hardware re-
sources, in most cases DSPs and B/URAM, to them. The effectiveness of
DSP allocation was enabled by the hardware replication optimisation out-
lined in Section 4.
The communication time required for kernel inputs and outputs to be
transferred to and from the FPGA was modelled by dividing the size of
the data by the PCIe 3 data rate (15700 MB/s), as shown in Equation 14.
After the introduction of explicit actions, explained in Section 4.3, this cost
was omitted. The explicit actions API gave finer control, allowing, with
an asynchronous call, the communication cost of one kernel to be overlapped
with the compute time of another kernel invocation. The introduction of
explicit actions meant the execution shifted from being I/O bound to
being compute bound.
    T_communicate = Data Size / Data Transfer Rate = Data Size / 15700        (14)
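The two model terms above can be sketched as a pair of helpers. The figures in the usage below are illustrative placeholders, not measured project values.

```cpp
// Performance-model sketch: compute time from Equation (13) and
// communication time from Equation (14). All inputs are in base units
// (counts, Hz, bytes); example figures are illustrative only.
double compute_time_s(double multiplications, double dsps_allocated,
                      double clock_hz) {
  // Equation (13): multiplications / (DSPs allocated x clock frequency).
  return multiplications / (dsps_allocated * clock_hz);
}

double communicate_time_s(double data_bytes) {
  // Equation (14): data size / PCIe 3 transfer rate (15700 MB/s).
  const double pcie3_rate_bytes_per_s = 15700.0e6;
  return data_bytes / pcie3_rate_bytes_per_s;
}
```

For instance, a kernel performing 2×10⁹ multiplications on 1000 DSPs at 200 MHz would be estimated at 0.01 s of compute time.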
As the model is parametric, it can be used to anticipate the performance
achieved with different hardware. The model characterises the influence that
the FPGA’s hardware components have on the kernel execution times and
therefore can be used to predict the hardware required to achieve a particular
performance.
As with modelling the algorithm, modelling the FPGA usage was itera-
tive, following the optimisations discussed in Section 4.
Figure 11: In sub-figure (a), the NUFFT can only begin after the first output of Kernel A
is produced, at t1. Similarly, Kernel B can only begin after the first output of the NUFFT
has completed, at t2. Sub-figure (b) holds the same constraint for t1; however, Kernel B
cannot begin until all of the tasks for Kernel A have completed (t2), invalidating the
performance model used for (a).
By inspecting each case like this in the performance model, a bound for
the maximum speed-up of the NUFFT that the model is still valid for can be
calculated. To achieve this, consider Figure 11 (a). When speeding up the
NUFFT, or equivalently shortening the arrow for it, there will come a point
where the overall execution time is no longer equal to the time taken for
one invocation of Kernel A, three NUFFT operations and one invocation of
Kernel B. It is at this point that the performance model calculation becomes invalid.
Note that the CPU NUFFT and GPU NUFFT implementations of XD-
GRASP both use the same calculations for overall execution time in the per-
formance model. Therefore, the maximum speed-up of the NUFFT which
the model is still valid for can be mapped to either case. The speed-up bound
for the CPU NUFFT is: 136x. In other words, if the NUFFT operations ac-
celerate by less than 136 times, the performance model will still be valid. The
performance model can be used to identify cases where the time taken for
kernels and the NUFFT is similar, and thus highlight potential bottlenecks
to be addressed once the NUFFT no longer dominates the execution time.
The modelling phase of the project drew focus to the slower parts of the
algorithm, which are constrained by memory or computation demands. Fur-
ther, it provided a platform for speculative improvements and optimisations
to be made.
4 Optimisation
In order to achieve the desired speed-up of the algorithm, the strategies
outlined in this section were enacted. The data precision was reduced in
order to reduce the size of the data and the computation required. The hardware
utilisation was increased by parallelising the CPU and FPGA. Data transfer
times were addressed by improving data locality and using explicit actions.
The compute time was reduced by merging multiple functionalities into a
single FPGA kernel, allowing them to be executed in parallel, and by replicating
hardware. Lastly, optimal library parameters were found to balance the
trade-off between image accuracy and performance.
parts of the algorithm were accelerated using an FPGA and were executed
in parallel with the NUFFT. Therefore, the total time spent on the NUFFT
acts as a lower bound to the total execution time. This means, if the NUFFT
is accelerated, the XD-GRASP system proposed by this project will accom-
modate these accelerations, bringing down the total execution time with no
further changes necessary. The bounds for this are discussed in Section 6.
There are, inevitably, times when the CPU cannot be computing NUFFT
results as the input for the operation is not yet available. To minimise this
stalling time, the workload of appropriate FPGA kernels was reduced such
that they output the minimum data required to allow a NUFFT operation
to begin (in cases where the kernel output was one of the NUFFT inputs).
Not only does this reduce the idle time of the CPU, but it also increases the
proportion of the time the FPGA and CPU are computing simultaneously,
meaning the time cost of the FPGA can be covered by the NUFFT time,
bringing down the length of the overall execution.
Once CPU computation time was maximised, with a majority of FPGA
work consequently occurring at the same time as the NUFFT, the proportion
of algorithm time spent performing the NUFFT rose to 99.77% (as antici-
pated by the performance model).
4.4 Flatten, Merge, Repeat
Following on from the CPU utilisation optimisations, when determining the
functionality required for the FPGA kernels, it became important for the
FPGA to compute as much as possible whilst the NUFFT was executing. To
achieve this, the following steps were applied to the original C++ code:
1. Flatten: Inline methods, starting with the deepest method calls for
chains of nested calls

2. Merge: Combine loops which iterate over the same data into a single
loop

3. Repeat: Flatten and merge again, until no further code reductions are
possible
This kind of optimisation is not possible for the original MATLAB im-
plementation unless all of the operators are re-implemented, showing the
necessity of a C++ implementation of XD-GRASP, on top of avoiding the
computation overhead of MATLAB itself.
The optimisations compound over iterations, as the flatter code can be-
come simpler (in terms of the number of statements), making it easier to
merge with other code. In some cases for XD-GRASP, such as for the to-
tal variation operator (the understanding of which is not required for this
project), the effective transformation could be computed by hand to reduce
a two step process to one.
All of this, however, comes at a cost. The resulting code is often unrecog-
nisable from the original algorithm, as the borders between mathematical
operations become blurred. Whilst this puts maintainability at risk, this
optimisation opens the door to a similar process for FPGA kernels.
Figure 12: A simple case of flattening and merging for example code.
FPGAs are able to perform computations, even within the same kernel,
simultaneously. To fully utilise this dataflow parallelism, the kernels’ func-
tionalities were increased to include extra computation for purposes beyond
the original use of the kernel. For example, when computing an array result,
the kernel may accumulate the sum of the elements in the resulting array
in parallel, to avoid a second iteration over the data. The ‘extra’ function-
ality pushed onto a kernel arose from a similar flattening approach as was
used for the C++ code. However in this case, rather than merging loops of
code, kernels with similar iteration patterns over the same data had their
functionalities merged, outputting two results in parallel.
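The merging described above can be illustrated with a hypothetical fragment (not project code): two passes over the same data, one producing an element-wise product and one accumulating its sum, fused into a single pass so that an FPGA kernel can produce both results in parallel.

```cpp
// Hypothetical flatten-and-merge example. 'before' makes two passes
// over the data; 'after' fuses them into one, computing the product
// array and its running sum in the same iteration.

// Before: two separate loops over the same data.
void before(const float* a, const float* b, float* out, float& sum, int n) {
  for (int i = 0; i < n; ++i) out[i] = a[i] * b[i];
  sum = 0.0f;
  for (int i = 0; i < n; ++i) sum += out[i];
}

// After: one merged loop producing both results in a single pass.
void after(const float* a, const float* b, float* out, float& sum, int n) {
  sum = 0.0f;
  for (int i = 0; i < n; ++i) {
    out[i] = a[i] * b[i];
    sum += out[i];
  }
}
```

Both versions produce identical results; the merged form simply exposes the two computations to the same iteration, mirroring how kernels with matching iteration patterns had their functionalities combined.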
taken to transfer data from LMem to the kernel is covered by the time taken
to transfer data from the CPU to the kernel as the architecture has been
optimised such that both transfers occur at the same time. Since the datasets
are the same size, this property can be confirmed by a comparison of the
bandwidths from CPU to FPGA (via the PCIe bus) and from DRAM to
B/URAM. These bandwidths are 15750MB/s and 45000MB/s. Therefore,
it was possible to overlook double buffering here (Section 2.6) as it did not
yield any benefit. The exception to this is MultiplicationKernel which
has both inputs streamed from LMem. However this is called relatively few
times and has a low time cost. Thus double buffering its inputs and outputs
was not a priority.
In addition, the data stored in LMem is organised such that the reads
are sequential, the most efficient access pattern, avoiding the cost of random
access.
Whilst it would have been optimal to store the datasets in FMem, due
to its quicker access time than LMem, the memory requirements of the
kernel computations were prioritised over persisting data, as this allowed
for greater hardware replication (Section 4.6). Further, there is an FMem
requirement for the FIFO queues used for dataflow scheduling. In order for a
kernel to compute, it requires its inputs at the same time. In order to ensure
the inputs are ready simultaneously, FIFO queues are used to equalise the
latencies of the inputs, as demonstrated in Figure 13.
Figure 13: The use of FIFO queues by an FPGA to ensure data arrives at a kernel at the
same time.
compute time previously stated (Equation 12) becomes:

    T_c ≈ E / (pFactor · C · E_c) + Fill Time        (15)
KW OSF SW MSE nMSE SSIM Total Time
1 1.25 16 3.3308 0.0001 0.9322 8.5147
1 1.25 24 3.3407 0.0001 0.9319 8.8257
1 2.0 16 3.3967 0.0001 0.9173 8.8512
1 1.25 8 3.3308 0.0001 0.9322 8.9768
1 2.0 24 3.3967 0.0001 0.9173 9.3052
1 2.0 8 3.3967 0.0001 0.9173 9.3349
3 2.0 16 0.9791 0.0000 0.9918 9.7737
3 1.25 16 1.8403 0.0000 0.9708 9.8679
3 1.25 8 1.8403 0.0000 0.9708 10.3848
3 2.0 8 0.9791 0.0000 0.9918 10.4832
3 1.25 24 1.84 0.0000 0.9708 10.5530
3 2.0 24 0.9791 0.0000 0.9918 11.2008
7 2.0 16 1.2521 0.0000 0.9915 16.4491
7 2.0 8 1.2521 0.0000 0.9915 16.6098
7 1.25 8 3.2759 0.0000 0.9627 18.3329
7 2.0 24 1.2521 0.0000 0.9915 18.3366
7 1.25 16 3.2758 0.0000 0.9627 18.5891
7 1.25 24 3.2759 0.0000 0.9627 20.7342
Table 2: The NUFFT benchmarks, along with SSIM scores, for different gpuNUFFT
configurations of kernel widths (KW), sector widths (SW) and oversampling factors (OSF),
ordered by ‘Total Time’.
It can be seen that the biggest influence on time is the kernel width,
with larger kernel widths consistently producing longer reconstruction
times. There does not appear to be a strong connection between the other
parameters and the accuracy or time, so once a kernel width of 3 was
selected as a compromise between accuracy and speed, the most accurate
configuration was chosen. The respective scores for the pixel intensity
differences, MSE and nMSE, were all small enough to disregard, allowing
SSIM (Section 2.5) to be used as the sole indicator of accuracy. The
parameter search space was as follows:
• Kernel width: 1, 3, 7
• Oversampling factor: 1.25, 2.0
• Sector width: 8, 16, 24
It is worth noting that, during the parameter search, the MSE and SSIM
scores reflect a closeness to the reference, reconstructed using its own NUFFT
parameters. Weaker scores do not necessarily reflect a poor image, as can
be confirmed by qualitative analysis, however stronger scores provide more
robust evidence that the image quality has not been compromised.
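The exhaustive sweep over the configurations of Table 2 can be sketched as follows; reconstructAndScore is a hypothetical stand-in for a gpuNUFFT reconstruction followed by SSIM scoring, not the project's actual API:

```cpp
#include <cassert>
#include <initializer_list>
#include <vector>

// One gpuNUFFT configuration and its quality score.
struct Config { int kw; double osf; int sw; double ssim; };

// Hypothetical placeholder: the real call would reconstruct with the
// given parameters and compare against the reference via SSIM.
double reconstructAndScore(int /*kw*/, double /*osf*/, int /*sw*/) {
    return 0.99;  // placeholder score
}

// Enumerate every combination of kernel width, oversampling factor and
// sector width, matching the 18 rows of Table 2.
std::vector<Config> sweep() {
    std::vector<Config> results;
    for (int kw : {1, 3, 7})
        for (double osf : {1.25, 2.0})
            for (int sw : {8, 16, 24})
                results.push_back({kw, osf, sw, reconstructAndScore(kw, osf, sw)});
    return results;
}
```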
5 Implementation
With the optimisations to XD-GRASP outlined in Section 4, this section
describes how the various hardware was used to build the accelerated XD-
GRASP system, capable of reconstructing 320x320 images for 8 respiratory
phases.
Implementing the two types of NUFFT on an FPGA would have been
infeasible within this project's time-frame, so an existing alternative
accelerator, namely a GPU, was sought out instead. Further, the
NUFFT entails a convolution which, if implemented via matrix multiplication,
GPUs excel at executing due to their SIMT architectures.
5.1 Systems
Using a CPU for the NUFFT operations and an FPGA for everything else,
the algorithm was spread across these components and coordinated by the
CPU to make the two-prong heterogeneous system. As mentioned in Section
4, one large factor in achieving the speedup was to utilise the CPU as much
as possible.
The CPU triggers a set of asynchronous FPGA tasks, which are executed
sequentially on the FPGA, and waits for the completion of tasks before per-
forming NUFFT operations. Following the completion of a single NUFFT,
the CPU then triggers the next FPGA tasks (corresponding to the output of
the NUFFT) and, again, waits for the result. In some cases, the CPU waits
for a full batch of FPGA tasks to complete, as there is no available work to
do (NUFFTs).
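This schedule can be sketched with standard C++ futures; runFpgaTask and runNufft below are hypothetical placeholders for the FPGA task trigger and the NUFFT call, not the project's API:

```cpp
#include <cassert>
#include <future>
#include <vector>

// Illustrative sketch of the two-prong schedule: the CPU batch-triggers
// asynchronous FPGA tasks, then performs a NUFFT as soon as the task
// producing its input has completed.
int runFpgaTask(int phase) { return phase + 1; }  // stands in for an FPGA action
int runNufft(int input)    { return input * 2; }  // stands in for a NUFFT call

std::vector<int> reconstructIteration(int nPhases) {
    std::vector<std::future<int>> fpgaTasks;
    fpgaTasks.reserve(nPhases);
    for (int p = 0; p < nPhases; ++p)  // trigger the whole batch up front
        fpgaTasks.push_back(std::async(std::launch::async, runFpgaTask, p));

    std::vector<int> results;
    for (auto &task : fpgaTasks)       // wait for each task, then run its NUFFT
        results.push_back(runNufft(task.get()));
    return results;
}
```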
The addition of a GPU does not change the orchestration mechanics of
the two-prong heterogeneous system. In this case, rather than performing the
NUFFT itself, the CPU offloads this work onto the GPU synchronously, to
give a three-prong heterogeneous system. There is no benefit to asynchronous
NUFFT calls as the execution time is still dominated by the NUFFT. Thus
a simple swap of NUFFT operator is required, with no further changes to
the code.
To produce a 4D reconstruction (with dimensions in width, height, depth
and time), 2D images (width and height) are reconstructed for all the respira-
tory phases (the time dimension) at once. The missing dimension (depth) is
obtained through sequential runs of the 2D+time system on data for different
slices of the scan data.
The pre-processing step of XD-GRASP has been omitted as effort is better
spent accelerating the part of the algorithm which is executed for every slice
(2D image) of the 4D reconstruction.
An aerial view of the FPGA implementation, produced by MaxCompiler,
is available in Appendix A.
5.2 Coordination
In order to orchestrate the concurrent computation of the FPGA and CPU, a
synchronisation primitive was added (Listing 1) which allows FPGA tasks to
be called asynchronously and their resources to be freed once waited for.
Using this, the CPU could batch-trigger FPGA tasks and then proceed
to begin the NUFFT operations as soon as possible, using async_task_t to
ensure the input to the NUFFT was ready.
typedef struct {
    // The thread to wait for
    std::thread *run;

    // The arrays to be freed once `run` completes
    void **freeableDependents;

    // The size of the freeableDependents entry
    int nFreeableDependents;

    // Ad-hoc identifier to instruct how to free an array
    int *dependentsTypes;
} async_task_t;

Listing 1: Synchronisation primitive
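A hypothetical usage sketch of this primitive is given below; waitForTask is an assumed helper name, and for brevity every dependent is released with std::free, whereas the real code would dispatch on dependentsTypes:

```cpp
#include <cassert>
#include <cstdlib>
#include <thread>

typedef struct {
    std::thread *run;            // the thread to wait for
    void **freeableDependents;   // arrays to be freed once `run` completes
    int nFreeableDependents;     // number of entries in freeableDependents
    int *dependentsTypes;        // ad-hoc identifier for how to free an array
} async_task_t;

// Hypothetical wait helper: join the task's thread, then release the
// buffers it depended on (always via std::free in this sketch).
void waitForTask(async_task_t *task) {
    if (task->run != nullptr) {
        task->run->join();       // block until the asynchronous work finishes
        delete task->run;
        task->run = nullptr;
    }
    for (int i = 0; i < task->nFreeableDependents; ++i)
        std::free(task->freeableDependents[i]);
    task->nFreeableDependents = 0;
}
```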
• ku: The sorted (by respiratory phase) k-space trajectory
Note that the u suffix reflects data which has been sorted by respiratory
phase. When the above datasets are used by kernels, access to them is
offset to allow the appropriate values to be read.
In addition, several dimensions are defined here, to be used in
the mathematical definitions of kernels:
• nSamples: The number of samples acquired by each coil for one respiratory phase (nSamples = nx · nline)
• nPixels: The number of pixels in the image (nPixels = npx · npy)
5.3.1 CombineAcrossCoilsKernel
During the transformation from k-space to image space, the measurements
taken across multiple receiver coils during one respiratory phase on the MRI
scanner are combined by this kernel. The input data, x, is transposed before
the computation is performed, making use of FMem to do so. To achieve
this, there are nc separate memories in use, allowing nc parallel reads and
writes.
    out_i = ( Σ_{c=0}^{nc} tmp_{i,c} · b1_{i,c} ) / ( C · Σ_{c=0}^{nc} |b1_{i,c}|² )

    tmp = x^⊤

    C = nx·π / (2·nline)   if halve == 1
    C = nx·π / nline       otherwise
• Input:
• Output:
• Hardware requirements:
5.3.2 TransposedMultiplicationKernel
As the name suggests, the inputs are multiplied together before being output
in a transposed ordering. In addition, the temporal variance between x and
xNext is output.
• Input:
• Output:
• Hardware requirements:
– DSPs: 19 × pFactor + 1
– FMem: ≈6.5536MB∗
∗ As pFactor grows, the number of reads per kernel memory increases, causing
MaxCompiler to replicate the memory during compilation. This is a side
effect to be mitigated in the future; as such, a definite relationship between
the FMem requirement and pFactor cannot be given.
5.4 Dataflow
The dataflow for the algorithm implementation is shown here for the three-
prong system. In order to understand the dataflow for the two-prong system,
with no GPU, the division between CPU and GPU can be removed and these
sections merged.
Note that in these diagrams, crossing a dashed line denoting the boundary
between CPU and FPGA implies a PCIe data transfer. Only the more significant
datasets are shown.
Figure 16: The overall system dataflow.
Figure 17: The conjugate gradient method dataflow.
Figure 18: The gradient method dataflow.
Figure 19: The objective method dataflow.
6 Performance Evaluation
In order to assess the success of this project, both the accuracy of the
reconstruction and the computation time must be considered. A solution
which produces clear images in a satisfactorily short amount of time is
regarded as successful.
Results in this section consider the reconstruction of a 2D image, for 8
respiratory phases. For each implementation below, the expected 4D recon-
struction time can be calculated by multiplying the given reconstruction time
by the depth of the 3D image.
There are four classes of algorithm implementation to consider (to be
referred to by letter ID):
(a) Demo MATLAB implementation (the reference)
6.1 Benchmarking
First, the methods for measuring the execution of the FPGA, and conse-
quently using these results to compute an execution time for XD-GRASP,
need to be outlined.
Each kernel was executed and timed independently 200 times during each
benchmark run, and, after 8 benchmark runs, the mean execution time and
variance could be calculated.
The benchmarks for the CPU and GPU NUFFT operations were mea-
sured similarly.
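The benchmarking scheme above can be sketched in host code as follows; the kernel under test is passed as a callable, and the per-run means feed the overall mean and variance (an illustrative sketch, not the project's harness):

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

struct BenchResult { double mean; double variance; };

// Time `kernel` reps times per run, repeat for `runs` runs, then report
// the mean and variance of the per-run mean execution times.
BenchResult benchmark(const std::function<void()> &kernel,
                      int runs = 8, int reps = 200) {
    std::vector<double> runMeans;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i) kernel();
        auto t1 = std::chrono::steady_clock::now();
        runMeans.push_back(std::chrono::duration<double>(t1 - t0).count() / reps);
    }
    double mean = 0.0;
    for (double m : runMeans) mean += m;
    mean /= runs;
    double var = 0.0;
    for (double m : runMeans) var += (m - mean) * (m - mean);
    var /= runs;
    return {mean, var};
}
```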
Using these mean times and the number of invocations of each kernel
for XD-GRASP and the NUFFT, whether on the CPU or GPU, the total
execution time of XD-GRASP was calculated. This calculation followed the
performance model calculations closely, with a simple replacement of kernel
execution times producing a valid result for implementations (d) and (e).
The NUFFT benchmarks are displayed here (Table 4). The kernel bench-
marks are discussed in Section 6.4.
use here, providing statistical evidence that the algorithm has been correctly
implemented and optimisations have not led to incorrect reconstructions.
Whilst inconsistencies in floating-point handling between different hardware
lead to small differences, it can be seen from the results (Table
5) that this bears minimal influence on the output. Values are produced
from the mean of the scores across all respiratory phases. Results are not
available for configurations utilising an FPGA with Dataset 1 as the im-
plementation and optimisations are specific to the dimensions of Dataset 2.
To adjust the XD-GRASP implementation for the dimensions of Dataset 1,
the FPGA implementation would have to be redesigned to accommodate the
different data sizes, since the hardware utilisation has been maximised for a
specific set of dimensions.
(a) SSIM = 1 (b) SSIM = 0.9999 (c) SSIM = 0.9906
As can be seen from the image comparisons, the accuracy of the recon-
structions has not been compromised during acceleration, with qualitative
judgement matching quantitative analysis.
The speed increase from MATLAB to pure C++ is over 2.6x, consistent
with the expectation from the finer-grained control enabled by the language.
Further still, utilising an FPGA, with the NUFFT on the CPU, resulted in
a 1.54x speed-up against the CPU implementation, with the NUFFT being
the bottleneck. A more detailed breakdown of the execution times shown in
Table 7 reveals that 0.06% of the total execution time is incurred solely by
the FPGA, with the bulk of the time coming from the NUFFT execution on
CPU – one of the aims during optimisation.
In an effort to reduce the NUFFT time requirement further, the addition
of a GPU, using optimal parameters discussed in Section 4, pushes the speed-
up to 2.53x (when compared with the CPU-only implementation) and 1.88x
(when compared with the CPU/GPU implementation). Despite the perfor-
mance increase with the extra hardware, the execution of the algorithm is
still dominated by the NUFFT time requirement, incurring the same non-
NUFFT time cost as with CPU-based NUFFT.
    Speed-up of accel’d code = Original time of accel’d code / Accel’d time of accel’d code
                             = 8.66166 / 0.11223
                             = 77.18                                      (17)

Using Equation 17, the theoretical speed-up and limit of theoretical speed-up can
be obtained:

(For CPU NUFFT)

    Prop’n of time occupied by accel’d code = Non-NUFFT Time / Total Time
                                            = 8.66166 / 24.72233
                                            = 0.3504                      (18)

    Theoretical speed-up = 1 / ((1 − 0.3504) + 0.3504 / 77.18)
                         = 1.5286                                         (19)

    Limit of theoretical speed-up = 1 / (1 − 0.3504)
                                  = 1.5393                                (20)

    Theoretical speed-up = 1 / ((1 − 0.4698) + 0.4698 / 77.18)
                         = 1.8648                                         (22)

    Limit of theoretical speed-up = 1 / (1 − 0.4698)
                                  = 1.8862                                (23)
According to Amdahl's Law [37], applied to the CPU and FPGA implementation
(d), in comparison with the C++ implementation (b), the theoretical
speed-up is 1.5286. Further, the limit of the theoretical speed-up
is 1.5393, beneath which the observed speed-up (1.5241) lies. If the data
transfer times are omitted, the anticipated speed-up rises to 1.5383. The
calculations for this are shown in Equations 18, 19 and 20.
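The calculations in Equations 18 to 20 follow Amdahl's Law directly, and can be reproduced with a small helper:

```cpp
#include <cassert>
#include <cmath>

// Amdahl's Law: overall speed-up when a proportion p of the runtime is
// accelerated by a factor s. As s grows, the speed-up approaches 1/(1-p).
double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.3504 and s = 77.18 this reproduces the theoretical speed-up of Equation 19, and 1/(1 − p) gives the limit of Equation 20.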
As a machine with the same CPU and GPU used for benchmarking was
not available (discussed in Section 6.6), Amdahl's Law can only be applied
as an approximation. In this case, the theoretical speed-up is 1.8648 and
the limit of the theoretical speed-up is 1.8862. The observed speed-up is
1.8558. If the data transfer times are omitted, the anticipated speed-up rises
to 1.8843. The calculations for this are shown in Equations 21, 22 and 23.
The current system is limited by the performance of the NUFFT as this is
not accelerated by an FPGA. This would be the next focus for optimisation
when accelerating XD-GRASP further. Even with the maximum speed-up
attainable, the system cannot achieve real-time reconstruction as the time
demand of the NUFFT is too high.
If, however, the NUFFT operations completed in 0 time, the total execu-
tion time becomes 0.1122s, a speed-up of 580x over the MATLAB implemen-
tation and 220x over the C++ implementation, satisfying real-time recon-
struction. This suggests that no additional FPGA acceleration is required
and supports that the NUFFT should be accelerated next, to bridge the gap
between the current execution time and desired real-time performance.
CPU FPGA Perf. Model
Kernel
Bench. (s) Bench. (s) Expected (s)
CombineAcrossCoils 0.022510 0.00000778 0.00068675
Grad 0.007986 0.00000604 0.00000763
Multiplication 0.000038 0.00000710 0.00000891
Objective 0.000395 0.00000912 0.00001113
PostNUFFTType2 0.000076 0.00000995 0.00001113
TransposedMultiplication 0.002824 0.00000917 0.00010708
Update 0.000972 0.00000735 0.00002133
Table 8: A comparison of kernel benchmarks and expectations.
In most cases, the difference between the expected and observed kernel
execution times is reasonably small and can be attributed to the generali-
sations the performance model makes about kernel execution. However, for
CombineAcrossCoils the difference is surprisingly large and requires further
investigation into why this kernel completes in such an impressive time, in
comparison to the performance model expectation.
To ensure the validity of the FPGA kernel benchmarks, the variance for
each kernel was computed (Table 9).
FPGA
Kernel Variance
Bench. (s)
CombineAcrossCoils 0.00000778 1.77E-12
Grad 0.00000604 7.39E-13
Multiplication 0.00000710 4.51E-12
Objective 0.00000912 4.58E-12
PostNUFFTType2 0.00000995 3.17E-12
TransposedMultiplication 0.00000917 1.70E-12
Update 0.00000735 2.17E-13
Table 9: The mean and variance for kernel benchmarks.
pFactor configurations causing the FPGA compilation to fail to meet timing
at 200MHz without architecture-specific optimisation aimed to shorten the
critical path, the path of hardware with the largest delay. Meeting timing
requires finding an FPGA layout in which the hardware components, and
the connections between them, are placed such that timing requirements
are met, such as the inputs of a hardware component arriving at the same
time.
6.7 Bridging the Gap
The performance difference between the proposed system and a system ca-
pable of real-time reconstruction is one that can be closed. As mentioned,
the acceleration of the NUFFT would be one way in which the total recon-
struction time could be dramatically reduced. In terms of hardware, this
would not require any changes to the system, since the GPU used is signif-
icantly under-utilised (less than 40% utilisation at its peak). Thus, rather
than hardware changes, software changes would unlock the full potential of
the system.
However, since the power demands of CPUs and GPUs are larger than
those of an FPGA [31], it may be desirable to transfer all of the computation
onto an FPGA for uses where a low-power solution is required (such as
portable reconstruction). In this case, the FPGA would likely require more
hardware resources, mainly B/URAM, as this became the scarcest resource
in the current system.
7 Conclusion and Future Work
This report presents the modelling and optimisation approach for accelerating
XD-GRASP using an FPGA and, further, a GPU. The use of FPGAs has
successfully reduced the execution time of XD-GRASP and, consequently,
provided a framework within which the NUFFT can be optimised for this
use case. The speed-up reaches 4.05x when compared to the original MATLAB
implementation, and 1.5384x over a C++ implementation, which lies close
to the limit posed by Amdahl's Law, 1.5393. Likewise, the addition of a GPU
achieves speed-ups of 6.63x (against MATLAB) and 1.8843x (against a CPU/GPU
implementation), approaching the limit of 1.8862x. The work alleviates the
computational burden of the processes of XD-GRASP, excluding the NUFFT, by
performing them in parallel with the largest bottleneck of the algorithm.
level. Thus, a 4D reconstruction can be achieved in less time, rather
than sequentially reconstructing 2D images for the respiratory phases as the
current system does. In addition, this would better utilise the hardware, as
multiple kernels could operate at any given time.
To make this reconstruction more user-friendly, the system could output
the initial reconstruction as soon as possible, for preliminary viewing by the
doctors, whilst the iterative improvements are made to the reconstruction
(updating the view after each iteration).
Appendices
A Aerial View of the FPGA Implementation
B Further Kernel Details
The kernels not covered in Section 5.3 are discussed below.
B.1 GradKernel
This kernel performs the final step of calculating the gradient during
the conjugate gradient method, for one respiratory phase. Simultaneously,
it computes the sum of the element-wise multiplication of the coefficients of
res and their conjugates (the dot product of res and its conjugate).
In the cases where x == xPrev or x == xNext, the input denoting the
respiratory phase, r, ensures the correct calculation, which does not make
use of the equivalent input (i.e. xPrev and xNext respectively).
    v1 = f(x_i − xPrev_i)
    v2 = f(xNext_i − x_i)

    f(x) = x / √(x · x̄ + l1Smooth)

    conjMultRes = Σ_{i=0}^{nPixels} |res_i · res̄_i|
i=0
• Input:
– r: The index of the current respiratory phase (= r)
– l1Smooth: A conjugate gradient smoothing parameter
– TVWeight_dim1: A conjugate gradient parameter for the weight
of the temporal variance in the calculation
• Output:
• Hardware requirements:
– DSPs: 42 × pFactor
– FMem: 0
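GradKernel's arithmetic can be expressed as a reference sketch in software (not the dataflow code itself); `f` and `conjMultRes` follow the definitions above:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Smoothed finite difference: f(x) = x / sqrt(x * conj(x) + l1Smooth).
cplx f(cplx x, double l1Smooth) {
    return x / std::sqrt((x * std::conj(x)).real() + l1Smooth);
}

// Sum over |res_i * conj(res_i)|, i.e. the dot product of res with its
// own conjugate (which reduces to the sum of squared magnitudes).
double conjMultRes(const std::vector<cplx> &res) {
    double sum = 0.0;
    for (const cplx &r : res)
        sum += std::abs(r * std::conj(r));
    return sum;
}
```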
B.2 MultiplicationKernel
This kernel simply multiplies its inputs to produce the input for the type 1
NUFFT which produces the initial reconstruction.
• Input:
– kdatau: As stated above, for one respiratory phase and coil only
– wu: As stated above, for one respiratory phase only
• Output:
• Hardware requirements:
– DSPs: 6 × pFactor
– FMem: 0
B.3 ObjectiveKernel
As the final step of the objective function in backtracking line search, this
kernel produces a single score for the output of the type 2 NUFFT, for one
respiratory phase and coil.
    out = Σ_{i=0}^{nSamples} z_i · z̄_i

    z_i = x_i · wu_i − kdatau_i
• Input:
• Output:
• Hardware requirements:
– DSPs: 12 × pFactor
– FMem: 0
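As a software sketch of the same reduction (assuming real-valued density-compensation weights wu):

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// ObjectiveKernel's reduction: the sum over z_i * conj(z_i), where
// z_i = x_i * wu_i - kdatau_i, for one respiratory phase and coil.
double objective(const std::vector<cplx> &x, const std::vector<double> &wu,
                 const std::vector<cplx> &kdatau) {
    double out = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        cplx z = x[i] * wu[i] - kdatau[i];
        out += (z * std::conj(z)).real();  // z_i * conj(z_i) = |z_i|^2
    }
    return out;
}
```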
B.4 PostNUFFTType2Kernel
During the computation of the gradient, this kernel is used to process the
output of the type 2 NUFFT.
    out_i = x_i · wu_i − kdatau_i · wu_i
• Input:
• Output:
• Hardware requirements:
– DSPs: 12 × pFactor
– FMem: 0
Figure 25: The dataflow of PostNUFFTType2Kernel.
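The element-wise operation of PostNUFFTType2Kernel can be sketched in software as follows (assuming real-valued density-compensation weights wu):

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// out_i = x_i * wu_i - kdatau_i * wu_i, applied element-wise to the
// output of the type 2 NUFFT.
std::vector<cplx> postNufftType2(const std::vector<cplx> &x,
                                 const std::vector<double> &wu,
                                 const std::vector<cplx> &kdatau) {
    std::vector<cplx> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = x[i] * wu[i] - kdatau[i] * wu[i];
    return out;
}
```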
B.5 UpdateKernel
Element-wise, this kernel produces the sum of the elements of a, scaled by
scaleA, and b, scaled by scaleB. Further, the sum of the element-wise
multiplications between a and the result, out, is also output.
The name for this kernel derives from its most common use, updating
the reconstruction using the gradient. However, the uses of this kernel go
beyond this.
• Input:
– a: A vector
– scaleA: A scalar
– b: A vector
– scaleB: A scalar
• Output:
• Hardware requirements:
– DSPs: 24 × pFactor
– FMem: 0
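UpdateKernel's arithmetic can be sketched as follows (the scalars are taken to be real-valued, and the reduction is the literal sum of a_i · out_i as described above):

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// out_i = scaleA * a_i + scaleB * b_i, plus the reduction over a_i * out_i.
cplx updateKernel(const std::vector<cplx> &a, double scaleA,
                  const std::vector<cplx> &b, double scaleB,
                  std::vector<cplx> &out) {
    out.resize(a.size());
    cplx dot = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = scaleA * a[i] + scaleB * b[i];
        dot += a[i] * out[i];  // sum of element-wise products with a
    }
    return dot;
}
```

Its most common use is the gradient update, out = x + stepSize · grad, but the same kernel serves any scaled vector sum in the algorithm.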
References
[1] YouGov. Date accessed: 17/01/2020. Available from:
https://yougov.co.uk/topics/lifestyle/articles-reports/
2011/08/15/cancer-britons-most-feared-disease.
[3] Feng L, Axel L, Chandarana H, Block KT, Sodickson DK, Otazo R. XD-
GRASP: Golden-angle radial MRI with reconstruction of extra motion-
state dimensions using compressed sensing. Magnetic Resonance in
Medicine. 2016;75(2):775–788.
[5] Ramani R. Functional MRI: basic principles and emerging clinical
applications in anesthesiology and the neurological sciences. Oxford
University Press; 2019.
[8] Lustig M, Donoho D, Pauly JM. Sparse MRI: The application of com-
pressed sensing for rapid MR imaging. Magnetic Resonance in Medicine.
2007;58(6):1182–1195.
[11] Barnett A, Magland J. FINUFFT;. Date accessed: 17/01/2020. Avail-
able from: https://finufft.readthedocs.io/en/latest/index.
html.
[19] Dennis JE. Numerical methods for unconstrained optimization and
nonlinear equations; 1996.
[23] Feng L, Grimm R, Block KT, Chandarana H, Kim S, Xu J, et al.
Golden-angle radial sparse parallel MRI: Combination of compressed
sensing, parallel imaging, and golden-angle radial sampling for fast and
flexible dynamic volumetric MRI. Magnetic Resonance in Medicine.
2014;72(3):707–717.
[26] Wang Z, Bovik AC. Mean squared error: Love it or leave it? A new
look at Signal Fidelity Measures. IEEE Signal Processing Magazine.
2009;26(1):98–117.
[27] Wang Z, Bovik AC. A universal image quality index. IEEE Signal
Processing Letters. 2002;9(3):81–84.
[31] Siddiqui MF, Reza AW, Shafique A, Omer H, Kanesan J. FPGA imple-
mentation of real-time SENSE reconstruction using pre-scan and Emaps
sensitivities. Magnetic Resonance Imaging. 2017;44:82 – 91. Avail-
able from: http://www.sciencedirect.com/science/article/pii/
S0730725X17301674.
[32] Pruessmann KP, Weiger M, Scheidegger MB, Boesiger P. SENSE:
Sensitivity encoding for fast MRI. Magnetic Resonance in Medicine.
1999;42(5):952–962.
[34] Stone SS, Haldar JP, Tsao SC, Hwu WMW, Sutton BP, Liang ZP. Ac-
celerating advanced MRI reconstructions on GPUs. Journal of Parallel
and Distributed Computing. 2008;68(10):1307–1318.