
Hindawi

International Journal of Reconfigurable Computing


Volume 2018, Article ID 1403181, 17 pages
https://doi.org/10.1155/2018/1403181

Research Article
Algorithm and Architecture Optimization for 2D Discrete
Fourier Transforms with Simultaneous Edge Artifact Removal

Faisal Mahmood,¹ Märt Toots,² Lars-Göran Öfverstedt,² and Ulf Skoglund²


¹Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
²Structural Cellular Biology Unit, Okinawa Institute of Science and Technology (OIST), Okinawa, Japan

Correspondence should be addressed to Faisal Mahmood; [email protected]

Received 18 December 2017; Revised 11 May 2018; Accepted 10 June 2018; Published 6 August 2018

Academic Editor: João Cardoso

Copyright © 2018 Faisal Mahmood et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Two-dimensional discrete Fourier transform (DFT) is an extensively used and computationally intensive algorithm, with a plethora
of applications. 2D images are, in general, nonperiodic but are assumed to be periodic while calculating their DFTs. This leads
to cross-shaped artifacts in the frequency domain due to spectral leakage. These artifacts can have critical consequences if the
DFTs are being used for further processing, specifically for biomedical applications. In this paper we present a novel FPGA-based
solution to calculate 2D DFTs with simultaneous edge artifact removal for high-performance applications. Standard approaches
for removing these artifacts, using apodization functions or mirroring, either involve removing critical frequencies or necessitate
a surge in computation by significantly increasing the image size. We use a periodic plus smooth decomposition-based approach
that was optimized to reduce DRAM access and to decrease 1D FFT invocations. 2D FFTs on FPGAs also suffer from the so-called
“intermediate storage” or “memory wall” problem, which is due to limited on-chip memory, increasingly large image sizes, and
strided column-wise external memory access. We propose a “tile-hopping” memory mapping scheme that significantly improves
the bandwidth of the external memory for column-wise reads and can reduce the energy consumption up to 53%. We tested our
proposed optimizations on a PXIe-based Xilinx Kintex 7 FPGA system communicating with a host PC, which gives us the advantage
of further expanding the design for biomedical applications such as electron microscopy and tomography. We demonstrate that
our proposed optimizations can lead to 2.8× reduced FPGA and DRAM energy consumption when calculating high-throughput
4096 × 4096 2D FFTs with simultaneous edge artifact removal. We also used our high-performance 2D FFT implementation to
accelerate filtered back-projection for reconstructing tomographic data.

1. Introduction

Discrete Fourier Transform (DFT) is a commonly used and vitally important function for a vast variety of applications including, but not limited to, digital communication systems, image processing, computer vision, biomedical imaging, and biometrics [1, 2]. Fourier image analysis simplifies computations by converting complex convolution operations in the spatial domain to simple multiplications in the frequency domain. Due to the fundamental nature of 2D DFTs, they are commonly used in a variety of image processing applications such as tomographic image reconstruction [3], nonlinear interpolation, texture analysis, tracking, image quality assessment, and document analysis [4]. Because of their computational complexity, DFTs often become a computational constraint for applications requiring high-throughput and real-time or near real-time operations, specifically for machine vision applications [5]. Image sizes for many of these applications have also increased over the years, further contributing to the problem.

The Cooley-Tukey fast Fourier transform (FFT) algorithm [6], first proposed in 1965, reduces the complexity of DFTs from O(n²) to O(n log n) for a 1D DFT. However, in the case of 2D DFTs, 1D FFTs have to be computed in two dimensions, increasing the complexity to O(n² log n), thereby making 2D DFTs a significant bottleneck for real-time machine vision applications [7]. Recently, there has been substantial effort to achieve high-performance implementations of multidimensional FFTs to overcome this constraint [5, 7-14]. Due to their inherent parallelism and reconfigurability, Field Programmable Gate Arrays (FPGAs) are attractive targets for accelerating FFT computations.
Being a highly flexible platform, FPGAs can fully exploit the parallel nature of the FFT algorithm. 2D FFTs are generally calculated in stages where all elements of the first stage must be available before the second stage can be calculated. This creates the so-called "intermediate storage" problem associated with strided external memory access, specifically for large datasets.

While calculating 2D DFTs, it is assumed that the image is periodic, which is usually not the case. The nonperiodic nature of the image leads to artifacts in the Fourier transform, usually known as edge artifacts or series termination errors. These artifacts appear as several crosses of high-amplitude coefficients in the frequency domain, as seen in [15, 16]. Such edge artifacts can be passed to subsequent stages of processing, and in biomedical applications they may lead to critical misinterpretations of results. Efficiently removing such artifacts without compromising resolution is a major problem. Moreover, simultaneously removing these spurious artifacts while calculating the 2D FFT adds to the existing complexity of the 2D FFT kernel.

Contributions. In this paper we present solutions for a high-performance 2D DFT with simultaneous edge artifact removal (EAR) for applications which require high frame rate 2D FFTs, such as real-time medical imaging systems and machine vision for control. Our proposed optimizations lead to a high-performance solution for removing edge artifacts while the transform is being calculated, thus preventing time-consuming and possibly erroneous postprocessing steps. Moreover, the proposed optimizations reduce the overall energy consumption. This work builds on our previous work presented in [8]. Major contributions include the following:

(1) We propose optimized periodic plus smooth decomposition (OPSD) as an optimization for standard periodic plus smooth decomposition (PSD) for edge artifact removal (Section 4).

(2) Based on OPSD, we propose an architecture that can reduce the access to DRAM and can decrease the number of 1D FFT invocations by performing column-by-column operations on the fly (Section 4).

(3) Since OPSD is heavily dependent on an efficient FPGA-based 2D FFT implementation, which is limited by DRAM access problems, we design a memory mapping scheme based on "tile-hopping", which can reduce row activation overhead while accessing columns of data from the DRAM (Sections 5 and 6).

(4) The proposed OPSD and memory "tile-hopping" optimizations also lead to better energy performance as compared to row-major access (Section 6.4).

(5) We use our implementation as an accelerator for filtered back-projection (FBP), an analytical tomographic reconstruction method, and show that for large datasets our 2D FFT with edge artifact removal (EAR) can significantly improve reconstruction run time (Section 7).

As compared to our previous work [8], the current implementation achieves better runtime, i.e., 1.5 ms as compared to 32.4 ms for a 512 × 512 image. This increased performance is achieved by using state-of-the-art hardware for high-bandwidth communication between the FPGA and CPU modules, and by utilizing an efficient memory mapping scheme. The current work also analyzes the energy consumption of the proposed paradigm. Moreover, we also show how the real-time 2D FFTs can be used for tomographic reconstructions.

Paper Outline. The paper follows the FPGA image processing design methodology outlined in [16, 17], which involves carefully profiling the software solution to understand computational bottlenecks and overcoming them through careful reformulation of the algorithm within a parallel hardware framework. Section 2 gives a comprehensive background of high-performance 2D FFTs using FPGAs, the DRAM intermediate storage problem, and edge artifacts. Section 3 presents PSD in detail, which is the implementation objective. Section 4 presents OPSD, an optimized solution to reduce the latency of the serial part of the algorithm which limits overall performance. Section 5 presents a memory mapping scheme that can reduce the column-wise strided external memory access. Section 6 presents experimental results and explains target selection in detail, the experimental setup, and benchmark results. Section 7 presents filtered back-projection as a proposed application. Section 8 presents conclusions.

2. Background

2.1. High-Performance 2D FFTs Using FPGAs. There are several resource-efficient, high-throughput implementation approaches of multidimensional DFTs on a variety of different platforms. Many of these methods are software-based and have been optimized for efficient performance on general-purpose processors (GPPs), for example, Intel MKL [11], FFTW [9], and Spiral [10]. Implementations on GPPs can be readily adapted for a variety of scenarios. However, GPPs consume more power as compared to dedicated hardware and are not ideal for real-time embedded applications. Several application-specific integrated circuit- (ASIC-) based approaches have also been proposed [18-20], but since it is not easy to modify ASICs, they are not cost-effective solutions for rapid prototyping of image processing systems. Graphical Processing Units (GPUs), on the other hand, can achieve relatively high throughput but are energy inefficient and limit the portability of large-scale imaging systems.

Due to their inherent parallelism and reconfigurability, FPGAs are attractive for accelerating FFT computations, since they fully exploit the parallel nature of the FFT algorithm. FPGAs are particularly an attractive target for medical and biomedical imaging apparatus and instruments such as electron microscopes and tomographic scanners. Such devices do not have to be manufactured in bulk to justify application-specific solutions and require high bandwidth. Moreover, increasing mobility and portability constitute a future objective for many medical imaging systems.

Figure 1: (a) An overview of row-column decomposition (RCD) for 2D FFT implementation. Intermediate storage is required because all elements of the row-by-row operations must be available for column-by-column processing. (b) An overview of strided column-wise access from DRAM as compared to trivial row-wise access. An entire row of elements must be read into the row buffer even to access a single element within a specific row.

FPGAs are also more efficient for prototyping machine vision applications, since they are relatively more fine-grained when compared to GPPs and GPUs and can serve as a bridge between general-purpose and application-specific acceleration solutions.

2.2. DRAM Intermediate Storage Problem. There have been several high-throughput 2D FFT FPGA-based implementations over the past few years. Most of these rely on repeated invocations of 1D FFTs by row and column decomposition (RCD) with efficient use of memory [5, 7, 12, 21, 22]. RCD makes use of the fact that a 2D Fourier transform is separable and can be implemented in stages; i.e., a row-by-row 1D FFT can be followed by a column-by-column 1D FFT with intermediate storage (Figure 1). Most of the previous RCD-based 2D FFT FPGA implementation approaches have two major design challenges: (1) The 1D FFT implementation needs to have a reasonably high throughput and needs to be resource efficient. Moreover, spatial parallelism needs to be exploited by running several 1D FFTs simultaneously. (2) External DRAM needs to be efficiently addressed and to have a high bandwidth.
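For reference, the two-stage RCD flow of Figure 1(a) can be modeled in a few lines of NumPy (a software model only; the hardware instead streams rows through parallel 1D FFT cores and keeps the intermediate result in DRAM):

```python
import numpy as np

def fft2_rcd(image):
    """Row-column decomposition (RCD) model of a 2D FFT: stage 1 transforms
    every row, and the whole intermediate result must be stored before
    stage 2 can transform every column."""
    stage1 = np.fft.fft(image, axis=1)   # row-by-row 1D FFTs
    return np.fft.fft(stage1, axis=0)    # column-by-column 1D FFTs

img = np.random.rand(512, 512)
assert np.allclose(fft2_rcd(img), np.fft.fft2(img))
```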
Since the column-by-column 1D FFT requires data from all rows, intermediate storage becomes a major problem for large datasets. Many implementations rely on local memory such as resource-implemented block RAM for intermediate storage, which is not possible for large datasets [21]. Large datasets have to be offloaded to external DRAM because only a portion of the dataset that fits on the chip can be operated on at a given time. For complex image processing applications, this means repeated storage and access to the external memory during every stage of processing.

As shown in Figure 2(a), the DRAM hierarchy from top to bottom is rank, chip, bank, row, and column. Each DRAM bank (Figure 2(b)) has a row buffer that holds the most recently referred row. There is only one row buffer per bank, which means only one row from the data-grid can be accessed at once. Before accessing a row, it has to be activated by transferring the contents from internal capacitor storage into a set of parallel sense amplifiers. The row buffer is the so-called "fast buffer", because when a row is activated and placed in the buffer, any element can be accessed at random.

If a new row has to be activated and accessed into the row buffer, a row buffer miss occurs and requires a higher latency, A_miss (Figure 2(c)). On the contrary, if the desired row is already in the buffer, a row buffer hit or page hit occurs and the latency to access elements is substantially lower, A_hit. This implies A_miss = A_hit + C_r, where C_r is the overhead associated with accessing a new row to read a specific element (Figure 2(c)) [12]. There is also overhead involved in writing the row back to the data-grid (precharge), say, C_w.
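As a rough illustration of why this matters, the cost of reading one n-element image column follows directly from A_miss = A_hit + C_r; the latency values below are placeholders in arbitrary units, not figures for any particular device:

```python
def column_read_cost(n, elems_per_activation, a_hit=1.0, c_r=10.0):
    """Toy cost model: each row activation pays c_r on top of a_hit;
    elements already sitting in the open row buffer cost only a_hit."""
    activations = -(-n // elems_per_activation)        # ceiling division
    return activations * (a_hit + c_r) + (n - activations) * a_hit

# Row-major layout: every column element lands in a different DRAM row.
print(column_read_cost(4096, 1))    # 45056.0
# If each activated row instead held, say, 8 elements of the column:
print(column_read_cost(4096, 8))    # 9216.0
```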

Figure 2: (a) An overview of the DRAM hierarchy. (b) Image showing the structure of a single DRAM bank. (c) Flow chart explaining the additional latency introduced when a new row has to be referred to in the row buffer to access a specific element.

However, both C_r and C_w can be concealed by interleaving (switching between banks). Since row-wise access is trivial, the row-by-row 1D FFT part of RCD-based 2D FFT is easily accomplished. However, once the row-by-row 1D FFT is stored in the DRAM in standard row-major order, to access a single column, each row of the DRAM has to be accessed into the row buffer, rendering the read process extremely inefficient. This is typically the major bottleneck for high-throughput 2D FFTs (Figure 1(b)). We address this problem by designing a custom memory mapping scheme (Section 5).

2.3. Edge Artifacts. While calculating 2D DFTs, it is assumed that the image is periodic, which is usually not the case. The nonperiodic nature of the image leads to artifacts in the Fourier transform, usually known as edge artifacts or series termination errors. These artifacts appear as several crosses of high-amplitude coefficients in the frequency domain (Figure 3(b)). Such edge artifacts can be passed to subsequent stages of processing, and in biomedical applications they may lead to critical misinterpretations of results. No current 2D FFT FPGA implementation addresses this problem directly. These artifacts may be removed during preprocessing, using mirroring, windowing, zero padding, or postprocessing, e.g., filtering techniques. These techniques are usually computationally intensive, involve an increase in image size, and also tend to modify the transform.

The most common approach is by ramping the image at corner pixels to slowly attenuate the edges. Ramping is usually accomplished by an apodization function such as a Tukey (tapered cosine) or a Hamming window, which smoothly reduces the intensity to zero. Such an approach can be implemented on an FPGA as a preprocessing operation by storing the window function in a look-up table (LUT) and multiplying it with the image stream before calculating the FFT [16]. Although this approach is not extremely computationally intensive for small images, it inadvertently removes necessary information from the image. Loss of this information may have serious consequences if the image is being further processed with several other images to reconstruct a final image that is used for diagnostics or other decision-critical applications. Another common method is by mirroring the image from N × N to 2N × 2N. Doing so makes the image periodic and reduces edge artifacts. However, this not only increases the size of the image by 4×, but also makes the transform symmetric, which generates an inaccurate phase component.

Figure 3: (a) An image with nonperiodic boundary. (b) 2D DFT of (a). (c) DFT of the smooth component, i.e., the removed artifacts from (a). (d) Periodic component, i.e., DFT of (a) with edge artifacts removed. (e) Reconstructed smooth component. (f) Reconstructed periodic component.

Simultaneously removing the edge artifacts while calculating a 2D FFT imposes an additional design challenge, regardless of the method used. However, these artifacts must be removed in applications where they may be propagated to subsequent processing levels. An ideal method for removing these artifacts should involve making the image periodic while removing minimal information from the image. Periodic plus smooth decomposition (PSD), first presented by Moisan [4] and used in [23-25], is an ideal method for removing edge artifacts (specifically for biomedical applications) because it does not directly intervene with pixels besides those of the boundary and does not increase image size. Moreover, its inherently parallel nature makes it ideal for a high-throughput, FPGA-based implementation. We have further optimized the original PSD decomposition algorithm to make the overall implementation much more efficient, by decreasing the number of required 1D FFT invocations and by reducing external DRAM utilization (Section 4).

2.4. LabVIEW FPGA High-Level Design Environment. A major concern while designing complex image processing hardware accelerators is how to fully harness the divide-and-conquer approach. Algorithms that have to be mapped to multiple FPGAs are often marred by communication problems, and custom FPGA boards reduce flexibility for large-scale and evolving designs. For rapid prototyping of our algorithms, we used LabVIEW FPGA 2016 (National Instruments), a robust data-flow-based graphical design environment. LabVIEW FPGA provides integration with National Instruments (NI) Xilinx-based reconfigurable hardware, allowing efficient communication with a host PC and high-throughput communication between multiple FPGAs through a PXIe (PCI eXtensions for Industry Express) bus. LabVIEW FPGA also enables us to integrate external Hardware Description Language (HDL) code and gives us the flexibility to expand our design for future processing stages. We used NI PXIe-7976R FPGA boards that have a Xilinx Kintex 7 FPGA and 2GB high-bandwidth (10GB/s) external memory. This platform has already been extensively used for rapid prototyping of communication standards and protocols before moving to ASIC designs. The optimizations and designs we present here are scalable to most reconfigurable computing-based systems. Moreover, LabVIEW FPGA provides efficient high-level control over memory via a smart memory controller.

3. Periodic Plus Smooth Decomposition (PSD) for Edge Artifact Removal (EAR)

Periodic plus smooth decomposition (PSD) involves decomposing the image into a periodic and a smooth component to remove edge artifacts with minimal loss of information from the image [4]. This section presents an overview of the PSD algorithm and profiles the algorithm for possible parallelization and optimization to achieve an efficient FPGA implementation.

Let us have a discrete n by m gray-scale image I on a finite domain Ω = {0, 1, ..., n − 1} × {0, 1, ..., m − 1}. The discrete Fourier transform (DFT) of I is defined as

$$\hat{I}(s,t) = \sum_{(i,j)\in\Omega} I(i,j)\,\exp\left(-\iota 2\pi\left(\frac{si}{n}+\frac{tj}{m}\right)\right). \quad (1)$$

This is equivalent to a matrix multiplication WIV, where

$$W = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & w & w^{2} & \cdots & w^{n-1} \\ 1 & w^{2} & w^{4} & \cdots & w^{2(n-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & w^{n-2} & w^{2(n-2)} & \cdots & w^{(n-2)(n-1)} \\ 1 & w^{n-1} & w^{2(n-1)} & \cdots & w^{(n-1)(n-1)} \end{pmatrix} \quad (2)$$

and

$$w_k = \exp\left(-i\,\frac{2\pi}{n}\,k\right) = \exp\left(-i\,\frac{2\pi k}{n}\right). \quad (3)$$

V has the same structure as W but is m-dimensional. w_k has period n, which means that w_k = w_{k+ln}, ∀k, l ∈ ℕ; therefore,

$$W = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 & 1 \\ 1 & w & w^{2} & \cdots & w^{n-2} & w^{n-1} \\ 1 & w^{2} & w^{4} & \cdots & w^{n-4} & w^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & w^{n-2} & w^{n-4} & \cdots & w^{4} & w^{2} \\ 1 & w^{n-1} & w^{n-2} & \cdots & w^{2} & w^{1} \end{pmatrix}. \quad (4)$$

Since in general I is not (n, m)-periodic, there will be high-amplitude edge artifacts present in the DFT, stemming from sharp discontinuities between the opposing edges of the image, as shown in Figure 3(b). Reference [4] proposed a decomposition of I into a periodic component P, which is periodic and captures the essence of the image with all high-frequency details, and a smoothly varying background S, which recreates the discontinuities at the borders, so I = P + S. Periodic plus smooth decomposition can be computed by first constructing a border image B = R + C, where R represents the boundary discontinuities when transitioning row-wise and C when going column-wise:

$$R(i,j) = \begin{cases} I(n-1-i,\,j) - I(i,\,j), & i = 0 \text{ or } i = n-1 \\ 0, & \text{otherwise} \end{cases} \qquad C(i,j) = \begin{cases} I(i,\,m-1-j) - I(i,\,j), & j = 0 \text{ or } j = m-1 \\ 0, & \text{otherwise.} \end{cases} \quad (5)$$

It is obvious that the structure of the border image B is simple, with nonzero values only in the edges as shown below:

$$B = R + C = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1,m-1} & b_{1m} \\ b_{21} & 0 & \cdots & 0 & -b_{21} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ b_{n-1,1} & 0 & \cdots & 0 & -b_{n-1,1} \\ b_{n1} & -b_{12} & \cdots & -b_{1,m-1} & b_{nm} \end{pmatrix}. \quad (6)$$

The DFT of the smooth component S can then be found by the following formula:

$$\hat{S}(s,t) = \frac{\hat{B}(s,t)}{2\cos(2\pi s/n) + 2\cos(2\pi t/m) - 4}, \qquad \forall (s,t) \in \Omega \setminus \{(0,0)\}. \quad (7)$$

The DFT of the image I with edge artifacts removed is then P̂ = Î − Ŝ. Figures 3(c) and 3(d) show the DFT of the smooth and periodic components, respectively. Figures 3(e) and 3(f) show the reconstructed periodic and smooth components. On reconstruction, it is evident that there is negligible visual difference between the actual image and the periodic reconstructed image for this example.

Figure 4: A top-level architecture for OPSD using two FPGAs and a host PC connected over a high-bandwidth bus. The steps are associated with Algorithm 1.

3.1. Profiling PSD for FPGA Implementation. Algorithm 1 summarizes the overall PSD implementation. There are several ways of arranging the algorithm. We have arranged it so that DFTs of the periodic and smooth components are readily available for further processing stages. For best results, both the periodic and smooth components should undergo similar processing stages and should be added back together before displaying the result. However, depending on the application it might be acceptable to discard the periodic component completely. For an n × m image, steps A and C have a complexity of O(nm log(nm)), and steps B and D have complexity O(m + n) and O(mn), respectively. Computationally, the performance of PSD is limited by steps A and C.

Figure 4 shows a proposed top-level architecture, where step A and steps B, C, and D are completed on separate FPGAs while step E can be done on the host PC. Two high-end FPGAs are used instead of one because the resources on one FPGA are insufficient to compute a large size 2D FFT as well as its edge artifact removal components. The overall performance may be limited by FPGA 2, where most of the serial part of the algorithm lies. There are two major factors which limit the throughput of such a design:

(1) While FPGA 1 and FPGA 2 can run in parallel, the result of step A from FPGA 1 has to be stored on the host PC while steps B, C, and D are completed on FPGA 2 before step E can be completed on the host PC.

(2) The DRAM intermediate storage problem explained in Section 2.2 and Figures 1 and 2 has to be addressed, since strided access to the DRAM for column-wise operations can significantly limit throughput.

As for (1), it has been addressed in the next section, where we make use of the inherent symmetry of the boundary image to reduce the time required to compute the 2D FFT of the boundary image. As for (2), it has been addressed by designing a semi-custom memory mapping controller which "tiles" the DRAM floor and "hops" between several tiles so as to minimize strided memory access.

4. Proposed: Optimized Periodic Plus Smooth Decomposition (OPSD)

In this section, we optimize the original PSD algorithm. This optimization is to effectively reduce the number of 1D FFT invocations and the number of times the DRAM is accessed. Equation (6) shows that, except for the corners, the boundary image B is symmetrical in the sense that boundary rows and columns are algebraic negations of each other. In total, B has n + m − 1 unique elements, with the following relations between corners with respect to columns and rows:

$$b_{11} = r_{11} + c_{11}, \quad b_{1m} = r_{1m} - c_{11}, \quad b_{n1} = -r_{11} + c_{n1}, \quad b_{nm} = -r_{1m} - c_{n1} \quad (8)$$

$$\implies b_{nm} = -b_{11} - b_{1m} - b_{n1}. \quad (9)$$

In computing the FFT of B, one normally proceeds by first running 1D FFTs column-by-column and then 1D FFTs row-by-row, or vice versa. An FFT of a column vector v with length n is Wv, where W is given in (4). The column-wise FFT of the matrix B is then

$$\hat{B} = WB. \quad (10)$$

Let us have a closer look at the first column, denoted by B_{·1}. The 1D FFT of this vector is

$$\hat{B}_{\cdot 1} = WB_{\cdot 1} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & w & \cdots & w^{n-1} \\ 1 & w^{2} & \cdots & w^{2(n-1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & w^{n-2} & \cdots & w^{(n-2)(n-1)} \\ 1 & w^{n-1} & \cdots & w^{(n-1)(n-1)} \end{pmatrix} \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \\ \vdots \\ b_{n-1,1} \\ b_{n1} \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} b_{i1} \\ \sum_{i=1}^{n} b_{i1} w^{i-1} \\ \sum_{i=1}^{n} b_{i1} w^{2(i-1)} \\ \vdots \\ \sum_{i=1}^{n} b_{i1} w^{(n-2)(i-1)} \\ \sum_{i=1}^{n} b_{i1} w^{(n-1)(i-1)} \end{pmatrix}. \quad (11)$$

It can be shown that the 1D FFT of the column j ∈ {2, 3, ..., m − 1} is

$$\hat{B}_{\cdot j} = WB_{\cdot j} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & w & \cdots & w^{n-1} \\ 1 & w^{2} & \cdots & w^{2(n-1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & w^{n-2} & \cdots & w^{(n-2)(n-1)} \\ 1 & w^{n-1} & \cdots & w^{(n-1)(n-1)} \end{pmatrix} \begin{pmatrix} b_{1j} \\ 0 \\ 0 \\ \vdots \\ 0 \\ -b_{1j} \end{pmatrix} = b_{1j} \begin{pmatrix} 0 \\ 1 - w^{n-1} \\ 1 - w^{n-2} \\ \vdots \\ 1 - w^{2} \\ 1 - w \end{pmatrix} = b_{1j}\,\hat{v}, \quad (12)$$

Input: I(i, j) of size n × m
Output: P̂(s, t), Ŝ(s, t)
Step A: Compute the 2D DFT of image I(i, j):
1:  I(i, j) --F--> Î(s, t)
Step B: Compute periodic border B:
2:  while 1 < j < m do
3:    while 1 < i < n do
4:      if (i = 0 ∨ i = n − 1) then
5:        R(i, j) ← I(n − 1 − i, j) − I(i, j)
6:      else
7:        R(i, j) ← 0
8:      end if
9:      if (j = 0 ∨ j = m − 1) then
10:       C(i, j) ← I(i, m − 1 − j) − I(i, j)
11:     else
12:       C(i, j) ← 0
13:     end if
14:   end while
15: end while
16: B ← R + C
Step C: Compute the 2D DFT of B, i.e., B̂(s, t):
17: B(i, j) --F--> B̂(s, t)
Step D: Compute the smooth component Ŝ(s, t):
18: D(s, t) ← 2 cos(2πs/n) + 2 cos(2πt/m) − 4
19: Ŝ(s, t) ← B̂(s, t) ÷ D(s, t)
Step E: Compute the periodic component P̂(s, t):
20: P̂(s, t) ← Î(s, t) − Ŝ(s, t)
21: return P̂(s, t), Ŝ(s, t)

Algorithm 1: Periodic plus smooth decomposition (PSD).
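A compact NumPy sketch of Algorithm 1 can serve as a software golden reference when checking hardware output (zero-based indexing; an illustrative model, not the production implementation):

```python
import numpy as np

def psd(I):
    """NumPy sketch of Algorithm 1: returns (P_hat, S_hat) for image I."""
    n, m = I.shape
    # Step B: border image B = R + C, nonzero only on the outer rows/columns.
    B = np.zeros((n, m))
    B[0, :]  += I[-1, :] - I[0, :]       # R, first row
    B[-1, :] += I[0, :]  - I[-1, :]      # R, last row
    B[:, 0]  += I[:, -1] - I[:, 0]       # C, first column
    B[:, -1] += I[:, 0]  - I[:, -1]      # C, last column
    # Steps A and C: 2D DFTs of the image and of the border image.
    I_hat = np.fft.fft2(I)
    B_hat = np.fft.fft2(B)
    # Step D: smooth component via equation (7); the (0, 0) bin is left at 0.
    s = np.arange(n)[:, None]
    t = np.arange(m)[None, :]
    D = 2 * np.cos(2 * np.pi * s / n) + 2 * np.cos(2 * np.pi * t / m) - 4
    S_hat = np.divide(B_hat, D, out=np.zeros_like(B_hat), where=(D != 0))
    # Step E: periodic (artifact-free) component.
    return I_hat - S_hat, S_hat
```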

and the 1D FFT of the last column B_{·m} is

$$\hat{B}_{\cdot m} = WB_{\cdot m} \quad (13)$$

$$= \begin{pmatrix} -\sum_{i=1}^{n} b_{i1} \\ -\sum_{i=1}^{n} b_{i1} w^{i-1} + (b_{11} + b_{1m})(1 - w^{n-1}) \\ -\sum_{i=1}^{n} b_{i1} w^{2(i-1)} + (b_{11} + b_{1m})(1 - w^{2(n-1)}) \\ \vdots \\ -\sum_{i=1}^{n} b_{i1} w^{(n-2)(i-1)} + (b_{11} + b_{1m})(1 - w^{(n-2)(n-1)}) \\ -\sum_{i=1}^{n} b_{i1} w^{(n-1)(i-1)} + (b_{11} + b_{1m})(1 - w^{(n-1)(n-1)}) \end{pmatrix} \quad (14)$$

$$\hat{B}_{\cdot m} = -\hat{B}_{\cdot 1} + (b_{11} + b_{1m})\,\hat{v}. \quad (15)$$

Therefore, the column-wise FFT of the matrix B is

$$\hat{B} = \begin{pmatrix} \hat{B}_{\cdot 1} & b_{12}\hat{v} & \cdots & b_{1,m-1}\hat{v} & -\hat{B}_{\cdot 1} + (b_{11} + b_{1m})\hat{v} \end{pmatrix}. \quad (16)$$

To compute the column-by-column 1D FFT of the matrix B, we only have to compute the FFT of the first vector and then use the appropriately scaled vector, v̂, to derive the remainder of the columns. The row-by-row FFT still has to be calculated in the normal way with row bursts. Algorithm 2 presents a summary of the shortcut for calculating B̂(s, t). The steps presented in Algorithm 2 can replace step C in Algorithm 1. By reducing column-by-column 1D FFT computations for the boundary image, this method can significantly reduce the number of 1D FFT invocations, reduce the overall DRAM access, and eliminate problematic column-wise strided DRAM access for an efficient FPGA-based implementation. For column-wise operations, a single 1D FFT of size n is required rather than m 1D FFTs of size n. Moreover, since one has to simply store one column of data, it can be stored in on-chip local memory (BRAM or SRAM). This can be implemented by temporarily storing the initial vector B̂_{·1} and the scaling factors b_{1j} in the block RAM/register memory, drastically reducing DRAM access and lowering the number of required 1D FFT invocations. A performance evaluation for this has been presented in the results section.

Table 1 shows a comparison of mirroring, PSD, and our proposed OPSD with respect to DRAM access points. Mirroring has been used for comparison purposes because it is an alternative technique that reduces edge artifacts while maintaining maximum amplitude information. However, due to replication of the image, most of the phase information is lost.

Input: B(i, j) of size n × m
Output: B̂(s, t)
2D F(B) ⟺ B --F_cw--> B̂_cw --F_rw--> B̂
Column-by-Column DFT via Symmetrical Shortcut:
1: B_·1 --F--> B̂_·1
2: while 1 < j < m do
3:   B̂_·j ← b_1j v̂
4: end while
5: B̂_·m ← −B̂_·1 + (b_11 + b_1m) v̂
6: B̂_cw ← Concatenate [B̂_·1  B̂_·2  ...  B̂_·m−1  B̂_·m]
Row-by-Row DFT:
7: B̂_cw --F_rw--> B̂(s, t)
8: return B̂(s, t)
cw: column-wise/column-by-column
rw: row-wise/row-by-row

Algorithm 2: Proposed symmetrically optimized computation of B̂(s, t).
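The symmetry shortcut of Algorithm 2 and (16) can be sanity-checked numerically against a direct column-by-column FFT (a self-contained NumPy test with arbitrary sizes, independent of the FPGA design):

```python
import numpy as np

n, m = 64, 48
rng = np.random.default_rng(0)

# Border image with the structure of equation (6).
B = np.zeros((n, m))
B[0, :]  = rng.standard_normal(m)            # first row: b_1j
B[1:, 0] = rng.standard_normal(n - 1)        # rest of first column: b_i1
B[-1, 1:-1] = -B[0, 1:-1]                    # last row negates the first row
B[1:-1, -1] = -B[1:-1, 0]                    # last column negates the first column
B[-1, -1] = -B[0, 0] - B[0, -1] - B[-1, 0]   # corner, equation (9)

# Column-wise FFTs via Algorithm 2 / equation (16).
w = np.exp(-2j * np.pi / n)                  # as in equation (3)
v_hat = 1 - w ** ((n - np.arange(n)) % n)    # (0, 1 - w^{n-1}, ..., 1 - w)
B1_hat = np.fft.fft(B[:, 0])                 # the only full-length column FFT
B_cw = np.empty((n, m), dtype=complex)
B_cw[:, 0] = B1_hat
B_cw[:, 1:-1] = np.outer(v_hat, B[0, 1:-1])  # scaled copies of v_hat
B_cw[:, -1] = -B1_hat + (B[0, 0] + B[0, -1]) * v_hat

assert np.allclose(B_cw, np.fft.fft(B, axis=0))   # matches the direct FFT
```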

Figure 5: Graph showing DRAM access (equal to the number of DFT points to be computed) with increasing image size for mirroring, periodic plus smooth decomposition (PSD), and our proposed optimized periodic plus smooth decomposition (OPSD).

Figure 5 graphically shows that our OPSD method significantly reduces reading from external memory and reduces the overall number of DFT computations required. It should be noted that such optimization is only possible for either column-wise or row-wise operations, because after either of these operations the output is not symmetrical anymore. Completing the column-wise operation first prevents strided reading; however, this results in strided writing to the DRAM before row-wise traversal can start. This can be minimized by making efficient use of the local block RAM. Output columns are stored in the block RAM before being written to the DRAM in patches such that each row buffer access writes elements in several columns.

5. Proposed: Tile-Hopping Memory Mapping

In this section we propose a tile-hopping external memory access pattern for efficiently addressing external memory during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT. As explained in Section 2.2 and Figure 2, column-wise reads from DRAM can be costly due to the overhead associated with activating and precharging. In the worst case scenario it can limit DRAM bandwidth by up to 80% [26]. This is a problem with all such image processing operations where one stage of the processing has to be completed on all elements before the next stage can start. In the past there have been several implementations using local memory; however, with growing demand for larger image sizes, external memory has to be used. There have been several DRAM remapping attempts before, such as [5, 13]. They propose a tile-based approach for 2D FFTs where an n × n image (input array) is divided into n/k × n/k tiles, where k is the size of the DRAM row buffer, which allows for very high-bandwidth DRAM access.
Table 1: Comparing mirroring, PSD, and OPSD.

Algorithm                       DRAM Access Points     DFT Points
Mirroring                       8NM                    8NM
P + S Decomposition (PSD)       4NM                    4NM
Optimized PSD (Proposed)        3NM + N + M − 1        3NM + M

Although this method may be ideal to maximize the DRAM performance for 2D FFTs, it incurs a high resource cost associated with local memory transposition and storing large chunks of data (an entire row/column of tiles) in the local memory. Moreover, tiling in the image domain also requires remapping row-by-row operations. Another approach to reducing strided DRAM access has been presented in [27]. They present a 2D decomposition algorithm which decomposes the problem into smaller subblock 2D FFTs which can be performed locally. This introduces extra row and column data exchanges, and the total number of operations is increased from O(n² log n) to O(n²(1 + log n)). Other implementations do not address the external memory issue in detail.

We propose tile-hopping address mapping, which reduces the number of row activations required to access a single column. Unlike [5], our approach does not require significant local operations or storage. The proposed memory mapping controller was designed on top of LabVIEW FPGA's existing memory controller, which efficiently controls interleaving and issues activation and precharge commands in parallel with data transfer. The reduced number of row activations also reduces the amount of energy required by the DRAM. This will be further discussed in the energy evaluation presented in Section 6.4 and Table 3, where we demonstrate that the proposed tile-hopping address mapping can reduce energy consumption by up to 53%.

Instead of writing the results of the row-by-row 1D FFT in row-major order, we remap the results in a blocked or tiled pattern as shown in Figure 6. This means that when accessing an image column, several elements of that column can be retrieved from a single DRAM row access. For an n × n image, each row of size n can be divided into h tiles (i.e., n = h·N(t), where N(t) is the number of elements in each tile). These tiles can be remapped onto the DRAM floor as shown in Figure 6. If the size of the row is small enough, it may be possible to convert it into a single tile (i.e., h = 1). However, this is unlikely for realistic image sizes. For a tile of size p × q, a single row of the image is written into the DRAM by transitioning through p·h rows. If k is the size of the row buffer, there are k/q distinct tiles represented in each DRAM row, and it contains the same number of elements from a single image column. Given regular row-major storage, when accessing column-wise elements one would have to transition through n DRAM rows to read a single image column. However, with this approach, when accessing an image column, k/q elements of that column can be read from a single DRAM row which has been referred to in the row buffer. Although the cost of writing an image row is higher when compared to a standard row-major DRAM writing pattern (i.e., referring to p·h rather than n rows), the number of DRAM row referrals during a column-wise read is reduced to n·q/k, which is n·(1 − q/k) fewer row referrals for a single column read.

We refer to this method as tile-hopping because it entails mapping data onto several DRAM tiles and then hopping between the tiles such that several elements of the image column exist in a DRAM row which has been referred to in the row buffer. Although this mapping scheme has been developed for the column-wise access required during 2D FFT calculation, the scheme is general and can be adapted to other applications. A performance evaluation of this method has been presented in the experiments section.
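The sketch below illustrates the idea in software. The exact address arrangement used by the memory controller is not spelled out at this level of detail, so the layout in the sketch is an assumed, illustrative one chosen to reproduce the k/q-elements-per-activation property described above; the tile and row-buffer sizes are likewise example values.

```python
def tile_hopping_address(i, j, n, p, q):
    """Illustrative tiled layout for an n x n image with p x q tiles: for a
    fixed tile-column and intra-tile row, the q-element segments of
    vertically adjacent tiles are stored back to back, so one activated
    DRAM row supplies k/q elements of a single image column."""
    tile_row, tile_col = i // p, j // q
    r, c = i % p, j % q
    group = tile_col * p + r                 # (tile-column, intra-tile row) group
    return (group * (n // p) + tile_row) * q + c

def dram_rows_touched(addresses, k):
    """Distinct DRAM rows (row activations) needed for an access pattern."""
    return len({a // k for a in addresses})

n, p, q, k = 4096, 8, 8, 1024                # k: row-buffer size in elements
column = [(i, 0) for i in range(n)]          # read one image column
row_major = [i * n + j for i, j in column]
tiled = [tile_hopping_address(i, j, n, p, q) for i, j in column]
print(dram_rows_touched(row_major, k))       # 4096: one activation per element
print(dram_rows_touched(tiled, k))           # 32 = n*q/k, as derived above
```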
6. Experimental Results and Analysis

6.1. Hardware Configuration and Target Selection. Since 2D DFTs are usually used for simplifying convolution operations in complex image processing and machine vision systems, we needed to prototype our design on a system that is expandable for the next levels of processing. As mentioned earlier, for rapid prototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme, we used a PXIe-based reconfigurable system. PXIe is an industrial extension of a PCI system with an enhanced bus structure that gives each connected device dedicated access to the bus with a maximum throughput of 24GB/s. This allows a high-speed dedicated link between a host PC and several FPGAs. The LabVIEW FPGA graphical design environment is efficient for rapid prototyping of complicated signal and image processing systems. It allows us to effectively integrate external HDL code and LabVIEW graphical design on a single platform. Moreover, it allows a combination of high-level synthesis (HLS) and custom logic. Since current HLS tools have limitations when it comes to complex image and signal processing tasks, LabVIEW FPGA tries to bridge these gaps by streamlining the design process.

We used FlexRIO (Flexible Reconfigurable I/O) FPGA boards plugged into a PXIe chassis. PXIe FlexRIO FPGA boards are adaptable and can be used to achieve high throughput, because they allow direct data transfer between multiple FPGAs at rates as high as 8GB/s. This can significantly simplify multi-FPGA systems, which usually communicate via a host PC. This feature allows expansion of our system to further processing stages, making it flexible for a variety of applications. Figure 7 shows a basic overview of a PXIe-based, multi-FPGA system with a host PC controller connected through a high-speed bus on a PXIe chassis. Specifically, we used two NI PXIe-7976R FlexRIO boards, which have a Kintex 7 FPGA and 2GB external DRAM with a theoretical data bandwidth of up to 10GB/s. This FPGA board was plugged into a PXIe-1085 chassis along with a PXIe-8880 Intel Xeon PC controller. PXIe-1085 can hold up to 16 FPGAs and has 8 GB/s per-slot dedicated bandwidth and an overall system bandwidth of 24 GB/s.

Figure 6: Image showing tile-hopping. (a) Image-level view showing tiles. (b) DRAM-level view showing tile placement while writing. (c) DRAM-level view showing column reading from the tiles.

Figure 7: Block diagram of a PXIe-based multi-FPGA system with a host PC controller connected through a high-speed bus on a PXIe chassis [8].

6.2. Experimental Setup. As per Algorithms 1 and 2, discussed in previous sections, the implementation involves five stages: (A) calculating the 2D FFT of an image frame, (B) calculating the boundary image, (C) calculating the 2D FFT of the boundary image, (D) calculating the smooth component, and (E) subtracting the smooth component from the 2D FFT of the original image to achieve the periodic component. The bottleneck consistently occurs in A and the serial part of the algorithm (A → B → C). The limitation due to A is reduced by removing the so-called "memory wall" using our proposed tile-hopping-based memory mapping. The limitations due to the serial part of the algorithm are reduced by using OPSD rather than PSD. For quantification, the delay for A is 0.62 ms for a 512 × 512 image.

The design flow presented in Figure 4 was followed. Data-flow is clearly shown in a graphical programming environment, making it easier to visualize how a design efficiently fits on an FPGA. Highly efficient implementations of the 1D FFT were used from LabVIEW FPGA for parallel row-by-row operations and by integrating Xilinx LogiCORE for column-by-column operations. Each stage of the design was dynamically tested and benchmarked.

Figure 8: Functional block diagram of the PXIe-based 2D FFT implementation with simultaneous edge artifact removal using optimized periodic plus smooth decomposition. The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and a host PC connected over a high-bandwidth bus. The image is streamed from the PC controller to FPGA 1 and FPGA 2. FPGA 1 calculates the row-by-row 1D FFT followed by the column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back to the host PC. FPGA 2 receives the image, calculates the boundary image, and proceeds to calculate the column-by-column 1D FFT using the shortcut presented in (16), followed by row-by-row 1D FFTs, and the result is sent back to the host PC.

Figure 9: Block diagram of the 2D FFT showing data transfer between external memory and local memory scheduled via a Control Unit (CU).

The image was streamed from the host PC using a Direct Memory Access (DMA) FIFO. 1D FFTs are performed in parallel rows of 8 and stored in the DRAM via local memory in a tiled pattern as explained in the previous section. This is followed by reading several rows to extract a single column, which is Fourier-transformed using Xilinx LogiCORE and is sent back to the host PC. If the image is being streamed directly from an imaging device which scans and provides a random or nonlinear sequence of rows, it is necessary to store a frame of the image in a buffer. This can also be accomplished by streaming the image flow from the host PC or using a smart camera which can delay image delivery by a single frame. The local memory shown in Figure 9 is used to buffer data between external memory and the 1D FFT cores. This memory is divided into read and write components and is implemented using FPGA slices. Block RAM (BRAM) is used for temporary storage of vectors required for calculating the 2D FFT of the boundary image (in the case of FPGA 2). The Control Unit (CU) organizes scheduling for transferring data between local and external memory. The CU is based on LabVIEW's existing memory controller and our memory mapping scheme presented in Section 5.

Step B was accomplished using standard LabVIEW FPGA HLS tools for programming (5) using the graphical programming environment. In step C the 2D FFT of the boundary image needs to be calculated by row and column decomposition. However, as shown mathematically in the previous section, the initial column-wise FFTs can be calculated by computing the 1D FFT of the first (boundary) vector, and the FFTs of the remaining vectors can be computed by appropriate scaling of this vector.

We need the boundary column vector for the 1D FFT calculation of the first and last columns. We also need the boundary row vector for appropriate scaling of v̂ for the 1D FFT of every column between the first and last columns. Row and column vectors of the boundary image are stored in block RAM (BRAM). Figure 8 shows a functional block diagram of the overall 2D FFT with the optimized PSD process. Steps D and E are performed on the host PC to minimize memory clashes and to access the periodic and smooth components of each frame as they become available. 86% of resources are used on FPGA 1 and 41% of resources are used on FPGA 2. The resource utilization is reported according to LabVIEW FPGA synthesis and compilation experiments. It should be noted that part of the reason for the high resource utilization is the use of LabVIEW FPGA high-level synthesis tools as well as Xilinx LogiCORE tools. Using standardized tools makes it harder to optimize for resource utilization. The current implementation was optimized for performance in terms of run time and energy consumption.

Figure 10: Performance evaluation in terms of frames per second for (a) 2D FFTs with the tile-hopping memory pattern and (b) 2D FFTs with edge artifact removal (EAR) using OPSD. The performance evaluation shows the significance of the two proposed optimizations. Both axes are on a log scale.

6.3. Performance Evaluation. The overall performance of the system was evaluated using the setup presented in Figure 9. The data was streamed from the host PC; in certain cases high frame rate videos as well as direct camera input were streamed from the host. All results presented are for 16-bit fixed-point precision. Figure 10(a) presents the effectiveness of our proposed tile-hopping memory mapping scheme. It clearly shows the effectiveness of our proposed memory mapping, since it is closer to the theoretical peak performance. Figure 10(b) presents the overall results comparing PSD and OPSD, demonstrating the effectiveness of our proposed optimization. PSD was also implemented on the same platform, but the optimization presented in Section 4 was not used. This rendered the serial portion of the algorithm the bottleneck, which reduced overall performance. Table 2 shows a comparison of our implementation in contrast to recent 2D FFT FPGA implementation approaches and shows that we achieve better performance even with simultaneous edge artifact removal. Although our implementation is tested with 16-bit fixed-point precision, which limits the accuracy of the transform, the precision may be sufficient for a variety of speed-critical applications where alternative edge artifact removal methods (e.g., filtering) may decrease overall system performance. Although the dynamic range of fixed-point data is smaller than that of floating-point data, which can lead to errors in the 2D FFT, for certain applications it is more important to have an artifact-free transform than a highly accurate transform.

6.4. Energy Evaluation. The overall energy consumption of the custom computing system depends on (1) the power performance of the system components and (2) throughput disparity. Throughput disparity results in idle time for at least one of the components and lowers the overall system throughput. A throughput-optimized system minimizes instances where certain components of the architecture are idle. The proposed optimizations in Sections 4 and 5 clearly reduce throughput disparity and minimize the idle time of the system. Thus, besides causing delays due to significant overhead, standard column-wise DRAM access also contributes to the overall energy consumption. This is not only due to the high count of DRAM row charges but also because of energy consumed by the FPGA in the idle state. Ideally, maximizing the DRAM bandwidth limits the amount of energy consumption. The proposed "tile-hopping" memory mapping scheme improves the DRAM bandwidth as seen in Figure 10 and hence reduces the overall energy consumption. The same is true for the proposed OPSD method, where reduced DRAM access and 1D FFT invocations lead to reduced energy consumption. In this section we analyze the amount of improvement in energy consumption based on the proposed optimizations.

We estimate the DRAM power consumption for both the baseline (standard, strided) and the optimized ("tile-hopping") memory access using the MICRON DRAM power calculator. The energy is calculated in nJ for each read, i.e., energy per read. This is accomplished by calculating the run time for a specific 2D FFT and estimating the amount of energy consumed using the DRAM power calculator. Table 3 depicts the DRAM energy consumption for 2D FFTs before and after the proposed "tile-hopping" optimization for column-wise DRAM access. As mentioned earlier, row-wise DRAM access is fast, and a row buffer worth of data can be accessed with a single row activation. According to Table 3, the energy required for DRAM access is reduced by 42.7%, 48.8%, and 52.9% for 1024 × 1024, 2048 × 2048, and 4096 × 4096 size 2D FFTs, respectively.

Table 2: Comparison of OPSD¹ 2D FFT with regular RCD-based implementations.

Platform                      SEAR²    Precision (bits)    RT³ 512 × 512 (ms)    RT³ 1024 × 1024 (ms)
Kintex 7, 28nm (ours)         Yes      16 (fixed)          1.5                   4.8
Kintex 7, 28nm (ours)         No       16 (fixed)          0.9                   4.1
Kintex 7, 28nm [8]            Yes      16 (fixed)          32.4                  116.7
Stratix IV [5]                No       64 (double)         -                     6.1
Virtex-5-BEE3, 65nm [14]      No       32 (single)         24.9                  102.6
Virtex-E, 180nm [21]          No       16 (fixed)          28.6                  76.9
ASIC, 180nm                   No       32 (single)         21.0                  -

¹ Optimized periodic + smooth decomposition (OPSD). ² Simultaneous edge artifact removal. ³ Runtime (ms).
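As a quick conversion (assuming one frame is processed per run, with no overlap between frames), the runtimes reported for our implementation with EAR in Table 2 correspond to the following frame rates:

```python
# Frame rates implied by the Table 2 runtimes for the OPSD design with EAR,
# assuming one frame processed per run (no pipelining overlap across frames).
runtime_ms = {"512 x 512": 1.5, "1024 x 1024": 4.8}
for size, rt in runtime_ms.items():
    print(f"{size}: {1000.0 / rt:.0f} frames/s")   # ~667 and ~208 frames/s
```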

Table 3: DRAM energy consumption: baseline vs. tile-hopping.

                                           1024 × 1024 (nJ)    2048 × 2048 (nJ)    4096 × 4096 (nJ)
EPR* CW° read (baseline)                   4.46                5.77                7.12
EPR CW read, tile-hopping (proposed)       2.54                2.95                3.36
Reduction (%)                              42.7%               48.8%               52.9%

* Energy per read (EPR). ° Column-wise memory access (CW).

Table 4: 2D FFT + EAR energy consumption: baseline vs. optimized (OPSD + tile-hopping).

                                                     1024 × 1024 (nJ)    2048 × 2048 (nJ)    4096 × 4096 (nJ)
EPP† 2D FFT + EAR (baseline)                         36.92               41.25               48.35
EPP 2D FFT + EAR, tile-hopping + OPSD (proposed)     15.88               17.06               18.11
Improvement                                          2.3×                2.4×                2.8×

† Energy per point (EPP).

The metric used to compare the overall energy optimization achieved for 2D FFTs with EAR is energy per point, i.e., the average energy required to compute the 2D FFT of a single point in an image with simultaneous edge artifact removal. This was achieved by calculating the energy consumed by the Xilinx LogiCORE IP for 1D FFTs, the DRAM, and the edge artifact removal part separately. The estimated energy does not include energy consumed by the PXIe chassis and the host PC. Essentially, the FPGA-based architecture presented here could be used without the host controller. The energy consumption incorporates dynamic as well as static power. The overall energy consumption per point is reduced by 56.9%, 58.6%, and 62% for calculating 1024 × 1024, 2048 × 2048, and 4096 × 4096 size 2D FFTs with EAR, respectively.

7. Application: Filtered Back-Projection for Tomography

In order to further demonstrate the effectiveness of our implementation, we use the created 2D FFT module as an accelerator for reducing the run time of filtered back-projection (FBP). FBP is a fundamental analytical tomographic image reconstruction method. In-depth details regarding the basic FBP algorithm have been left out for brevity but can be found in [28, 29]. The method can be used to reconstruct primitive 3D tomograms from 2D data, which can then be used as a basis for more complex regularization-based methods such as [30-32]. The algorithmic flow is based on the Fourier slice theorem; i.e., 2D Fourier transforms of projections are an angular component of the 3D Fourier transform of the 3D reconstructed volume.

Our 2D FFT accelerator was used to calculate the 2D FFTs of the projections as well as for the initial stages of the 3D FFT, which was then completed on the host PC. Similar to the 2D FFT, the 3D FFT is separable and can be divided into 2D FFTs and 1D FFTs. The results are shown in Table 5. It can be seen that the improvement for smaller size densities is not significant because their FFTs are quite fast on general-purpose CPUs. However, for larger densities the FFT accelerator can give a significant improvement. If the remaining components are also implemented on an FPGA, a significant speed increase can be achieved. Results for a thin slice from a 3D simulated Shepp-Logan [33] phantom are shown in Figure 11. It can be seen that the results from the hardware-accelerated FBP are of slightly lower quality. This is due to the fact that our 2D FFT implementation is less accurate (16-bit, fixed-point) as compared to the CPU-based implementation (FFTW, double-precision floating point). The accelerated FBP was also tested with real Electron Tomography (ET) data.

Table 5: Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR).

3D Density                      CPU (i7) (sec)    FPGA + Host PC (i7) (sec)
128 × 128 × 128                 21.3              19.5
256 × 256 × 256                 47.5              42.4
512 × 512 × 512                 94.8              81.3
1024 × 1024 × 1024              322.3             275.3
2048 × 2048 × 2048              1687.7            1364.4
4096 × 4096 × 4096              16463.1           12599.4

Figure 11: A thin slice of filtered back-projection results obtained by reconstructing a 128 × 128 × 128 Shepp-Logan phantom. The 3D density was reconstructed from 180 equally spaced simulated projections using standard linearly interpolated FBP with a Ram-Lak filter. It can be seen that the results from the FPGA + CPU solution have some errors; this is due to the fact that the 2D FFT of each projection is less accurate. The results are good enough to be used as a basis for further optimization-based refinement methods.
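As background, the sketch below shows the 2D analogue of the FBP flow (ramp filtering of each 1D projection in the Fourier domain, followed by back-projection). It is a toy NumPy/SciPy model rather than the accelerated implementation, and the rotation/angle convention must match how the sinogram was generated; in the actual system the 2D FFTs of the projections and the initial stages of the 3D FFT are the parts handed to the accelerator.

```python
import numpy as np
from scipy.ndimage import rotate

def fbp_slice(sinogram, angles_deg):
    """Minimal parallel-beam FBP: sinogram has shape (num_angles, num_detectors)."""
    n_ang, n_det = sinogram.shape
    # Ram-Lak (ramp) filter applied to every projection in the Fourier domain.
    ramp = np.abs(np.fft.fftfreq(n_det))
    filtered = np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * ramp, axis=1))
    # Back-projection: smear each filtered projection and rotate it into place.
    recon = np.zeros((n_det, n_det))
    for proj, theta in zip(filtered, angles_deg):
        recon += rotate(np.tile(proj, (n_det, 1)), theta, reshape=False, order=1)
    return recon * np.pi / (2 * n_ang)
```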
We demonstrate that “tile-hopping” memory mapping can reduce the DRAM energy consumption by 52.9% and that the proposed optimizations overall lead to 2.8× less energy consumption for the complete 2D FFT with EAR architecture. Our methods were tested through extensive synthesis and benchmarking on a Xilinx Kintex 7 FPGA communicating with a host PC over a high-speed PXIe bus. The system is expandable to several FPGAs and can be adapted to various large-scale computer vision and biomedical applications. Despite decomposing the image into periodic and smooth frequency components, our design requires less run time than traditional FPGA-based 2D DFT implementation approaches and can be used for a variety of highly demanding applications. One such application, filtered back-projection, was accelerated using the proposed implementation, with the largest gains for larger raw tomographic data sets.
Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Japanese Government OIST Subsidy for Operations (Ulf Skoglund) Grant no. 5020S7010020. Faisal Mahmood and Märt Toots were additionally supported by the OIST Ph.D. Fellowship. The authors would like to thank National Instruments Research for their technical support during the design process. The authors would also like to thank Dr. Steven D. Aird for language assistance and Shizuka Kuda for logistical arrangements.

Supplementary Materials

File: 2D FFT-EAR-Demo.mp4: video demonstrating the FPGA output of 2D FFT with edge artifact removal. (Supplementary Materials)

References

[1] R. N. Bracewell, The Fourier Transform and Its Applications, vol. 5, New York, NY, USA, 1965.
[2] Theory and Application of Digital Signal Processing, vol. 1, Prentice-Hall, Inc., Englewood Cliffs, NJ, USA, 1975.
[3] R. A. Brooks and G. Di Chiro, “Theory of image reconstruction in computed tomography,” Radiology, vol. 117, no. 3, pp. 561–572, 1975.
[4] L. Moisan, “Periodic plus smooth image decomposition,” Journal of Mathematical Imaging and Vision, vol. 39, no. 2, pp. 161–179, 2011.
[5] B. Akin, P. A. Milder, F. Franchetti, and J. C. Hoe, “Memory bandwidth efficient two-dimensional fast Fourier transform algorithm and implementation for large problem sizes,” in Proceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2012, pp. 188–191, Canada, May 2012.
[6] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
[7] H. Kee, N. Petersen, J. Kornerup, and S. S. Bhattacharyya, “Systematic generation of FPGA-based FFT implementations,” in Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 1413–1416, USA, April 2008.
[8] F. Mahmood, M. Toots, L.-G. Öfverstedt, and U. Skoglund, “2D discrete Fourier transform with simultaneous edge artifact removal for real-time applications,” in Proceedings of the International Conference on Field Programmable Technology, FPT 2015, pp. 236–239, New Zealand, December 2015.
[9] M. Frigo and S. G. Johnson, “FFTW: an adaptive software architecture for the FFT,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 1381–1384, IEEE, May 1998.
[10] M. Püschel, J. M. F. Moura, J. R. Johnson et al., “SPIRAL: code generation for DSP transforms,” Proceedings of the IEEE, vol. 93, no. 2, pp. 232–273, 2005.
[11] E. Wang, Q. Zhang, B. Shen et al., High-Performance Computing on the Intel Xeon Phi, Springer International Publishing, 2014.
[12] B. Akin, F. Franchetti, and J. C. Hoe, “FFTs with near-optimal memory access through block data layouts,” in Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2014, pp. 3898–3902, Italy, May 2014.
[13] B. Akin, F. Franchetti, and J. C. Hoe, “Understanding the design space of DRAM-optimized hardware FFT accelerators,” in Proceedings of the 25th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2014, pp. 248–255, Switzerland, June 2014.
[14] C.-L. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, “Multidimensional DFT IP generator for FPGA platforms,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 4, pp. 755–764, 2011.
[15] D. He and Q. Sun, “A practical print-scan resilient watermarking scheme,” in Proceedings of the IEEE International Conference on Image Processing 2005, ICIP 2005, pp. 257–260, Italy, September 2005.
[16] D. G. Bailey, Design for Embedded Image Processing on FPGAs, John Wiley & Sons, 2011.
[17] B. G. Batchelor, “Implementing Machine Vision Systems Using FPGAs,” in Machine Vision Handbook, pp. 1103–1136, Springer, London, UK, 2012.
[18] T. Lenart, M. Gustafsson, and V. Öwall, “A hardware acceleration platform for digital holographic imaging,” Journal of Signal Processing Systems, vol. 52, no. 3, pp. 297–311, 2008.
[19] G. H. Loh, “3D-Stacked Memory Architectures for Multi-core Processors,” ACM SIGARCH Computer Architecture News, vol. 36, no. 3, pp. 453–464, 2008.
[20] Z. Qiuling, B. Akin, H. E. Sumbul et al., “A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing,” in Proceedings of the IEEE International 3D Systems Integration Conference, pp. 1–7, October 2013.
[21] I. S. Uzun, A. Amira, and A. Bouridane, “FPGA implementations of fast Fourier transforms for real-time signal and image processing,” pp. 283–296.
[22] T. Dillon, “Two Virtex-II FPGAs deliver fastest, cheapest, best high-performance image processing system,” Xilinx Xcell Journal, vol. 41, pp. 70–73, 2001.
[23] A. Hast, “Robust and invariant phase based local feature matching,” in Proceedings of the 22nd International Conference on Pattern Recognition, ICPR 2014, pp. 809–814, Sweden, August 2014.
[24] B. Galerne, Y. Gousseau, and J.-M. Morel, “Random phase textures: theory and synthesis,” IEEE Transactions on Image Processing, vol. 20, no. 1, pp. 257–267, 2011.
[25] R. Hovden, Y. Jiang, H. L. Xin, and L. F. Kourkoutis, “Periodic Artifact Reduction in Fourier Transforms of Full Field Atomic Resolution Images,” Microscopy and Microanalysis, vol. 21, no. 2, pp. 436–441, 2014.
[26] D. G. Bailey, “The advantages and limitations of high level synthesis for FPGA based image processing,” in Proceedings of the 9th International Conference, pp. 134–139, Seville, Spain, September 2015.
[27] W. Wang, B. Duan, C. Zhang, P. Zhang, and N. Sun, “Accelerating 2D FFT with non-power-of-two problem size on FPGA,” in Proceedings of the 2010 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2010, pp. 208–213, Mexico, December 2010.
[28] F. Natterer, The Mathematics of Computerized Tomography, John Wiley & Sons, 1986.
[29] R. A. Crowther, D. J. DeRosier, and A. Klug, “The Reconstruction of a Three-Dimensional Structure from Projections and its Application to Electron Microscopy,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 317, no. 1530, pp. 319–340, 1970.

[30] U. Skoglund, L.-G. Öfverstedt, R. M. Burnett, and G. Bricogne, “Maximum-entropy three-dimensional reconstruction with deconvolution of the contrast transfer function: A test application with adenovirus,” Journal of Structural Biology, vol. 117, no. 3, pp. 173–188, 1996.
[31] F. Mahmood, N. Shahid, P. Vandergheynst, and U. Skoglund, “Graph-based sinogram denoising for tomographic reconstructions,” in Proceedings of the 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2016, pp. 3961–3964, USA, August 2016.
[32] F. Mahmood, N. Shahid, U. Skoglund, and P. Vandergheynst, “Adaptive Graph-Based Total Variation for Tomographic Reconstructions,” IEEE Signal Processing Letters, vol. 25, no. 5, pp. 700–704, 2018.
[33] L. A. Shepp and B. F. Logan Jr., “The Fourier reconstruction of a head section,” IEEE Transactions on Nuclear Science, vol. 21, no. 3, pp. 21–43, 1974.
