
International Journal of Advanced Robotic Systems

ARTICLE

A High-performance FPGA-based Image Feature Detector and Matcher Based on the FAST and BRIEF Algorithms

Regular Paper

Michał Fularz1*, Marek Kraft1, Adam Schmidt1 and Andrzej Kasiński1

1 Poznan University of Technology, Institute of Control and Information Engineering, Poznan, Wielkopolska, Poland
*Corresponding author(s) E-mail: [email protected]

Received 15 April 2015; Accepted 03 September 2015

DOI: 10.5772/61434

© 2015 Author(s). Licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Image feature detection and matching is a fundamental operation in image processing. As the detected and matched features are used as input data for high-level computer vision algorithms, the matching accuracy directly influences the quality of the results of the whole computer vision system. Moreover, as the algorithms are frequently used as a part of a real-time processing pipeline, the speed at which the input image data are handled is also a concern. The paper proposes an embedded system architecture for feature detection and matching. The architecture implements the FAST feature detector and the BRIEF feature descriptor and is capable of establishing key point correspondences in the input image data stream coming from either an external sensor or memory at a speed of hundreds of frames per second, so that it can cope with most demanding applications. Moreover, the proposed design is highly flexible and configurable, and facilitates the trade-off between the processing speed and programmable logic resource utilization. All the designed hardware blocks use standard, widely adopted hardware interfaces based on the AMBA AXI4 interface protocol and are connected using an underlying direct memory access (DMA) architecture, enabling bottleneck-free inter-component data transfers.

Keywords FPGA, Feature Detection, Feature Matching

1. Introduction

Point correspondences found in sequences of images are the input data for a wide range of computer vision algorithms, including tracking [1, 2], 3D reconstruction [3, 4], image stitching [5], visual odometry [6, 7], video surveillance [8, 9] and simultaneous localization and mapping [10, 11]. As the quality of the input data directly influences the final results produced by the aforementioned algorithms, numerous solutions to the problem of automated image feature extraction and matching have been proposed by the research community. The most important characteristic of a quality feature detector is its repeatability. The feature should be an accurate and stable projection of a 3D point to a 2D image plane regardless of the transformations or distortions introduced by frame-to-frame camera movement. Robustness against varying acquisition parameters, like changes in illumination or noise, is also desirable. Feature matching supplements feature detection by establishing point correspondences across two or more views of the observed scene. The matching is usually

Int J Adv Robot Syst, 2015, 12:141 | doi: 10.5772/61434 1


performed based on some kind of descriptor that captures the distinctive characteristics of the neighbourhood of the feature. The distance between the descriptors is then used as the similarity measure. Regardless of the transformations and distortions that the feature neighbourhood undergoes, the matching method should allow reliable and robust association. As the image feature detectors and descriptors are often used as a part of a real-time image processing pipeline, the processing speed is also an important parameter of this class of algorithms. Although numerous corner detection algorithms have been proposed, none of them offer a ’one size fits all’ solution. The performance depends to a great extent on the processed image content (type of features, contrast, noise type and characteristics in the image, etc.) and the type and magnitude of inter-frame transformations (in-plane rotations, perspective and affine distortions, scaling). Moreover, while some of the algorithms offer high-quality results, their complexity makes them too slow for real-time applications.

2. Goal and Motivation

In this paper, we present a complete solution for the image feature detection and matching problem. The proposed solution is based on the FAST (Features from Accelerated Segment Test) [12] and BRIEF (Binary Robust Independent Elementary Features) [13] algorithms, and is implemented as a standalone coprocessor, based on a field programmable gate array (FPGA), capable of the continuous processing of incoming image stream data. The main advantages of such a hardware-based solution are its high processing speed, relatively low power consumption, compactness and flexibility. In a typical application scenario, the proposed solution is capable of performing image feature detection, description and matching at a speed of over 100 frames per second while consuming less than five Watts of power. It can be used as a standalone system or integrated as a part of a more complex solution, both in PC-centric applications and FPGA system-on-chip (SoC) custom hardware. Such characteristics make it particularly desirable in applications like mobile robotics, vision-based surveillance, driver-assistance systems and high-speed production systems.

The choice of the detection and description algorithms for the implementation was dictated by a few factors. First, the translation of the algorithms into a parallel, pipelined, systolic, resource-efficient architecture comes in a natural way. Second, the algorithms selected for implementation are not designed to be scale and rotation invariant, yet as shown in [14, 15] and [16], the FAST-BRIEF combination has the beneficial properties of good repeatability, accuracy, a high processing speed and a low memory footprint. Third, the implementation is intended for applications with no explicit requirement for robustness to abrupt scale changes or very good rotational invariance, e.g., robot navigation or tracking. Finally, the very high frame rate enabled by the hardware implementation allows for the further alleviation of sensitivity to significant viewpoint changes.

The main contributions of the presented work are:

• the implementation of a complete, compact, parameterizable system for feature detection, description and matching in programmable hardware;

• the use of standard, efficient interfaces based on the AMBA AXI4 protocol and DMA-based on-chip communication mechanisms that allow for seamless integration with a wide range of architectures;

• the selection of a trade-off between the consumption of programmable logic resources and the processing speed, with a number of variants analysed in the paper.

3. Related Work

The first commonly used point image feature detectors were derived from the early work presented in [17] and were based on the analysis of the autocorrelation function of the second-order derivatives of the image [18, 19]. The detectors were usually paired with direct matching methods, e.g., the sum of squared differences, the sum of absolute differences or normalized cross-correlation of feature neighbourhoods. Further research in the field led to the development of detectors that were invariant to rotation, scale and illumination changes [20]. At the same time, feature description and matching have evolved from direct to descriptor-based methods. A prominent example of an algorithm making use of these developments is the SIFT (Scale Invariant Feature Transform) algorithm introduced in [21]. While SIFT is highly regarded for its accuracy, it is also one of the most computationally intensive algorithms in its class. As real-time operation is a crucial requirement for a wide range of computer vision applications, further advancements resulted in the emergence of feature detection methods focusing mainly on processing speed [22, 12], as well as feature detection and description methods aimed at improving the processing speed without sacrificing robustness [23, 24]. In the latter case, the speed-up was the result of using integral images. Another major breakthrough in feature description methods was achieved with the advent of binary descriptors [13]. Binary descriptor matching is done using the Hamming distance, which can be computed very quickly using modern commodity hardware. The binary descriptor concept was widely adopted, and the solution presented in [13] was extended to achieve scale and rotation invariance as shown in [25] and [26].

Feature detection and description operations are usually performed on standard microprocessors or, less frequently, on graphics processing units. Although such hardware platforms are affordable and well established, they are not best suited for power- and size-constrained applications [27, 28]. On the other hand, the embedded and mobile



microprocessors designed for use in such conditions have a limited computational throughput. This has drawn attention to alternative computational platforms, such as programmable hardware.

Early programmable logic implementations were focused mostly on the feature detection process and were based on algorithms like SUSAN (smallest univalue segment assimilating nucleus) [29] and the Harris feature detector [30]. The advent of more complex, computationally intensive methods for feature detection and the introduction of feature descriptors such as SIFT has drawn attention to implementations in dedicated hardware for possible performance gains. The hardware architecture presented in [31] can detect SIFT features on 320×240 pixel images at a speed of 30 frames per second. In [32], a significantly enhanced system for a SIFT feature and descriptor extractor is presented, capable of processing 30 VGA frames per second. Please note that the presented architectures do not perform feature matching. Similarly, the speeded up robust features (SURF) feature detector and descriptor has been a basis for many FPGA implementations. In [33], an implementation of feature detection and descriptor extraction based on SURF is presented. The reported speed was 10 frames per second for 1024×768 resolution images. The approach presented in [34] is a modified version of [33]. The modifications are aimed at reducing resource usage but the functionality is essentially the same, although the processing speed is lower. The architecture was further developed in [35]. The system presented in [36] focuses on the development of a highly accurate feature descriptor for implementation in programmable hardware. The descriptor has the form of a ternary vector, which results in low memory usage and enables fast feature matching.

Designs implemented in programmable hardware are difficult to compare, as the dependence on the device’s internal architecture and on the tool settings affecting the synthesis, placement and routing process has a major impact on their efficiency [37]. Moreover, most of the architectures presented in the literature are based on other algorithms, making direct comparisons especially hard. Nevertheless, we would like to point out some of the previous work that is related to the scope of this article and highlight the characteristic features of the described solution.

The implementation of the FAST algorithm for feature detection presented in [38] was, to the best knowledge of the authors, the pioneering work. Since then, alternative implementations have been presented, e.g., in [39], [40] and [41]. The resource usage and performance (in terms of the processing speed) of these circuits is similar to that proposed in this article, yet some of them lack the non-maximum suppression stage. The lack of this feature is severely limiting, since without it establishing pixel-accurate feature location by finding the local maxima in the corner score function is impossible (see Section 4.1).

The FAST and BRIEF combination for feature detection, description and matching was described in [42]. The peak performance of the circuit presented in the paper is similar to the performance of the solution described in this work using 64 matching cores. The solution was tested in FPGA and finally implemented as a CMOS application-specific integrated circuit (ASIC), and as such it is more power-efficient and can be clocked at a higher frequency (up to 200 MHz). It is worth noting that the costs associated with the development of a dedicated ASIC circuit may be prohibitive in many applications, and it lacks the flexibility and configurability of the design presented in this paper. Furthermore, the design is only capable of storing 4,000 descriptors, as it fully relies on the on-chip memory. The limited number of features and their descriptors stored for matching might be too low, especially when dealing with high-resolution images. Moreover, the feature detection part of the design also lacks the non-maximum suppression stage. An implementation of the SIFT feature detector with the BRIEF feature descriptor and a matcher in programmable logic was described in [43]. It does not compute the feature orientation and scale, thus not providing rotation and scaling invariance. This results in the use of a rather complex feature detector for tasks that could be performed with simpler, more resource-effective hardware blocks like the FAST or Harris feature detector implementations. The authors have noted, however, that they plan to extend their design as a part of future work. The variant of the presented system capable of operation at a similar speed consumes fewer resources. Moreover, the highest performance variant of our architecture achieves three times the processing speed, with even further improvements possible thanks to its scaling capability. An implementation of FAST and BRIEF in programmable hardware is presented in [44]. The design is compact but relatively slow and based on proprietary interfaces. The use of on-chip memory for image data storage severely limits its expandability. Since the article gives away only very general information on the implemented architecture, a detailed comparison is difficult to perform.

4. Description of the Implemented Algorithms

4.1 FAST Feature Detection

The FAST feature detector was proposed by Rosten et al. in [22] and [12]. The detector relies on the analysis of the structural properties of the key point candidates to decide whether or not they are valid, distinctive image features. The detection is based on a 7×7 pixel neighbourhood centred on the candidate point p. In order to decide whether or not the pixel p with an image intensity value of I_p is a feature, the FAST detector performs a so-called segment test on a 16 pixel Bresenham circle surrounding p. The test is passed if n contiguous pixels on the Bresenham circle with the radius r around the pixel p are darker than I_p − t (’dark’ pixels) or brighter than I_p + t (’bright’ pixels), where t is a threshold value. An illustration is given in figure 1.
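The segment test described above can be sketched in software as follows. This is an illustrative model only: the paper's implementation is a parallel hardware pipeline, and the helper names (CIRCLE, segment_test) are ours, not the authors'.

```python
# Offsets of the 16-pixel Bresenham circle of radius 3 (the 7x7 neighbourhood).
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def segment_test(img, x, y, t, n=9):
    """True if n contiguous circle pixels are all brighter than I_p + t
    or all darker than I_p - t (img is a 2D list of intensities)."""
    ip = img[y][x]
    # Classify each circle pixel: +1 bright, -1 dark, 0 similar.
    labels = [1 if img[y + dy][x + dx] > ip + t
              else (-1 if img[y + dy][x + dx] < ip - t else 0)
              for dx, dy in CIRCLE]
    # The circle wraps around, so scan for the run in the doubled sequence.
    run, prev = 0, 0
    for lab in labels + labels:
        run = run + 1 if (lab != 0 and lab == prev) else (1 if lab != 0 else 0)
        prev = lab
        if run >= n:
            return True
    return False
```

The default n = 9 follows the recommendation of the original authors mentioned below; the early-exit ordering and decision-tree optimization of the original algorithm are omitted for clarity.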

Figure 1. FAST feature detector illustration. The values in the middle picture correspond to the brightness of individual pixels. Example pixels passing the segment test in the neighbourhood for a threshold value of 110 are marked with an arc.

The authors of the original solution suggest that the algorithm performs best for n = 9. The way in which the segment test is formulated allows it to achieve a significant increase in speed if the individual ’bright’ and ’dark’ pixel tests are performed in a particular order, as many candidate points may be discarded in the early processing stage. To determine this order, the original FAST detector uses the ID3 algorithm [45] to construct a decision tree performing such a binary classification.

As the segment test returns multiple adjacent positive responses, additional non-maximum suppression is applied to determine the stable, exact coordinates of feature points. As the segment test is a Boolean fail/pass function, the introduction of an additional corner score function is necessary. The function V is defined as the sum of absolute differences between the central point’s intensity and the intensities of pixels 1-16 on the contiguous arc. A corner score is computed for all the positive segment test responses, and those that are local maxima of V are retained as key points. If we denote the ’bright’ pixels by S_bright and the ’dark’ pixels by S_dark, the corner score is given by equation 1:

V = max( Σ_{x ∈ S_bright} ( |I_{p→x} − I_p| − t ), Σ_{x ∈ S_dark} ( |I_p − I_{p→x}| − t ) )    (1)

The concept was further developed in [46] by using a generic decision tree, which does not require learning to adapt to the target environment. The paper also includes an analysis of the influence of the pattern (the Bresenham circle) size on the detection process, as well as the noise sensitivity and the effects of blur.

4.2 BRIEF Feature Description

The BRIEF image feature descriptor proposed in [47] and extended in [13] uses binary vectors for the computation of the similarity measure based on the Hamming distance. Such a similarity measure can be efficiently calculated using modern microprocessors, making it an attractive alternative to the commonly used L1 or L2 norms.

As the descriptor is highly sensitive to noise, the source image is first smoothed with an averaging filter. Each bit in the binary descriptor vector represents the result of a comparison between the intensity values of two points inside an image region centred on the feature to be described. The bit corresponding to a given point pair is set to ’1’ if the intensity value of the first point of the pair is higher than the intensity value of the second point, and to ’0’ otherwise – for an illustration, see figure 2.

The authors of the original proposal evaluated several sampling strategies for the selection of point pairs. The results of the experiments led to the conclusion that sampling according to the Gaussian distribution centred on the described feature point results in the best performance. The initial smoothing was performed using a block averaging filter defined over a 9×9 mask. The descriptor length and the image patch size in which the descriptor is computed can be changed and adapted to the requirements of the application. The use of a 512 bit descriptor is suggested, but the 256 bit version performs equally well for small camera displacements and only marginally worse in other cases [47]. An example sampling pattern for a 33×33 window with 256 binary tests is presented in figure 3. The pairs of points are connected with blue lines.

As the tests performed for the computation of the descriptor are binary, BRIEF is inherently robust to illumination changes but not to rotation and scaling. However, as shown in [13], generating multiple versions of feature templates for matching may help to overcome this issue.

Figure 2. Formation of the BRIEF binary descriptor
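The descriptor construction and Hamming-distance matching described above can be sketched in software as below. This is a minimal model under stated assumptions: the names (PAIRS, brief_descriptor, hamming) are ours, the 9×9 smoothing step is omitted for brevity, and the Gaussian sampling pattern is generated with an arbitrary seed rather than taken from the paper.

```python
import random

PATCH, BITS = 33, 256                # patch size and descriptor length from the text
rng = random.Random(0)               # fixed seed: the same pattern must be reused
                                     # for every described feature

def clamp(v):                        # keep sampled coordinates inside the patch
    return max(-(PATCH // 2), min(PATCH // 2, int(round(v))))

# Point pairs drawn from an isotropic Gaussian centred on the feature point.
PAIRS = [((clamp(rng.gauss(0, PATCH / 5)), clamp(rng.gauss(0, PATCH / 5))),
          (clamp(rng.gauss(0, PATCH / 5)), clamp(rng.gauss(0, PATCH / 5))))
         for _ in range(BITS)]

def brief_descriptor(smoothed, x, y):
    """Build the 256-bit descriptor as an int: bit i is 1 when the first
    point of pair i is brighter than the second (smoothed is a 2D list)."""
    d = 0
    for i, ((x1, y1), (x2, y2)) in enumerate(PAIRS):
        if smoothed[y + y1][x + x1] > smoothed[y + y2][x + x2]:
            d |= 1 << i
    return d

def hamming(d1, d2):
    """Similarity measure: the number of differing descriptor bits."""
    return bin(d1 ^ d2).count("1")
```

Matching then simply selects, for each descriptor in one view, the candidate in the other view with the lowest Hamming distance; in hardware this reduces to an XOR followed by a population count.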

Figure 3. Exemplary 256 point pairs for the BRIEF feature descriptor

5. The Implemented Architecture

5.1 Outline of the Architecture

Image processing algorithms can be divided into different groups based on the type of data they work on, like pixel or region processing, or methods that analyse the meta-data extracted from images. To facilitate the implementation of such a broad range of algorithms, the processing platform has to be flexible and easy to reprogram as new methods are invented and old ones are refined. The most common and natural approach is the use of standard microprocessors running computer vision software. However, computer vision methods usually require high computational power due to the vast amounts of data to be processed. At the same time, they have to be power-efficient enough to be used in power-constrained applications (e.g., robotics or smart cameras). These contradictory requirements can be eased by using a heterogeneous processing platform, like the Xilinx Zynq. These devices contain FPGA logic alongside a relatively high-performance processor-based subsystem. Such a solution enables the partitioning of image processing tasks between hardware, hand-tailored coprocessors implemented in the FPGA fabric and software-based implementations that run on general purpose Cortex-A9 cores.

The system described in this paper was implemented in a Xilinx Zynq-7000 All Programmable SoC device. The block diagram of the whole solution is given in figure 4. The processor system houses two ARM Cortex-A9 cores along with an external DRAM memory controller and communication infrastructure for interfacing with the programmable logic part. The processors are used for controlling the flow of data, visualization and communication with the host computer. The programmable logic part contains a dedicated FAST detector, a BRIEF descriptor and matching coprocessors. They are connected to the memory and processor subsystem through the AXI4-Stream and AXI4-Lite interfaces, and use DMA and specially designed controller cores. This leaves the CPUs free for high-level processing algorithms operating on the matched feature set. Moreover, the use of streaming interfaces for the coprocessors’ inputs enables the use of a wide range of image sources, such as cameras, HDMI, memory, etc.

Figure 4. Block diagram of the implemented system

5.2 Universal Controller for Streaming Processors

The process of converting the software implementation of an image processing algorithm into a hardware coprocessor can be difficult and time consuming. Integrating the coprocessor with the system (e.g., external memory) and other coprocessors is another problem that has to be dealt with. To simplify system integration, the universal controller for streaming processors IP core was created. Its block diagram is given in figure 5.

This is a utility core that allows connecting streaming processors compliant with the AXI4-Stream standard to the DMA engines and controlling them by writing to their internal registers using the AXI4 system bus. It can serve as a simple data feeder that adjusts communication interface data widths, handles different operating frequencies and buffers the data. It also offers some additional, subtle advantages, like the ability to fully flush the streaming core pipeline or inject generated values into it. Another feature of the core is built-in error handling and signalling (e.g., out of the defined range, buffer over/underflow). The parameter configuration and event handling is done using a register-based interface.

processed region by one pixel. Writing the complete image,
pixel by pixel, causes the window to slide over the whole
image in a single sweep. The dual-port memory blocks can
perform a read and write operation independently in a
single clock cycle. Although the storage capacity of these
memory blocks is relatively small, they are fast, tightly
coupled with the programmable logic, and can be flexibly
arranged and interconnected to form configurations with
different buffer depths and data/address bit widths [48].
Figure 5. The block diagram of the universal controller for streaming processors

5.3 Ensuring Parallel Access to Image Data for Neighbourhood Operations

Contemporary programmable logic devices contain a pool of dual-port memory blocks for local data storage and buffering. The memory blocks can be used together with register banks to provide simultaneous access to the pixel neighbourhood in the currently processed image. Such an input block does not use the external RAM memory – which is a major advantage – as communication with external RAM can be a bottleneck in data-intensive applications. A general block diagram of the input block is given in figure 6.

Figure 6. The input block consisting of dual-port RAM memories and flip-flops used to organize data in a way allowing for simultaneous access to all pixels in the currently processed region

The input block operates under the assumption that the source of the image data feeds the pixels in progressive scan mode, which is common in contemporary imagers, digital video receiver integrated circuits, etc. The data stream may be continuous. On every clock cycle, one portion of the input data – a pixel intensity or a colour value – enters the circuit, and the output values of the pixels forming a rectangular subregion of the image are available at the outputs. Such an organization of data is particularly advantageous for image processing algorithms that are local. 'Local' means, here, that for a region subjected to processing, the result of the computations depends only on the pixel intensity values within this region. The memory access pattern is structured and predictable. Every pixel write operation on the input causes the sliding of the whole processed region by one pixel, so writing the complete image, pixel by pixel, causes the window to slide over the whole image in a single sweep.

The main part of the described circuit is an arrangement of dual-port memory blocks, with each of them performing the function of a first in, first out (FIFO) buffer. The depth (capacity) of each FIFO is equal to the horizontal resolution of the processed image. The width of the data bus depends on the image colour palette (binary, greyscale, colour, etc.) and the colour depth; it is equal to the number of bits required to represent the complete information of a single pixel. The data flow is controlled by a simple state machine consisting of a counter used for read and write address generation. The outputs of the FIFO memory blocks are connected to a register file composed of programmable logic flip-flops, which facilitates independent, simultaneous access to all the pixel values. Assuming that L is the number of memory blocks required to implement a single delay line for the given horizontal image resolution and pixel bit width, constructing an architecture giving access to a region of size M × N (width × height) requires (N − 1)L memory blocks. If B is the number of bits required to store the complete information of one pixel, B · M · N flip-flops are required to implement the accompanying register file.

The circuit introduces a pipeline delay dependent on the size of the image region it generates. The delay is the time (measured in clock cycles) that elapses between writing the pixel data to the circuit and its appearance on the central element of the register file (see figure 6). If the depth of a single FIFO in the circuit is denoted by D and the size of the region implemented with the register file is M × N (width × height), the delay δ can be computed using equation 2:

δ = ceil(N/2) · D + ceil(M/2)    (2)

Properly arranged data are fed to the input of the coprocessor, which executes the intended algorithm. If the processor is constructed in a way that enables pipelined, systolic stream processing, the resulting processing speed is very high, especially if its architecture takes full advantage of the parallel processing capabilities of programmable logic.

Int J Adv Robot Syst, 2015, 12:141 | doi: 10.5772/61434

5.4 Feature Detection and Description Coprocessor

The block diagram of a complete system performing the detection of features using the FAST algorithm and the description and matching using the BRIEF algorithm is given in figure 7. A circuit allowing for the formation of a 7 × 7 pixel neighbourhood (as described in section 5.3) is connected directly to the input of the system. At this point, the processing is split into two independent data paths – one for feature detection and one for feature description.

Figure 7. Block diagram of the implemented feature detection and description coprocessor (FAST with BRIEF). The numbers in square brackets denote the bit-width of the data path (e.g., [7:0] for an 8-bit connection).
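The sizing and latency rules of section 5.3 can be stated as two small helper functions. This is an illustrative Python restatement of equation 2 and the resource counts, using the paper's definitions of M, N, D, B and L; it is not part of the published VHDL design.

```python
import math

def window_buffer_cost(M, N, B, L):
    """Resource estimate for an M x N (width x height) window generator:
    (N - 1) * L block RAMs for the row delay lines and B * M * N
    flip-flops for the register file.  B is the number of bits per pixel,
    L the number of block RAMs needed for one delay line at the given
    image width and pixel width."""
    return (N - 1) * L, B * M * N

def pipeline_delay(M, N, D):
    """Equation 2: clock cycles between writing a pixel and its appearance
    at the central element of the M x N register file, where D is the FIFO
    depth (the horizontal image resolution)."""
    return math.ceil(N / 2) * D + math.ceil(M / 2)

# A 7 x 7 window (as used by the FAST data path) over a 640-pixel-wide,
# 8-bit image, assuming one block RAM suffices per delay line:
brams, ffs = window_buffer_cost(M=7, N=7, B=8, L=1)   # -> (6, 392)
delay = pipeline_delay(M=7, N=7, D=640)               # -> 2564 cycles
```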

Michał Fularz, Marek Kraft, Adam Schmidt and Andrzej Kasiński: A High-performance FPGA-based Image Feature Detector and Matcher Based on the FAST and BRIEF Algorithms

Figure 8. Schematic diagram of the bright/dark classifier block

Due to the specific nature of the FPGA hardware architecture, the implemented version of the FAST algorithm, although returning the same results, differs significantly from the original proposal in [22] and [12]. The original solution uses a decision tree to determine whether the pixels on the Bresenham circle fall into the 'dark' or 'bright' category. Such a concept was proposed with sequential data processors in mind. The decision tree algorithm runs relatively efficiently on such hardware and allows the candidate point to be discarded as a non-feature (one not satisfying the 'dark' or 'bright' criteria) early on, at the cost of increased memory consumption. As the presented design is based on a different computational platform, its principle

of operation is also different. In place of a decision tree, an exhaustive search is performed for all the image pixels. Because the programmable logic facilitates the implementation of architectures that can perform computations in parallel, such an approach has no negative effect on the overall performance. Despite using a brute-force approach for the 'bright' and 'dark' pixel classifications and the corner score computation, the coprocessor consumes only a small portion of the available programmable logic resources.

After the pixels have been organized into a 7 × 7 neighbourhood, providing access to the central pixel (the centre signal in figure 7) and the Bresenham circle (pixel_01 to pixel_16 in figure 7) used by FAST, the intensity values of these 17 specific pixels are passed to the inputs of the bright/dark classifier block. A schematic diagram of this functional block of the circuit is shown in figure 8.

The block consists of two independent data paths comprising a set of subtracters, comparators and multiplexers. The data paths are used to classify the 16 pixels on the Bresenham circle as either 'bright' or 'dark', respectively. The input data are the centre signal and the pixel_XX signals, corresponding to the central pixel intensity value and the 16 Bresenham circle pixel intensity values, and the threshold value corresponding to the FAST detector threshold. The 'bright' data path begins with the computation of the differences between the intensity values of the pixels on the Bresenham circle and the intensity value of the central pixel. This is done using 16 subtracters. In the next step, the threshold value is deducted from the 16 results using another set of 16 subtracters. The 'dark' classification starts with the computation of the differences between the intensity value of the central pixel and the intensity values of the pixels on the Bresenham circle. Again, 16 subtracters are used to perform these operations. In the following step, the threshold value is subtracted from the results obtained in the previous stage. The results produced in both data paths are then fed into comparators. If the results are greater than zero, they are passed to the next processing stages to contribute to the corner score computation process as a sc_part_XX_YY signal, where XX is a number within the range 1-16 (one for each pixel on the Bresenham circle) and YY is br for the 'bright' pixels or dk for the 'dark' pixels. This is achieved using multiplexers controlled by the result of the comparison, as shown in figure 8. Furthermore, the results returned by the comparators set the corresponding bit in the is_bright or is_dark vector to confirm the meeting of the 'bright' or 'dark' test criteria, respectively. If the resulting value is negative or equal to zero, its value is set to zero and the corresponding bit in the is_bright or is_dark vector is reset.

The final outputs of the classifier block are, therefore, two sets of 16 8-bit values used for the computation of the corner score according to equation 1, and two 16-bit binary vectors holding the results of the classification of the pixels on the Bresenham circle as either 'dark' or 'bright', used in the next stages to perform the segment test.

These outputs are connected to two other functional blocks. The first block is used for the corner score computation. It consists of two pipelined adder trees. Each adder tree has 16 inputs for the corner score components from either the 'bright' or the 'dark' group of pixels. The adders produce two values – the sums of the components for both of the aforementioned groups – and the greater of those values is passed as the final value of the corner score function as per equation 1. The second module performs the segment test for both the 'dark' and 'bright' pixels, looking for at least nine consecutive logical '1's in the contents of the 16-bit is_bright and is_dark vectors. The block consists of two groups of 16 nine-input AND gates. If any of the two vectors contains a train of at least nine consecutive logical '1's, the result of the segment test is recognized as positive.

The resulting corner score values score and the segment test results is_corner are passed to the circuit composed of FIFO memories and a register bank to form a 7 × 7 neighbourhood, arranging them for the subsequent non-maximum suppression. The arrangement allows all 48 comparison operations to be performed in parallel, in a single step. The points coincident with the local maximum of the corner score function in a 7 × 7 neighbourhood which additionally satisfy the segment test are the final resultant features.

The detector part has been enhanced with two additional counters which keep track of the numerical image coordinates of the detected features. Such functionality is necessary, as the detector data path only puts a mark whenever it detects a feature. The first counter reflects the image column number, while the second stands for the image row number. As the raw data from the counters are out of sync with the output of the feature detector, additional delay lines have been added so that the detected features are in sync with the information on their respective coordinates. The coordinates are passed to the coordinate's intermediate FIFO buffer.

The data path for feature description begins with the averaging filter block. The original implementation of BRIEF uses an averaging filter with a square-shaped mask [47]. The motivation behind it was the fact that the response of such a filter can be computed relatively quickly on a standard PC using integral images [49]. As the hardware platform used to implement the described system is capable of massively parallel operations, a decision was made to use an averaging filter with a circular mask instead. Such a filter has better characteristics than one using a square window, as its response is isotropic. The mask was defined on a 7 × 7 image patch, and its shape is depicted in figure 9. The input data are taken from the same 7 × 7 image patch as the input data for the FAST data path.

Figure 9. The shape of the averaging filter mask implemented in the system

The filtered image data are then passed further to yet another FIFO and register bank circuit, so that a complete 33 × 33 pixel window is available for instantaneous processing. The sampling pattern used by the described architecture is the same as that given in figure 3. As the target descriptor length is 256, 256 parallel comparators are used to compute the complete descriptor in a single clock cycle. The resulting binary vector is stored in a 256-bit register. The block diagram of the module for the BRIEF descriptor computation is given in figure 10. The output of the register holding the computed descriptor is connected to the descriptor's intermediate FIFO.

Figure 10. The block diagram of the module for the BRIEF descriptor's computation

The system output consists of two independent intermediate FIFOs – one holds the feature coordinates and one is dedicated to the descriptor storage. Such a solution was adopted because there is a significant difference between the delays of the detector and descriptor data paths. The implementation of a spacious buffer for the coordinate storage is still less resource-consuming than a solution in which the coordinate data are delayed alongside the data path for descriptor computation. The signal indicating the detection of a feature is used as the write-enable signal for both of the intermediate FIFOs. The signal is distributed through digital delay lines, which ensure the necessary synchronization – the coordinates and the descriptor written into the intermediate FIFOs are properly aligned. As the indicator signal is only a single bit, it can be delayed with a minimum FPGA resource cost.
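The descriptor data path described above can be sketched in software. In the Python model below, the point-pair list is a random, hypothetical stand-in for the fixed sampling pattern of figure 3 (which is not reproduced here), and the 256 comparisons are computed sequentially, whereas the hardware evaluates all of them in parallel in a single clock cycle.

```python
import random

def brief_descriptor(patch, pairs):
    """Software sketch of the descriptor module: one binary intensity
    comparison per point pair, packed into a 256-bit vector."""
    bits = 0
    for i, ((r1, c1), (r2, c2)) in enumerate(pairs):
        if patch[r1][c1] < patch[r2][c2]:
            bits |= 1 << i
    return bits  # 256-bit descriptor packed into a Python int

# Reproducible stand-in for the fixed sampling pattern, defined inside
# the 33 x 33 window of smoothed pixels:
rng = random.Random(0)
pairs = [((rng.randrange(33), rng.randrange(33)),
          (rng.randrange(33), rng.randrange(33))) for _ in range(256)]

# A synthetic smoothed 33 x 33 patch standing in for real image data:
patch = [[(r * 33 + c) % 251 for c in range(33)] for r in range(33)]
desc = brief_descriptor(patch, pairs)
```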
At the receiver side, the presence of data is signalled by the descriptor's intermediate FIFO. As this FIFO is located in the data path with the longer delay, the corresponding coordinates can be read from the coordinate's intermediate FIFO without the risk of buffer underflow.

As the intermediate buffers are composed of internal dual-port memory blocks, their inputs and outputs are capable of working independently. This applies both to the clocks and to the data/address lines. Such a solution enables the safe crossing of clock domains and the flexible adaptation of the circuits to different processing speeds. The ability to
configure the ports on the receiver side gives the possibility of trouble-free operations with a range of receivers.

The coprocessor takes full advantage of spatial and temporal parallelism. The architecture is systolic and capable of operating on a continuous stream of data, resulting in very high performance. Additional output buffering further increases the reliability of communication.

5.5 Feature Matching Coprocessor

The feature matching coprocessor is a streaming core attached to an instance of the universal controller described in Section 5.2. The coprocessor consists of the common, shared control logic part and the matching cores. The matching cores are capable of independent parallel operation, so increasing their number boosts performance. The number of matching cores is limited by the capacity of the target device. The tests of the described system were performed with a maximum of 32 such cores. The block diagram of the coprocessor is given in figure 11. The input data consist of:

• N pattern descriptors, where N indicates the number of matching cores;
• the compare vector, namely the vector of M descriptors to be compared with the stored pattern descriptors;
• the number of descriptors in the compare vector, M.

Figure 11. The block diagram of the matching coprocessor

The operation of the coprocessor begins with receiving N pattern descriptors and storing them in the matching cores. The number of descriptors in the compare vector, M, is received next. Afterwards, each matching core compares its pattern vector with each of the M descriptors in the incoming compare vector by computing their respective Hamming distances. The index and the distance for the best correspondence are updated whenever a better matching descriptor is found. The operation is finished after M such tests are performed, and each matching core holds the index and the Hamming distance of the descriptor from the compare vector that was the closest match to its stored pattern vector. The resulting data – feature correspondences – can be read from the coprocessor, and the operation can be repeated until all mutual matches are found. The detailed structural diagram presenting the feature matching coprocessor is given in figure 12.

The dotted line in the diagram separates the common part and the matching part that is duplicated depending on the number of matching cores. Each matching core consists of two parts – the first is responsible for calculating the Hamming distance between the two descriptors, while the other performs the comparisons and updates the information on the best match index and its Hamming distance from the stored pattern based on the comparison results.

To achieve a high processing speed, extensive pipelining is employed. The distance is computed using 256 XOR gates, and the number of bits set in the resulting 256-bit register is computed using a pipelined adder tree. The implemented counter keeps track of the index of the currently processed element from the compare vector. To equalize the delays resulting from the Hamming distance computation pipeline, the output of the counter is delayed by nine clock cycles using a FIFO. The comparator keeps track of whether or not all the elements from the compare vector were processed and sets a flag if that is the case.

6. Results and Discussion

All of the described modules and systems were implemented using the low-cost ZedBoard evaluation board. It hosts the Xilinx Zynq-7000 SoC (XC7Z020-CLG484-1), 512 MB of DDR3 RAM, a Gigabit Ethernet port, a USB host, an HDMI output and an FMC connector. All the systems used the 100 MHz main system clock.

The presented accelerators were created in VHDL, tested and then synthesized and implemented in the Zynq device as IP cores. The AXI universal controller for the streaming processors (see Section 5.2) was used for the connection with the processor system and memory (see figure 5). For the final verification, sets of images (an example pair is shown in figure 13) were downloaded to the DDR RAM memory. For processing, they were sent through the DMA engine to the detection and description accelerator. The result of this operation – a set of described features from two images – was returned and written back to the DDR RAM via DMA. In the next step, they were read via DMA by the feature matching coprocessor, and the resulting feature correspondences were written back to the memory. The results generated by the implemented hardware were compared with the results of a reference software implementation based on the OpenCV library. As expected, the results were identical.

Figure 12. The detailed internal structure of the matching coprocessor
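The behaviour of a single matching core can be modelled in a few lines of Python. This sketch mirrors the XOR/popcount/compare structure described in section 5.5, but it runs sequentially and uses toy 4-bit values in place of 256-bit descriptors.

```python
def popcount(x):
    """Number of set bits; computed by a pipelined adder tree in hardware."""
    return bin(x).count("1")

def matching_core(pattern, compare_vector):
    """Sketch of one matching core: XOR the stored pattern descriptor with
    each incoming descriptor, count the set bits (Hamming distance) and
    keep the index and distance of the best match seen so far."""
    best_index, best_distance = 0, 257   # any real distance is <= 256
    for index, descriptor in enumerate(compare_vector):
        distance = popcount(pattern ^ descriptor)
        if distance < best_distance:     # update only on a better match
            best_index, best_distance = index, distance
    return best_index, best_distance

pattern = 0b1011                         # toy 4-bit "descriptors"
compare = [0b0100, 0b1001, 0b1111]       # distances: 4, 1, 1
result = matching_core(pattern, compare) # first best match is kept
```

In the hardware, N such cores run in parallel, each holding one pattern descriptor while the compare vector is streamed past all of them.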

both parts are similar. As the detection and description time


are constant for a given image resolution and system clock,
only the matching procedure processing speed can be fine-
tuned. This can be achieved by changing the number of
matching cores to the matching coprocessor. However, as
the number of matching cores increases, so does the
programmable logic resource utilization. Based on the type
of performed task and the performance required, the
Figure 13. Images used for verification with matches drawn maximum number of features can be estimated and an
appropriate number of matching cores can be selected.

Image resolution fps


no. of matching cores fps
640x480 325
4 308
1280x720 109
8 607
1920x1080 48
16 1187
Table 1. Number of frames that the FAST detector/BRIEF descriptor
accelerator can process in one second for different image resolutions 32 2289

64 4349
In table 2, the number of frames per second that can be
Table 2. Number of frames that each matching accelerator can process in
processed by the detection-description coprocessor clocked one second for a different number of matching cores and 512 features
at 100 MHz is given. Tests for VGA, HD and FullHD
resolution images were performed. As the detection and In table 2, the relationship between the number of matching
matching coprocessor is a fully systolic functional block cores and the processing speed is given. The same data are
employing a very long pipeline, the computation time does also plotted in figure 14. As shown in the figure, increasing
not change with the number of features and is linearly the number of matching cores results in a linear increase of
dependent on the number of pixels to process. the number of frames per second that can be processed,
The matching procedure is performed in parallel with the assuming that the number of features remains constant.
detection and description procedure – the features from the previous image are matched while the detection and description are performed on the next incoming image. Such a solution works best when the processing times of

Michał Fularz, Marek Kraft, Adam Schmidt and Andrzej Kasiński: 11

A High-performance FPGA-based Image Feature Detector and Matcher Based on the FAST and BRIEF Algorithms

However, it should be noted that the speed of the matching procedure is affected by the number of features. The relationship is illustrated by table 3 and figure 15. For reference, the processing speed of a software implementation presented in [13] is also given in the table.

| no. of features | HW fps | SW fps (w/o POPCNT) | SW fps (w/ POPCNT) | speed-up HW/SW (w/o POPCNT) | speed-up HW/SW (w/ POPCNT) |
|---|---|---|---|---|---|
| 512 | 2289 | 90 | 145 | 25.4 | 15.8 |
| 1024 | 654 | 31 | 65 | 21.1 | 10.1 |
| 2048 | 175 | 10 | 27 | 17.5 | 6.5 |
| 4096 | 44 | 2.7 | 10 | 16.3 | 4.4 |
| 8192 | 11 | 0.7 | 3.2 | 15.7 | 3.4 |

Table 3. Number of frames that the description and matching accelerator can process in one second for different numbers of features and 32 matching cores, compared to a pure software implementation (tested on a Core i7)

For both the software and the hardware implementation, the processing speed of the matching is inversely proportional to the number of features squared. However, in the case of the presented solution the feature description is performed independently in parallel, while the software approach is sequential by nature. The last two columns of table 3 present the increase in speed achieved using the hardware approach with regard to the software approach. The hardware accelerator with 32 matching cores is approximately 4-16 times faster than the software implementation on processors with a specialized bit-counting instruction (POPCNT on the x86 Intel Core processor family) and 15-25 times faster on those without.

It should be noted that the presented matching coprocessor achieves a relatively high processing speed even with a low number of matching cores. This makes it useful in systems with more than one video input. In such a case, two or more instances of the detector-descriptor modules serve as input data sources and a high-performance matching accelerator can be shared between multiple video inputs. Such an approach is advantageous, as it allows us to reduce the load on the DMA and the use of programmable logic resources, as only one read/write channel is needed. Since the feature detection and matching operation is usually followed by other processing stages, completing this task as quickly as possible might leave more time for high-level processing. If the performance requirements are not critical, the clock frequency can be reduced to lower the power consumption.

Figure 15. Achievable frames per second based on the number of features for a 32-core matching coprocessor

The resource usage for each module and the whole system with different numbers of matching cores is given in table 4. For easier analysis, the table contains the summary for a few variants of the design as well as the summary for its main functional blocks (the indented items). The summary for the detection and description coprocessor encompasses its own DMA module, a width converter and an interconnect. Similarly, the summary for the matching coprocessor also includes the DMA module, a data-width converter and a memory interconnect. Additional parts of the system, including the reset module and the AXI4-Lite bus, were summed up as the rest of the system. It should be noted that the presented solution is scalable and can use as many resources as are available on the target hardware platform. In this article, systems with 4-32 matching cores were implemented.
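The core of the matching step, in both the hardware and the software baseline, is a brute-force search for the nearest descriptor under the Hamming distance, which on a CPU reduces to an XOR followed by a population count (the role played by the POPCNT instruction mentioned above). The sketch below is an illustration only, not the authors' HDL design; toy 8-bit values stand in for the 256-bit BRIEF descriptors:

```python
# Illustrative software model of brute-force binary descriptor matching
# (not the authors' hardware implementation).

def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary descriptors stored as ints:
    popcount of their XOR."""
    return bin(a ^ b).count("1")

def match(descs_prev, descs_curr):
    """For each descriptor of the previous frame, return the index of the
    closest descriptor in the current frame. Every descriptor is compared
    against every other, so the work grows with the square of the feature
    count -- which is why fps drops quadratically in table 3."""
    matches = []
    for d in descs_prev:
        best = min(range(len(descs_curr)),
                   key=lambda j: hamming(d, descs_curr[j]))
        matches.append(best)
    return matches

if __name__ == "__main__":
    prev = [0b10110010, 0b01001101]
    curr = [0b01001100, 0b10110011]
    print(match(prev, curr))  # -> [1, 0]
```

The hardware accelerator parallelizes the inner search across its matching cores, while a CPU executes the comparisons sequentially, with or without a dedicated popcount instruction.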
Figure 14. Achievable frames per second based on the number of matching cores for 512 features

For all the configurations, the detection and description coprocessor with the DMA and interconnect uses the same number of resources, namely 9,866 (19%) LUTs, 17,412 (16%) FFs and 38 (27%) BRAMs. In the case of the smallest system with only four matching cores, it was roughly three times the number of resources used by the feature matching coprocessor (considering LUTs and FFs). It is the other way around in the case of the biggest system with 32 matching cores, which is over twice the size of the rest of the elements of the system.

The performance of the system can be further improved by using the scatter-gather engine in the DMA module. This would allow it to avoid the costly operation of packet

12 Int J Adv Robot Syst, 2015, 12:141 | doi: 10.5772/61434


| Module | LUTs | FFs | BRAMs |
|---|---|---|---|
| whole system (4 matching cores) | 13206 (25) | 22488 (21) | 46.5 (33) |
| whole system (8 matching cores) | 14723 (28) | 26616 (25) | 46.5 (33) |
| whole system (16 matching cores) | 18729 (35) | 34869 (33) | 46.5 (33) |
| whole system (32 matching cores) | 26926 (50) | 51381 (48) | 46.5 (33) |
|   matcher, 4 cores | 2927 (6) | 5076 (5) | 8.5 (6) |
|   matcher, 8 cores | 4987 (9) | 9204 (9) | 8.5 (6) |
|   matcher, 16 cores | 9001 (17) | 17457 (16) | 8.5 (6) |
|   matcher, 32 cores | 17205 (32) | 33969 (32) | 8.5 (6) |
|   DMA and memory interconnection for matcher | 2771 (5) | 3800 (4) | 3.5 (3) |
|   det. and desc. | 4118 (8) | 9543 (9) | 31 (22) |
|   DMA and memory interconnection for det. and desc. | 2378 (5) | 3321 (3) | 3.5 (3) |
|   rest of the system | 600 (1) | 748 (1) | 0 (0) |
| XC7Z020 (total available) | 53200 (100) | 106400 (100) | 140 (100) |

Table 4. Resource usage of each of the parts of the implemented design and the whole system (designations: LUTs - lookup tables, FFs - flip-flops, BRAMs - block RAM memory blocks). The values in parentheses are percentages of all corresponding resources available in the XC7Z020 device.
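The near-linear growth visible in table 4 can be turned into a rough per-core cost estimate. The sketch below is a back-of-the-envelope fit over the 4- and 32-core data points, made here for illustration; the per-core figures are not values reported in the paper:

```python
# Rough linear model of matcher resource usage derived from Table 4:
# resources(cores) ~= base + cores * per_core.

LUTS = {4: 2927, 8: 4987, 16: 9001, 32: 17205}   # matcher LUTs (Table 4)
FFS  = {4: 5076, 8: 9204, 16: 17457, 32: 33969}  # matcher FFs (Table 4)

def per_core_cost(table):
    """Incremental cost per matching core, estimated between the
    4-core and 32-core design points."""
    return (table[32] - table[4]) / (32 - 4)

print(round(per_core_cost(LUTS)))  # -> 510 (LUTs per additional core)
print(round(per_core_cost(FFS)))   # -> 1032 (FFs per additional core)
```

Such an estimate helps predict how many matching cores would fit in a larger device before actually running synthesis.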

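The DMA traffic of frame-to-frame matching follows directly from the batch size: the text gives the number of memory transfer operations as number_of_vectors / number_of_matching_cores, presumably because each transfer feeds one reference descriptor to each core. A small sketch of that relation (the rounding up and the example feature counts are assumptions of this illustration):

```python
# Memory-transfer count for frame-to-frame matching, following the
# relation given in the text: number_of_vectors / number_of_matching_cores
# (rounded up here for counts that are not a multiple of the core count).
from math import ceil

def transfer_ops(number_of_vectors: int, number_of_matching_cores: int) -> int:
    return ceil(number_of_vectors / number_of_matching_cores)

# With 512 features, a 4-core matcher needs 128 transfer batches while a
# 32-core matcher needs only 16: smaller systems pay more in DMA traffic.
for cores in (4, 8, 16, 32):
    print(cores, transfer_ops(512, cores))
```

This is the overhead that a scatter-gather DMA engine would help amortize, at the price of a larger DMA module.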
ordering for memory transfers. It would be especially beneficial in the case of the smaller systems, with fewer matching cores. To perform frame-to-frame feature matching, they need to perform more memory transfer operations, equal to number_of_vectors / number_of_matching_cores. This, however, would come at the cost of increasing the complexity of the DMA engine and interconnect. This in turn would result in the DMA module consuming twice as many programmable logic resources. It is also possible to use a more sophisticated and larger FPGA, like the Xilinx Zynq Z-7030, Z-7045 or Z-7100. This would allow an even faster solution to be created. Thanks to their increased capacity, these devices could hold more matching cores and also operate at a higher clock frequency.

It should be noted that the presented solution uses standard DDR RAM memory for storing all the images and the results of operations. This enables easy access to the data for other coprocessors or ARM cores performing further processing. Another advantage is the relatively high capacity of the memory (hundreds of megabytes) and its low cost in comparison to other available solutions. In addition, the system is based on commonly used communication interfaces, like AXI4-Stream and AXI4-Lite, which eases the integration of the presented solution with other systems.

7. Conclusions and Future Work

This paper presents an architecture for real-time image feature detection, description and matching implemented in programmable hardware. The presented solution is supplemented with an underlying communication infrastructure based on standard interfaces, ensuring bottleneck-free, DMA-based input and output data transfers, freeing the CPU in the SoC to perform other tasks. The architecture can be easily parametrized to meet the specific design goals by choosing a trade-off between processing speed and resource utilization. Moreover, while perfectly capable of working as a standalone solution, the design can be easily expanded or integrated as a part of a more complex system. The choice of algorithms results in a compact, high-performance, low-power architecture, comparing favourably with the state of the art in terms of processing speed, hardware utilization and operational characteristics.

Future work will focus on integrating the design as a part of a computer vision processing pipeline, performing a complete application starting from acquisition, through feature detection, description and matching, up to high-level processing, all integrated in a single chip. Considering the characteristic properties of the described system, possible candidates are applications in robot navigation and vision-based surveillance.

8. Acknowledgements

This research was financed by the Polish National Science Centre grant funded according to the decision DEC-2011/03/N/ST6/03022, which is gratefully acknowledged.

9. References

[1] Huiyu Zhou, Yuan Yuan, and Chunmei Shi. Object tracking using SIFT features and mean shift. Computer Vision and Image Understanding, 113(3):345–352, 2009. Special Issue on Video Analysis.

[2] Georg Nebehay and Roman Pflugfelder. Consensus-based matching and tracking of keypoints for object tracking. In Winter Conference on Applications of Computer Vision. IEEE, March 2014.

[3] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. ACM Transactions on Graphics, 25(3):835–846, July 2006.

[4] M.J. Westoby, J. Brasington, N.F. Glasser, M.J. Hambrey, and J.M. Reynolds. Structure-from-Motion photogrammetry: A low-cost, effective tool for geoscience applications. Geomorphology, 179(0):300–314, 2012.

[5] Matthew Brown and David G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74(1):59–73, 2007.

[6] F. Fraundorfer and D. Scaramuzza. Visual odometry: Part II: Matching, robustness, optimization, and applications. Robotics & Automation Magazine, IEEE, 19(2):78–90, June 2012.

[7] Kurt Konolige, Motilal Agrawal, and Joan Solà. Large-scale visual odometry for rough terrain. In Makoto Kaneko and Yoshihiko Nakamura, editors, Robotics Research, volume 66 of Springer Tracts in Advanced Robotics, pages 201–212. Springer Berlin Heidelberg, 2011.

[8] O. Hamdoun, F. Moutarde, B. Stanciulescu, and B. Steux. Person re-identification in multi-camera system by signature based on interest point descriptors collected on short video sequences. In Distributed Smart Cameras, 2008. ICDSC 2008. Second ACM/IEEE International Conference on, pages 1–6, Sept 2008.

[9] Ming-yu Chen and Alexander Hauptmann. MoSIFT: Recognizing human actions in surveillance videos (technical report). 2009.

[10] A.J. Davison, I.D. Reid, N.D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(6):1052–1067, June 2007.

[11] Rainer Kümmerle, Giorgio Grisetti, and Wolfram Burgard. Simultaneous parameter calibration, localization, and mapping. Advanced Robotics, 26(17):2021–2041, 2012.

[12] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, volume 1, pages 430–443, May 2006.

[13] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua. BRIEF: Computing a local binary descriptor very fast. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(7):1281–1298, July 2012.

[14] Song Wu and Michael Lew. Evaluation of salient point methods. In Proceedings of the 21st ACM International Conference on Multimedia, MM ’13, pages 685–688, New York, NY, USA, 2013. ACM.

[15] O. Miksik and K. Mikolajczyk. Evaluation of local detectors and descriptors for fast feature matching. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 2681–2684, Nov 2012.

[16] Adam Schmidt, Marek Kraft, Michał Fularz, and Zuzanna Domagala. The comparison of point feature detectors and descriptors in the context of robot navigation. Journal of Automation, Mobile Robotics & Intelligent Systems, 7(1), 2013.

[17] Hans Moravec. Visual mapping by a robot rover. In Proceedings of the 6th International Joint Conference on Artificial Intelligence, pages 599–601, August 1979.

[18] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of The Fourth Alvey Vision Conference, pages 147–151, 1988.

[19] Jianbo Shi and C. Tomasi. Good features to track. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1994, pages 593–600. IEEE, June 1994.

[20] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1):63–86, 2004.

[21] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[22] Edward Rosten and Tom Drummond. Fusing points and lines for high performance tracking. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1508–1515, Oct 2005.

[23] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.

[24] Motilal Agrawal, Kurt Konolige, and Morten Rufus Blas. CenSurE: Center surround extremas for realtime feature detection and matching. In David Forsyth, Philip Torr, and Andrew Zisserman, editors, Computer Vision - ECCV 2008, volume 5305 of Lecture Notes in Computer Science, pages 102–115. Springer Berlin Heidelberg, 2008.

[25] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571. IEEE, 2011.

[26] Stefan Leutenegger, Margarita Chli, and Roland Yves Siegwart. BRISK: Binary robust invariant scalable keypoints. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2548–2555. IEEE, 2011.

[27] Alejandro Nieto, D. López Vilarino, and V. Brea. Towards the optimal hardware architecture for computer vision. Machine Vision. InTech, 2011.

[28] Lars Struyf, Stijn De Beugher, Dong Hoon Van Uytsel, Frans Kanters, and Toon Goedemé. The battle of the giants: A case study of GPU vs FPGA optimisation for real-time image processing. Proceedings PECCS 2014, 1:112–119, 2014.

[29] C. Torres-Huitzil and M. Arias-Estrada. An FPGA architecture for high speed edge and corner detection. In Computer Architectures for Machine Perception, 2000. Proceedings. Fifth IEEE International Workshop on, pages 112–116, 2000.

[30] A. Benedetti and P. Perona. Real-time 2-D feature detection on a reconfigurable computer. In Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, pages 586–593, Jun 1998.

[31] V. Bonato, E. Marques, and G.A. Constantinides. A parallel hardware architecture for scale and rotation invariant feature detection. Circuits and Systems for Video Technology, IEEE Transactions on, 18(12):1703–1712, Dec 2008.

[32] Feng-Cheng Huang, Shi-Yu Huang, Ji-Wei Ker, and Yung-Chang Chen. High-performance SIFT hardware accelerator for real-time image feature extraction. Circuits and Systems for Video Technology, IEEE Transactions on, 22(3):340–351, March 2012.

[33] J. Svab, T. Krajnik, J. Faigl, and L. Preucil. FPGA-based speeded up robust features. In Proc. of IEEE International Conference on Technologies for Practical Robot Applications, TePRA 2009, pages 35–41, November 2009.

[34] M. Schaeferling and G. Kiefer. Flex-SURF: A flexible architecture for FPGA-based robust feature extraction for optical tracking systems. In Reconfigurable Computing and FPGAs (ReConFig), 2010 International Conference on, pages 458–463, December 2010.

[35] M. Schaeferling and G. Kiefer. Object recognition on a chip: A complete SURF-based system on a single FPGA. In Proc. of International Conference on Reconfigurable Computing and FPGAs (ReConFig), 2011, pages 49–54, December 2011.

[36] S.G. Fowers, Dah-Jye Lee, D.A. Ventura, and J.K. Archibald. The nature-inspired BASIS feature descriptor for UAV imagery and its hardware implementation. IEEE Transactions on Circuits and Systems for Video Technology, 23(5):756–768, May 2013.

[37] Raphael Njuguna. A survey of FPGA benchmarks. Technical report, CSE Department, Washington University in St. Louis, 2008.

[38] Marek Kraft, Adam Schmidt, and Andrzej Kasinski. High-speed image feature detection using FPGA implementation of FAST algorithm. Proc. 3rd Int. Conf. on Computer Vision Theory and Applications (VISAPP 2008), 1:174–179, 2008.

[39] K. Dohi, Y. Yorita, Y. Shibata, and K. Oguri. Pattern compression of FAST corner detection for efficient hardware implementation. In Field Programmable Logic and Applications (FPL), 2011 International Conference on, pages 478–481, September 2011.

[40] Chunmeng Bi and T. Maruyama. Real-time corner and polygon detection system on FPGA. In 22nd International Conference on Field Programmable Logic and Applications (FPL), 2012, pages 451–457, August 2012.

[41] Julio Dondo, Felix Villanueva, David Garcia, David Vallejo, Carlos Glez-Morcillo, and Juan Carlos Lopez. Distributed FPGA-based architecture to support indoor localisation and orientation services. Journal of Network and Computer Applications, 45(0):181–190, 2014.

[42] Jun-Seok Park, Hyo-Eun Kim, and Lee-Sup Kim. A 182mW 94.3fps in full HD pattern-matching based image recognition accelerator for embedded vision system in 0.13um CMOS technology. IEEE Transactions on Circuits and Systems for Video Technology, PP(99):1, 2012.

[43] Jianhui Wang, Sheng Zhong, Luxin Yan, and Zhiguo Cao. An embedded system-on-chip architecture for real-time visual detection and matching. Circuits and Systems for Video Technology, IEEE Transactions on, 24(3):525–538, March 2014.

[44] Hoon Heo, Jung-yong Lee, Kwang-yeob Lee, and Chan-ho Lee. FPGA-based implementation of FAST and BRIEF algorithm for object recognition. In TENCON 2013 - 2013 IEEE Region 10 Conference (31194), pages 1–4, Oct 2013.

[45] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, March 1986.

[46] Elmar Mair, Gregory D. Hager, Darius Burschka, Michael Suppa, and Gerhard Hirzinger. Adaptive and generic corner detection based on the accelerated segment test. In Proceedings of the 11th European Conference on Computer Vision: Part II, ECCV’10, pages 183–196, Berlin, Heidelberg, 2010. Springer-Verlag.

[47] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary Robust Independent Elementary Features. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision - ECCV 2010, volume 6314 of Lecture Notes in Computer Science, chapter 56, pages 778–792. Springer Berlin / Heidelberg, 2010.

[48] Xilinx Inc. UG473: 7 Series FPGAs Memory Resources User Guide, v1.11 edition, November 2014.

[49] Rainer Lienhart and Jochen Maydt. An extended set of Haar-like features for rapid object detection. In Proceedings of the International Conference on Image Processing, volume 1, pages 900–903, 2002.